The temptation to build the "proper" cluster on day one is strong — and almost always harmful. You pay in complexity for load that isn't there yet: orchestration, network latency, distributed bugs, costlier operations.
Start with metrics, not hunches
A single well-tuned server handles surprisingly much. Before scaling, set measurable metrics: p95 latency, CPU load, queue depth, DB response time. Decisions come from graphs, not from a feeling that "it's probably time".
Often it isn't the server that's slow but one heavy query without an index, or an N+1 in the ORM. Profiling can save you an entire cluster.
The order of scaling
First, go vertical (more CPU/RAM) and grab cheap wins: cache (Redis), indexes, a connection pool (PgBouncer), static on a CDN. Then a read replica. And only when you genuinely hit the ceiling — horizontal scaling and sharding.
Every new node must solve a measured problem, not a hypothetical one.
Horizontal scaling adds not just capacity but classes of problems: consistency, idempotency, distributed transactions. Take them on deliberately.
A pragmatic default for most products: one server in Docker, a DB replica, Redis and backups. That carries you to serious volumes — and by then you'll know exactly what hit the limit.