When One Box Isn’t Enough: A Practical Playbook for Scaling with Load Balancers, Read Replicas, and Caching

Published 11/18/2025

X (Twitter) Facebook Pinterest LinkedIn Email WhatsApp

Your app takes off, CPU pins at 100%, memory burns, and one host is no longer enough. This is where every sysadmin turns into a systems architect: find the bottleneck, relieve it, repeat. The classic next step looks like the diagram above: a load balancer in front of multiple app servers (web tier), and a database with read replicas (data tier). Here’s how to get there without setting production on fire.

1) Two Ways to Grow

Strategy	What it is	Pros	Cons
Vertical scaling	Add more CPU/RAM/IO to the same host	Operationally simple; no topology change	Finite ceiling, costly hardware, bigger blast radius
Horizontal scaling	Add more servers	Resilience, elasticity, better unit economics	Complexity: balancing, state, consistency, networking

Rule of thumb: start vertical as long as the ROI is obvious. When your p95 latency plus sustained CPU/RAM show a physical ceiling, switch to horizontal.

2) The Load Balancer: Who Gets Each Request?

A Load Balancer (LB) answers “which server should handle this request?”

L4 (TCP/UDP): fast, content-agnostic (IPVS/LVS, HAProxy TCP, cloud NLB).
L7 (HTTP/HTTPS/gRPC): understands headers/paths, can do TLS termination, path/host routing, WAF (Nginx, HAProxy HTTP, Envoy, Traefik).

Common algorithms

Round-robin (simple, good default).
Least-connections (good when sessions vary).
Weighted (mix different instance sizes).
Consistent-hash (sticky by key; partition caches).

Health & outlier detection

Active health checks (HTTP 200/3xx, TCP, gRPC).
Circuit breaking/outlier ejection to evict flapping instances.
Timeouts & bounded retries with backoff to avoid retry storms.

Sessions: sticky or stateless?

Aim for stateless apps (session in Redis or signed JWT/cookies).
If not yet possible, use stickiness (LB cookie or IP-hash). Know this reduces true balancing and complicates failover.

Minimal HAProxy (L7) example

frontend fe_https
  bind :443 ssl crt /etc/haproxy/certs/site.pem
  mode http
  option httplog
  default_backend be_app

backend be_app
  mode http
  balance leastconn
  option httpchk GET /health
  http-check expect status 200
  server app1 10.0.1.11:8080 check
  server app2 10.0.1.12:8080 check
  server app3 10.0.1.13:8080 check

LB high availability

Active/standby with VRRP/Keepalived (floating VIP) or anycast; in cloud, use managed ELB/ALB/NLB across AZs.
Practice failovers: VIP takeover, connection drain, state handling.

3) The Bottleneck Moves: Your Database

Balancing the web tier multiplies concurrent DB clients. If every instance hits one database, the bottleneck slides to the DB. Before throwing hardware, squeeze three knobs:

Connection pooling & proxies
- Postgres: pgBouncer (transaction mode for chatty apps).
- MySQL/MariaDB: ProxySQL.
- Too many connections hurt throughput—tune limits.
Indexes & queries
- EXPLAIN/ANALYZE, composite indexes, avoid N+1.
- Watch locks, checkpoints, and vacuum (PG).
Caching (next section).

When you still need more, add replication:

Primary + Read Replicas

Primary handles writes;
Replicas serve reads.

Sync model

Asynchronous (typical): low latency, potential replication lag → eventual consistency.
Synchronous: near-zero RPO, higher latency and stall risk.

Dealing with lag (read-your-writes)

Route post-write reads for that user to primary for a short window.
LSN/GTID-based reads: app carries last seen LSN/GTID; a proxy only uses replicas that have caught up (PG: pg_last_wal_replay_lsn()).
Clear read/write routing in the pool; block non-safe queries on replicas.

Primary failover

Orchestrate with Patroni, repmgr (PG) or orchestrator (MySQL).
Decide RTO/RPO; prevent split-brain.

4) Cache: Your Best Defense Against Saturation

Where to cache

CDN/edge for static and cacheable API responses.
Reverse proxy (Nginx/Envoy) for page or fragment caching.
App-level cache with Redis/Memcached for objects and query results.

Patterns

Cache-aside (lazy-load): app checks cache, falls back to DB, then populates cache. Most flexible.
Write-through: write to DB and cache—better consistency, slower writes.
Write-behind: write to cache, persist later—faster writes, higher risk.

TTL & invalidation

Use sensible TTLs; for strict consistency, explicitly invalidate (pub/sub, tags).
Choose eviction (LRU/LFU) based on access pattern; watch for hot keys (consider sharding).

Dogpile/stampede protection

Prevent a thundering herd when a hot key expires: locking (e.g., SETNX + expiry) or request coalescing at the proxy.
Early refresh before TTL hits zero.

Cache-aside pseudo-code

def get_user(uid):
    key = f"user:{uid}"
    blob = redis.get(key)
    if blob:
        return deserialize(blob)

    row = db.query("SELECT * FROM users WHERE id=%s", uid)
    if row:
        redis.setex(key, 300, serialize(row))  # 5-minute TTL
    return row
Code language: PHP (php)

5) When Replicas Aren’t Enough: Sharding & Partitioning

If the DB (or a single table) explodes in size/IO, consider partitioning:

Sharding by tenant/user (hash or range): each shard is a smaller DB with its own primary/replicas.
Native partitioning by date/range (PG native, MySQL has limits).
CQRS: separate read and write models.
Queues/events (Kafka/RabbitMQ/SQS) for heavy asynchronous work; ensure idempotency (Outbox pattern).

Caveat: sharding complicates joins, transactions, and multi-key ops. Don’t go there until you’ve exhausted replicas + cache + indexes.

6) Resilience Patterns: Timeouts, Retries, Backoff, Breakers

Timeouts everywhere (client, LB, app, DB, cache). Without timeouts, failures become stuck threads.
Retries with exponential backoff + jitter; never retry non-idempotent operations (or use idempotency keys).
Circuit breakers: open the circuit after repeated failures to protect downstream systems.
Rate limiting and WAF at L7.
Queues to absorb spikes (buffering).

7) Observability & Capacity: No Metrics, No Scaling

Define SLOs (e.g., p95 < 300 ms, error rate < 1%). Instrument:

LB: RPS, backend errors, outlier ejections, latency.
App: p50/p95/p99 per endpoint, thread pool saturation.
DB: QPS, locks, slow queries, replication lag, active connections.
Cache: hit ratio, memory, evictions, hot keys.
Infra: CPU, IOPS, network, throttling.

Use distributed tracing (OTel/Jaeger/Tempo) across LB → app → DB/cache. Build dashboards that span layers; alert on trends, not only static thresholds.

8) Safe Deploys: Blue/Green and Canary

Blue/Green: two identical stacks; switch traffic via LB, instant rollback.
Canary: ship to 5% → 25% → 50% while watching SLOs and error budgets.
Auto-scaling: policies on CPU/RPS/latency; handle short spikes with warm-ups.

9) Security in the Architecture

TLS termination at LB, consider mTLS inside if your threat model demands it.
Secret management (no plain env), rotation.
Least privilege for DB/cache; segmented security groups.
Backups with tested RPO/RTO; DR across AZ/region.

10) Evolution Checklist

Vertical scale; fix hot queries/code.
LB + 2–3 app nodes; health checks, least-conn.
Connection pooling (pgBouncer/ProxySQL).
Redis cache (cache-aside) + CDN for static.
Read replicas; clear read/write routing; read-your-writes.
Full observability, SLOs, canary deploys.
LB HA (VRRP/ELB) + auto-scaling.
If needed: partitioning/sharding; queues for async.
DR multi-AZ/region, regular failover drills.

A Final Field Note

Horizontal scale buys headroom, but every step introduces trade-offs: stickiness vs. stateless, latency vs. consistency, cost vs. simplicity. The real job is choosing deliberately what to sacrifice and measuring whether the choice pays off. Designing systems is exactly that: find today’s bottleneck and put in place what you need so tomorrow your app can keep growing—without your on-call going up in flames.