Your app takes off, CPU pins at 100%, memory burns, and one host is no longer enough. This is where every sysadmin turns into a systems architect: find the bottleneck, relieve it, repeat. The classic next step looks like the diagram above: a load balancer in front of multiple app servers (web tier), and a database with read replicas (data tier). Here’s how to get there without setting production on fire.


1) Two Ways to Grow

Strategy           | What it is                           | Pros                                          | Cons
Vertical scaling   | Add more CPU/RAM/IO to the same host | Operationally simple; no topology change      | Finite ceiling, costly hardware, bigger blast radius
Horizontal scaling | Add more servers                     | Resilience, elasticity, better unit economics | Complexity: balancing, state, consistency, networking

Rule of thumb: scale vertically as long as the ROI is obvious. When p95 latency and sustained CPU/RAM usage show you are hitting a physical ceiling, switch to horizontal.


2) The Load Balancer: Who Gets Each Request?

A Load Balancer (LB) answers “which server should handle this request?”

  • L4 (TCP/UDP): fast, content-agnostic (IPVS/LVS, HAProxy TCP, cloud NLB).
  • L7 (HTTP/HTTPS/gRPC): understands headers/paths, can do TLS termination, path/host routing, WAF (Nginx, HAProxy HTTP, Envoy, Traefik).

Common algorithms

  • Round-robin (simple, good default).
  • Least-connections (good when sessions vary).
  • Weighted (mix different instance sizes).
  • Consistent-hash (sticky by key; partition caches).
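To make the last algorithm concrete, here is a minimal sketch of a consistent-hash ring in Python (the server names are hypothetical, and real load balancers implement this internally with more tuning):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: a key maps to the nearest server clockwise."""
    def __init__(self, servers, vnodes=100):
        self.ring = []  # sorted list of (hash, server)
        for server in servers:
            for i in range(vnodes):  # virtual nodes smooth out the distribution
                self.ring.append((self._hash(f"{server}#{i}"), server))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key):
        idx = bisect.bisect(self.ring, (self._hash(key),)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["app1", "app2", "app3"])
assert ring.get("user:42") == ring.get("user:42")  # same key, same server
```

The appeal over plain modulo hashing: when a server joins or leaves, only the keys that hashed near it move, so partitioned caches stay mostly warm.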

Health & outlier detection

  • Active health checks (HTTP 200/3xx, TCP, gRPC).
  • Circuit breaking/outlier ejection to evict flapping instances.
  • Timeouts & bounded retries with backoff to avoid retry storms.

Sessions: sticky or stateless?

  • Aim for stateless apps (session in Redis or signed JWT/cookies).
  • If that's not yet possible, use stickiness (LB cookie or IP-hash). Be aware that this reduces true balancing and complicates failover.
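The stateless option is simpler than it sounds. A sketch of a signed session cookie, assuming an HMAC secret loaded from a secret manager (the `SECRET` constant and field names here are illustrative):

```python
import base64
import hashlib
import hmac
import json

SECRET = b"rotate-me"  # hypothetical; load from a secret manager, never hardcode

def sign_session(data: dict) -> str:
    """Serialize session data and append an HMAC so any server can verify it."""
    payload = base64.urlsafe_b64encode(json.dumps(data).encode()).decode()
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{sig}"

def verify_session(token: str):
    """Return the session data if the signature checks out, else None."""
    payload, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None  # tampered cookie: reject
    return json.loads(base64.urlsafe_b64decode(payload))

token = sign_session({"uid": 42})
assert verify_session(token) == {"uid": 42}
```

Because any app server can verify the cookie, the LB is free to send each request anywhere: no stickiness, no shared session store on the hot path.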

Minimal HAProxy (L7) example

frontend fe_https
  bind :443 ssl crt /etc/haproxy/certs/site.pem
  mode http
  option httplog
  default_backend be_app

backend be_app
  mode http
  balance leastconn
  option httpchk GET /health
  http-check expect status 200
  server app1 10.0.1.11:8080 check
  server app2 10.0.1.12:8080 check
  server app3 10.0.1.13:8080 check

LB high availability

  • Active/standby with VRRP/Keepalived (floating VIP) or anycast; in cloud, use managed ELB/ALB/NLB across AZs.
  • Practice failovers: VIP takeover, connection drain, state handling.

3) The Bottleneck Moves: Your Database

Balancing the web tier multiplies concurrent DB clients. If every instance hits one database, the bottleneck slides to the DB. Before throwing hardware at it, squeeze three knobs:

  1. Connection pooling & proxies
    • Postgres: pgBouncer (transaction mode for chatty apps).
    • MySQL/MariaDB: ProxySQL.
    • Too many connections hurt throughput—tune limits.
  2. Indexes & queries
    • EXPLAIN/ANALYZE, composite indexes, avoid N+1.
    • Watch locks, checkpoints, and vacuum (PG).
  3. Caching (next section).
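The pooling idea itself fits in a few lines. A toy sketch in Python, with a stand-in factory where real code would create DB connections (pgBouncer/ProxySQL do this out-of-process, with far more sophistication):

```python
import queue

class ConnectionPool:
    """Bounded pool: reuse a fixed set of connections instead of one per request."""
    def __init__(self, factory, size=5):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            self._pool.put(factory())  # open all connections up front

    def acquire(self, timeout=1.0):
        # Blocks when every connection is busy, which caps load on the DB
        return self._pool.get(timeout=timeout)

    def release(self, conn):
        self._pool.put(conn)

# Demo with a stand-in factory (a real one would call e.g. psycopg.connect)
pool = ConnectionPool(object, size=1)
conn = pool.acquire()
pool.release(conn)
assert pool.acquire() is conn  # the same connection is reused, not recreated
pool.release(conn)
```

The `acquire` timeout is the point: a bounded pool turns "the DB melts under 5,000 connections" into "some requests wait briefly", which is the failure mode you want.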

When you still need more, add replication:

Primary + Read Replicas

  • Primary handles writes;
  • Replicas serve reads.

Sync model

  • Asynchronous (typical): low latency, but replication lag means eventual consistency.
  • Synchronous: near-zero RPO, higher latency and stall risk.

Dealing with lag (read-your-writes)

  • Route post-write reads for that user to primary for a short window.
  • LSN/GTID-based reads: app carries last seen LSN/GTID; a proxy only uses replicas that have caught up (PG: pg_last_wal_replay_lsn()).
  • Clear read/write routing in the pool; block non-safe queries on replicas.
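The first technique above is easy to sketch. A hypothetical router that pins a user's reads to the primary for a fixed window after any write (the window length is an assumption; LSN/GTID tracking replaces the guess with an exact check):

```python
import time

REPLICATION_WINDOW = 2.0  # seconds we assume replicas may lag (tune to your lag metrics)
last_write_at = {}        # user_id -> monotonic timestamp of their last write

def route(user_id, is_write):
    """Send post-write reads for a user to the primary for a short window."""
    now = time.monotonic()
    if is_write:
        last_write_at[user_id] = now
        return "primary"
    if now - last_write_at.get(user_id, float("-inf")) < REPLICATION_WINDOW:
        return "primary"  # their write may not have reached replicas yet
    return "replica"

assert route(1, is_write=True) == "primary"
assert route(1, is_write=False) == "primary"  # inside the lag window
assert route(2, is_write=False) == "replica"  # no recent write by this user
```

This is per-user, not global: everyone else still gets replica reads, so one write doesn't drag the whole read load back to the primary.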

Primary failover

  • Orchestrate with Patroni, repmgr (PG) or orchestrator (MySQL).
  • Decide RTO/RPO; prevent split-brain.

4) Cache: Your Best Defense Against Saturation

Where to cache

  • CDN/edge for static and cacheable API responses.
  • Reverse proxy (Nginx/Envoy) for page or fragment caching.
  • App-level cache with Redis/Memcached for objects and query results.

Patterns

  • Cache-aside (lazy-load): app checks cache, falls back to DB, then populates cache. Most flexible.
  • Write-through: write to DB and cache—better consistency, slower writes.
  • Write-behind: write to cache, persist later—faster writes, higher risk.
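The write-through pattern is the one that is most often described and least often shown. A minimal sketch with in-memory dicts standing in for the database and Redis (names and structure are illustrative):

```python
db, cache = {}, {}  # stand-ins for the real database and Redis

def write_through(uid, row):
    # Write-through: update the source of truth and the cache in the same call,
    # so a successful write is never followed by a stale cached read.
    db[uid] = row
    cache[f"user:{uid}"] = row  # real code: redis.setex(key, ttl, serialize(row))

def read(uid):
    # Reads prefer the cache and fall back to the DB
    return cache.get(f"user:{uid}") or db.get(uid)

write_through(1, {"name": "Ada"})
assert read(1) == {"name": "Ada"}
```

The cost is visible in the sketch: every write now touches two systems, and you must decide what happens when the second update fails.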

TTL & invalidation

  • Use sensible TTLs; for strict consistency, explicitly invalidate (pub/sub, tags).
  • Choose eviction (LRU/LFU) based on access pattern; watch for hot keys (consider sharding).

Dogpile/stampede protection

  • Prevent a thundering herd when a hot key expires: locking (e.g., SETNX + expiry) or request coalescing at the proxy.
  • Early refresh before TTL hits zero.
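A sketch of the locking approach, with dicts emulating Redis so the example runs standalone; in production the lock line would be an atomic `SET key 1 NX EX 5` against Redis:

```python
import time

cache, locks = {}, {}  # in-memory stand-ins for Redis data and rebuild locks

def get_with_guard(key, loader):
    """Cache-aside read with a rebuild lock so only one caller hits the DB."""
    value = cache.get(key)
    if value is not None:
        return value
    # Emulates Redis `SET lock:{key} 1 NX EX 5`: only the first caller wins
    if locks.get(key):
        time.sleep(0.05)       # lost the race: back off briefly, then re-read
        return cache.get(key)  # usually populated by the winner by now
    locks[key] = True
    try:
        value = loader()       # the expensive query runs once, not N times
        cache[key] = value     # real code: SETEX with a TTL
        return value
    finally:
        locks.pop(key, None)   # real code: the lock's EX expiry is the safety net

calls = []
def load_profile():
    calls.append(1)            # count how often the "DB" is actually hit
    return {"name": "Ada"}

assert get_with_guard("profile:1", load_profile) == {"name": "Ada"}
assert get_with_guard("profile:1", load_profile) == {"name": "Ada"}
assert len(calls) == 1         # second call was a cache hit
```

The lock's expiry matters as much as the lock: if the winner crashes mid-rebuild, the expiry frees the key so the next caller can retry instead of waiting forever.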

Cache-aside pseudo-code

def get_user(uid):
    key = f"user:{uid}"
    blob = redis.get(key)
    if blob:  # cache hit: skip the database entirely
        return deserialize(blob)

    # Cache miss: fall back to the DB, then populate the cache
    row = db.query("SELECT * FROM users WHERE id=%s", uid)
    if row:
        redis.setex(key, 300, serialize(row))  # 5-minute TTL
    return row

5) When Replicas Aren’t Enough: Sharding & Partitioning

If the DB (or a single table) explodes in size/IO, consider partitioning:

  • Sharding by tenant/user (hash or range): each shard is a smaller DB with its own primary/replicas.
  • Native partitioning by date/range (PG native, MySQL has limits).
  • CQRS: separate read and write models.
  • Queues/events (Kafka/RabbitMQ/SQS) for heavy asynchronous work; ensure idempotency (Outbox pattern).

Caveat: sharding complicates joins, transactions, and multi-key ops. Don’t go there until you’ve exhausted replicas + cache + indexes.
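If you do go there, the routing layer is conceptually tiny. A hypothetical hash-sharding sketch (shard names are placeholders for real DSNs):

```python
import hashlib

SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]  # hypothetical DSNs

def shard_for(user_id: int) -> str:
    # Stable digest so the same user always lands on the same shard.
    # Avoid Python's built-in hash() here: it is salted per process.
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]

assert shard_for(42) == shard_for(42)  # deterministic routing
```

Note the trap hidden in the modulo: adding a fifth shard remaps most users, which is why resharding plans (or consistent hashing, as in the LB section) need to exist before the first shard does.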


6) Resilience Patterns: Timeouts, Retries, Backoff, Breakers

  • Timeouts everywhere (client, LB, app, DB, cache). Without timeouts, failures become stuck threads.
  • Retries with exponential backoff + jitter; never retry non-idempotent operations (or use idempotency keys).
  • Circuit breakers: open the circuit after repeated failures to protect downstream systems.
  • Rate limiting and WAF at L7.
  • Queues to absorb spikes (buffering).
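The second bullet combines three ideas (exponential backoff, jitter, a retry budget) that are easy to get subtly wrong. A minimal sketch for an idempotent operation, with hypothetical tuning values:

```python
import random
import time

def retry(op, attempts=4, base=0.1, cap=2.0):
    """Retry an idempotent operation with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the failure
            # Full jitter: a random delay in [0, min(cap, base * 2^attempt)]
            # spreads clients out so their retries don't synchronize into a storm
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

fails = {"n": 2}
def flaky():
    """Stand-in for a network call that fails twice, then succeeds."""
    if fails["n"] > 0:
        fails["n"] -= 1
        raise ConnectionError("transient")
    return "ok"

assert retry(flaky) == "ok"
```

The `cap` and the bounded `attempts` are the anti-storm controls: without them, a long outage turns every client into an ever-growing queue of pending retries.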

7) Observability & Capacity: No Metrics, No Scaling

Define SLOs (e.g., p95 < 300 ms, error rate < 1%). Instrument:

  • LB: RPS, backend errors, outlier ejections, latency.
  • App: p50/p95/p99 per endpoint, thread pool saturation.
  • DB: QPS, locks, slow queries, replication lag, active connections.
  • Cache: hit ratio, memory, evictions, hot keys.
  • Infra: CPU, IOPS, network, throttling.

Use distributed tracing (OTel/Jaeger/Tempo) across LB → app → DB/cache. Build dashboards that span layers; alert on trends, not only static thresholds.


8) Safe Deploys: Blue/Green and Canary

  • Blue/Green: two identical stacks; switch traffic via LB, instant rollback.
  • Canary: ship to 5% → 25% → 50% while watching SLOs and error budgets.
  • Auto-scaling: policies on CPU/RPS/latency; handle short spikes with warm-ups.

9) Security in the Architecture

  • TLS termination at LB, consider mTLS inside if your threat model demands it.
  • Secret management (no plain env), rotation.
  • Least privilege for DB/cache; segmented security groups.
  • Backups with tested RPO/RTO; DR across AZ/region.

10) Evolution Checklist

  1. Vertical scale; fix hot queries/code.
  2. LB + 2–3 app nodes; health checks, least-conn.
  3. Connection pooling (pgBouncer/ProxySQL).
  4. Redis cache (cache-aside) + CDN for static.
  5. Read replicas; clear read/write routing; read-your-writes.
  6. Full observability, SLOs, canary deploys.
  7. LB HA (VRRP/ELB) + auto-scaling.
  8. If needed: partitioning/sharding; queues for async.
  9. DR multi-AZ/region, regular failover drills.

A Final Field Note

Horizontal scale buys headroom, but every step introduces trade-offs: stickiness vs. stateless, latency vs. consistency, cost vs. simplicity. The real job is choosing deliberately what to sacrifice and measuring whether the choice pays off. Designing systems is exactly that: find today’s bottleneck and put in place what you need so tomorrow your app can keep growing—without your on-call going up in flames.
