How OpenAI Handles 800 Million ChatGPT Users on a Single PostgreSQL Primary

At a time when “scale” is often synonymous with sharding everything and building increasingly complex distributed systems, OpenAI is describing a notably restrained database architecture behind ChatGPT: a single PostgreSQL primary for writes and roughly 50 read replicas spread across regions.

In a recently published engineering write-up, OpenAI explains how it stretched that setup to support 800 million ChatGPT users, leaning less on exotic technology and more on operational discipline: aggressive read offloading, connection pooling, cache coordination, and careful workload isolation. The company’s point is not that PostgreSQL is limitless, but that a mature relational database can go much farther than many teams assume—especially when the workload is predominantly read-heavy and the “blast radius” of spikes is actively contained.

Why a simple architecture becomes fragile at scale

The core risk with “one primary + many replicas” is obvious: when everything depends on one write node, any upstream turbulence can cascade into the data layer.

OpenAI highlights several recurring triggers that can turn an otherwise stable system into an incident:

  • Cache misses at scale: if a popular cache key expires or gets invalidated, thousands of requests can stampede the database at once.
  • Expensive queries: a small number of multi-table joins can saturate CPU and push latency into timeouts for unrelated traffic.
  • Feature launches: new capabilities often produce short-lived bursts of write-heavy activity that a system designed around reads may struggle to absorb.

In other words, the architecture can be “simple,” but the failure modes are not—unless the platform is engineered to keep spikes from synchronizing into a single, system-wide surge.

The write problem: why OpenAI starts pushing shardable workloads out

OpenAI’s write-up frames reads as a largely “solved” scaling problem (add replicas), while acknowledging that writes are the harder ceiling for a single-primary design.

One reason is how PostgreSQL handles concurrency: updates create new row versions under MVCC (multi-version concurrency control) rather than performing in-place rewrites. That can lead to:

  • Write amplification (more data churn per logical write)
  • More cleanup pressure (vacuuming / dead tuple cleanup)
  • Secondary read penalties when the system is under sustained write-heavy load
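
The write-up itself doesn’t include code, but the cleanup pressure it describes is visible directly in PostgreSQL’s statistics views. A minimal monitoring sketch, assuming psycopg2 as the client library and a placeholder DSN, that surfaces the tables accumulating the most dead tuples:

```python
import psycopg2  # assumed client library; any PostgreSQL driver would do

# Placeholder DSN -- point this at the node you want to inspect.
conn = psycopg2.connect("postgresql://monitor:secret@primary:5432/appdb")

with conn.cursor() as cur:
    # pg_stat_user_tables exposes MVCC churn: every UPDATE or DELETE leaves
    # behind dead row versions that autovacuum must eventually clean up.
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10
    """)
    for relname, live, dead, last_vacuum in cur.fetchall():
        print(f"{relname}: {dead} dead / {live} live tuples, last autovacuum {last_vacuum}")

conn.close()
```

Tables that stay near the top of this list are the ones where sustained write-heavy load turns into vacuum work and, eventually, read penalties.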

OpenAI’s response, as described, is pragmatic: identify workloads that can be cleanly partitioned and move those shardable, write-intensive components to a sharded database—with the write-up and commentary noting Azure’s Cosmos DB as the destination for those pieces. Existing, highly relational or “hard to shard” domains can remain on PostgreSQL, while newer workloads are encouraged—sometimes “forced”—to start life on the sharded path.

The playbook: how they reduce pressure on the primary

OpenAI’s approach combines multiple “small” interventions that add up to meaningful headroom.

1) Fix redundant writes and adopt “lazy writes”

The simplest gains often come from removing unnecessary work. OpenAI describes finding bugs that caused redundant writes and eliminating them.

They also describe moving toward lazy/asynchronous writes where real-time persistence is not required—batching or deferring writes so the database sees fewer bursts and more predictable write patterns.
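
The write-up doesn’t show how the deferral is implemented; a minimal sketch of the idea, with a hypothetical usage_events table and psycopg2 assumed as the driver, buffers non-critical writes in the application and flushes them as periodic batches:

```python
import threading
import time

import psycopg2  # assumed driver

class LazyWriter:
    """Buffer non-critical writes and flush them in periodic batches, so the
    primary sees predictable bulk inserts instead of per-request bursts."""

    def __init__(self, dsn, flush_interval=5.0, max_batch=500):
        self._conn = psycopg2.connect(dsn)
        self._buffer = []
        self._lock = threading.Lock()
        self._interval = flush_interval
        self._max_batch = max_batch
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def record(self, user_id, event):
        # Hot path: append to an in-memory buffer, never touch the database.
        with self._lock:
            self._buffer.append((user_id, event))

    def _flush_loop(self):
        while True:
            time.sleep(self._interval)
            with self._lock:
                batch = self._buffer[:self._max_batch]
                self._buffer = self._buffer[self._max_batch:]
            if not batch:
                continue
            with self._conn.cursor() as cur:
                # Hypothetical table; one multi-row write per flush interval.
                cur.executemany(
                    "INSERT INTO usage_events (user_id, event) VALUES (%s, %s)",
                    batch,
                )
            self._conn.commit()
```

The trade-off is explicit: anything routed through a buffer like this is only eventually persisted and can be lost on a crash, which is why the pattern applies only where real-time persistence is not required.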

2) Rate limiting to protect the database

When the platform experiences sudden demand spikes—especially around launches—OpenAI describes using strict rate limits to prevent the primary from being overwhelmed by short-term write storms. This is less about maximizing throughput and more about preserving system stability and keeping the failure mode from becoming catastrophic.
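
The write-up doesn’t specify the limiter, but the standard building block for this kind of protection is a token bucket in front of the write path. A minimal in-process sketch (the numbers and helper names are illustrative, not OpenAI’s):

```python
import threading
import time

class TokenBucket:
    """Token-bucket limiter placed in front of the write path: short spikes
    drain the burst allowance, sustained storms get rejected or deferred."""

    def __init__(self, rate_per_sec, burst):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.updated = float(burst), time.monotonic()
        self.lock = threading.Lock()

    def allow(self):
        with self.lock:
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at the burst size.
            self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False

# Illustrative numbers: allow ~200 writes/s with a burst of 400, and shed
# (or defer) anything beyond that rather than passing it to the primary.
write_limiter = TokenBucket(rate_per_sec=200, burst=400)

def save_if_allowed(write_fn):
    if write_limiter.allow():
        return write_fn()
    raise RuntimeError("write rejected to protect the primary")  # or enqueue for later
```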

3) Query rewrites: breaking “monster joins” into safer patterns

OpenAI calls out the familiar lesson: one bad query can be a platform-wide tax. The example discussed is a query joining 12 tables—the kind of operation that can crush CPU if it becomes hot.

Their mitigation is also familiar at scale: move parts of the logic into the application layer and replace a single massive join with multiple reads that can be:

  • served from replicas,
  • served from cache, or
  • executed with better isolation and control.

They also emphasize auditing ORM-generated queries rather than trusting them blindly—watching for patterns such as inefficient access paths, poor index usage, or classic “N+1” behavior.
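
As a hedged illustration of the rewrite pattern (table names, columns, and the psycopg2 driver are assumptions, not details from the write-up), a wide join can be replaced with two narrow reads, the second batched so it doesn’t degenerate into N+1 queries:

```python
import psycopg2  # assumed driver

def conversations_with_titles(replica_dsn, user_id):
    """Rewrite of a wide join as two narrow reads. Both queries can run on a
    replica (or be served from cache), and the second is batched to avoid
    N+1 lookups. Table and column names are hypothetical."""
    conn = psycopg2.connect(replica_dsn)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, title_id FROM conversations WHERE user_id = %s",
            (user_id,),
        )
        conversations = cur.fetchall()

        # One batched lookup for all referenced titles, not one query per row.
        title_ids = [title_id for _, title_id in conversations]
        cur.execute(
            "SELECT id, text FROM titles WHERE id = ANY(%s)",
            (title_ids,),
        )
        titles = dict(cur.fetchall())
    conn.close()
    return [(conv_id, titles.get(title_id)) for conv_id, title_id in conversations]
```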

Read scaling: replicas, but with real constraints

OpenAI’s architecture leans heavily on the idea that most production traffic is read-heavy, and that the easiest way to scale reads is to add replicas.

However, they also highlight a constraint that often gets ignored in “just add replicas” advice: replication isn’t free. Even if replicas are read-only, the primary must still stream changes to them, which consumes resources and can impose practical limits on how many replicas can be supported directly.

To extend beyond that ceiling, OpenAI describes testing cascading replication: instead of every replica streaming from the primary, intermediate replicas receive changes from the primary and then fan out to additional replicas. The upside is the ability to scale read capacity further without putting all replication load on the primary. The trade-off is operational complexity, especially around failovers and consistency behaviors.
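
In PostgreSQL, cascading replication is configured on the standbys themselves: each downstream replica’s primary_conninfo points at an intermediate replica rather than at the primary, so the primary streams to only a handful of direct clients. A small monitoring sketch, assuming psycopg2 and a placeholder DSN, shows who is streaming directly from a given node and how far behind it is:

```python
import psycopg2  # assumed driver

def streaming_clients(dsn):
    """List the nodes replicating directly from the server at `dsn` and their
    lag. Run against the primary, it shows the first replication tier; run
    against an intermediate replica, it shows the replicas cascading from it."""
    conn = psycopg2.connect(dsn)
    with conn.cursor() as cur:
        cur.execute("""
            SELECT application_name, client_addr, state, replay_lag
            FROM pg_stat_replication
        """)
        rows = cur.fetchall()
    conn.close()
    return rows

for name, addr, state, lag in streaming_clients("postgresql://monitor@primary:5432/postgres"):
    print(f"{name} ({addr}): {state}, replay lag {lag}")
```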

The plumbing that matters: PgBouncer and cache locking

Two of the most concrete techniques described are not specific to OpenAI—but the write-up underscores how essential they become at this scale.

PgBouncer for connection pooling

PostgreSQL connections are expensive, and large fleets of application servers can create connection storms—especially when services restart, autoscale, or experience cascading timeouts. OpenAI describes using PgBouncer as a proxy layer so that many client sessions can reuse a smaller pool of database connections.

As described, this improves both stability and latency: it avoids repeated connection-setup overhead and ensures the database isn’t forced to manage an unbounded number of concurrent sessions.
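
From the application’s point of view the change is small: connect to PgBouncer’s listening port (commonly 6432) instead of PostgreSQL directly, and let the proxy multiplex client sessions onto a much smaller pool of server connections. A minimal sketch with a placeholder DSN and psycopg2 assumed as the driver:

```python
import psycopg2  # assumed driver; the application code itself is unchanged

# Placeholder DSN: the app points at PgBouncer (commonly port 6432) rather
# than at PostgreSQL (port 5432), and PgBouncer maps these client sessions
# onto a much smaller pool of real server connections.
POOLED_DSN = "postgresql://app_user:secret@pgbouncer.internal:6432/appdb"

conn = psycopg2.connect(POOLED_DSN)
with conn.cursor() as cur:
    cur.execute("SELECT 1")   # each transaction borrows a pooled server connection
    cur.fetchone()
conn.commit()
conn.close()  # frees the client slot; the server connection stays in the pool
```

One caveat worth noting: in PgBouncer’s transaction-pooling mode, a server connection is held only for the duration of a transaction, so session-level state such as SET commands or session-scoped prepared statements cannot be relied on across transactions.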

Cache locking to stop thundering herds

To prevent cache misses from stampeding the database, OpenAI describes cache locking: when many requests miss the same cache key, only one request is allowed to go to PostgreSQL to repopulate the value while others wait. This “single flight” pattern keeps a momentary cache gap from becoming a massive read surge against the database.
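
A minimal in-process sketch of the single-flight pattern (a production version would use a distributed lock, since misses arrive across many application servers, and would add TTLs and lock timeouts):

```python
import threading

_cache = {}
_key_locks = {}
_guard = threading.Lock()

def get_or_load(key, load_from_db):
    """Single-flight cache fill: on a miss, only one caller runs the database
    query for a given key; concurrent callers block and reuse its result."""
    if key in _cache:
        return _cache[key]

    with _guard:
        lock = _key_locks.setdefault(key, threading.Lock())

    with lock:
        # Re-check after acquiring the lock: another caller may already have
        # repopulated the value while this one was waiting.
        if key not in _cache:
            _cache[key] = load_from_db(key)
        return _cache[key]
```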

Workload isolation: avoiding the noisy neighbor effect

A subtle but important tactic in OpenAI’s write-up is routing different use cases to specific replicas, rather than letting any request hit any node.

The goal is to prevent “noisy” traffic—new features, low-priority endpoints, exploratory workloads—from degrading critical paths. It’s a reminder that scaling isn’t just capacity; it’s control: who gets to consume that capacity, and under which conditions.
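
The write-up doesn’t describe the routing mechanism, so the following is only a sketch of the idea, with hypothetical workload names and replica DSNs: map each workload class to its own set of replicas and refuse to let it borrow capacity from the critical pool.

```python
import random

import psycopg2  # assumed driver

# Hypothetical workload classes and replica DSNs -- the point is the mapping,
# not the names: each class gets its own replicas, so noisy traffic cannot
# consume capacity reserved for the critical read path.
REPLICA_POOLS = {
    "chat_critical": ["postgresql://app@replica-1/appdb",
                      "postgresql://app@replica-2/appdb"],
    "new_features":  ["postgresql://app@replica-7/appdb"],
    "analytics":     ["postgresql://app@replica-9/appdb"],
}

def connect_for(workload):
    """Route a read to the replicas reserved for its workload class; an
    unknown class fails loudly instead of silently hitting the critical pool."""
    dsn = random.choice(REPLICA_POOLS[workload])
    return psycopg2.connect(dsn)
```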


source: Scaling PostgreSQL
