For years, plenty of architecture discussions have leaned on a convenient idea: that SQL databases “don’t scale,” and that beyond a certain size the natural move is to abandon the relational model. In practice, what breaks first is rarely SQL itself. What breaks is the operational discipline around the database: schema design, query hygiene, connection limits, replica management, backup-and-restore posture, and the ability to absorb traffic spikes without turning them into incidents.
For systems administrators, the most useful reference cases aren’t the ones that brag about novelty. They’re the ones that demonstrate repeatable patterns. Shopify, for example, has described how it keeps its “distributed monolith” on MySQL, using a sharded architecture designed to scale under peaks like Black Friday / Cyber Monday: hundreds of shards, one writer per shard, and multiple replicas for reads and resiliency—deployed across thousands of virtual machines to sustain growth and campaigns.
In parallel, Meta has long taken a pragmatic approach: MySQL at the core, with selective changes where it matters. Its public fork includes MyRocks, a RocksDB-based storage engine intended to improve efficiency (especially under write-heavy workloads or when storage pressure is a major factor).
And for anyone who equates “scaling” with “distribution,” the ecosystem also offers SQL-first paths: Vitess, now under the CNCF umbrella, was created to scale MySQL horizontally (sharding, routing, and large-scale operations) and became widely adopted because it addressed that exact pain at high volume.
Even outside the marquee platforms—yet still operating under brutal production requirements—the lesson holds. Wikimedia documents its use of MariaDB in production, along with the operational tooling and practices that come with it.
For a sysadmin-focused outlet, the conclusion is less philosophical and more actionable: SQL scales if you operate it well. And “operate it well” isn’t a slogan—it’s a checklist of controls and decisions you can audit.
What Usually Scales First: Habits, Not Technology
In production, the jump from “it’s fine” to “it’s down” is rarely caused by the SQL engine alone. The usual triggers are familiar:
- Connection explosions (poorly sized pools, aggressive timeouts, retry storms).
- Unbounded query growth (missing indexes, expensive joins, N+1 patterns).
- Hotspots (a central table or a small set of rows concentrating writes).
- Backups that don’t restore (copies exist, but recovery is untested).
- Replication without visibility (hidden lag, manual promotions under pressure).
That’s why in “SQL vs. NoSQL” debates, sysadmins benefit from reframing the question: what pattern do I need to meet my SLOs and my operational model?
Comparison Table: SQL Patterns That Actually Work Under Real Load
| Operational need | Recommended pattern | Typical fit | Sysadmin advantages | Risks / trade-offs |
|---|---|---|---|---|
| More reads without changing the app | Read replicas + RW/RO separation | Light analytics, catalogs, feeds | Reduces primary load; incremental scaling | Replication lag; inconsistent reads if not designed carefully |
| Predictable spikes (campaigns) | Capacity planning + temporary scaling + caches | E-commerce, media, SaaS | Risk control; clear runbooks | Infra cost if overprovisioned |
| Sustained data/IO growth | Partitioning + archiving | Logs, events, long-term history | Keeps “hot” tables small; easier maintenance | Query complexity; partition ops overhead |
| Write ceiling on a single primary | Sharding by a business key | Multi-tenant, large catalogs | Parallelizes writes; reduces hotspots | Complexity for cross-shard transactions and global reporting |
| Large-scale ops (many shards) | Orchestration layer (e.g., Vitess) | High-volume platforms | Standardizes routing/ops; automates recurring work | Learning curve; more critical components |
| Cost/licensing + MySQL compatibility | MariaDB (case-dependent) | MySQL-compatible environments | Broad ecosystem; clustering options | Compatibility drift; internal standardization decisions |
What to Actually Copy from the Big Players
1) Separate performance from heroics.
Shopify doesn’t treat its database as a sacred artifact. It treats it as a system made of repeatable parts: shards, clear roles (writer/replicas), and a playbook that gets pressure-tested during major campaigns. In production terms, that means runbooks, rehearsals, and automation—so scalability doesn’t depend on “the one person who knows everything.”
2) Change only what you must.
MyRocks is a good example of pragmatism: it’s not “reinvent the database,” it’s “adjust the storage engine to relieve a specific bottleneck,” while keeping the SQL model.
3) Operate for the worst day, not the average day.
Architecture value isn’t proven on a quiet Tuesday; it’s proven the day traffic multiplies and the business won’t accept degradation. In that scenario, sysadmins need three things: observability (lag, QPS, latency, locks), admission control (rate limiting, queues, circuit breakers), and restore drills that are practiced—not theoretical.
A Note on “Magic Numbers”
You’ll often see claims like “millions of queries per second” attached to large platforms. They’re useful as a rough signal, but dangerous as an objective. The number depends on what counts as a query, the read/write mix, caching, and how measurement is instrumented. In real-world operations, the metrics that matter are different: p95/p99 latency under load, recovery time, and stability during change (migrations, failovers, reindexing).
Frequently Asked Questions
What are the clearest signs a SQL database is hitting operational limits?
Sustained increases in p95/p99 latency, CPU or IO saturation, rising replication lag, frequent locking, connection queueing, and maintenance windows that start causing noticeable performance degradation (backups, reindexing).
When does sharding make sense for MySQL or MariaDB?
When writes to a single primary become a structural bottleneck, or when hotspots can’t be solved with indexing, partitioning, caching, and tuning. It often fits multi-tenant architectures with a natural sharding key.
Does Vitess replace MySQL?
No. Vitess acts as an operational and scaling layer for MySQL (sharding, routing, large-scale operations), while MySQL remains the underlying storage engine.
Is MariaDB a drop-in replacement for MySQL 8 in production?
It depends on your feature set (SQL features, storage engines, replication behavior, and tooling). It can be viable in many scenarios, but compatibility and operational behavior should be validated in practice; Wikimedia’s production documentation is a useful reference for real-world patterns and tooling.
vía: Large-Scale SQL
