A high-performance Ceph cluster has achieved a sustained 1 TiB/s sequential read throughput — the highest publicly reported to date — demonstrating that open-source distributed storage can match or surpass proprietary, high-cost systems in raw speed and scalability.

The milestone was the result of a carefully engineered combination of cutting-edge hardware, high-bandwidth networking, and meticulous software tuning to overcome critical bottlenecks.


From Legacy HDD to All-NVMe: The Path to Performance

The project began in 2023 with the goal of replacing a legacy HDD-based Ceph cluster with a 10 PB all-NVMe architecture designed for maximum performance. Working in collaboration with Clyso, engineers deployed 68 Dell PowerEdge R6615 nodes powered by AMD EPYC 9454P processors (48 cores / 96 threads), 192 GiB DDR5 RAM, dual Mellanox ConnectX-6 100 GbE network interfaces, and 10 × 15.36 TB enterprise NVMe drives per node.
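As a sanity check on the headline figure, the network fabric alone bounds what any configuration can deliver. The minimal back-of-envelope sketch below assumes both 100 GbE ports on every node carry client traffic and ignores protocol and replication overhead, so it is an upper bound rather than a prediction:

```python
# Back-of-envelope estimate of the cluster-wide network ceiling, using only the
# node count and NIC speeds quoted above. Ignores protocol overhead, replication
# traffic, and switch topology, so treat it strictly as an upper bound.

NODES = 68                 # Dell PowerEdge R6615 nodes
NICS_PER_NODE = 2          # dual Mellanox ConnectX-6 ports
NIC_GBIT = 100             # 100 GbE per port

node_gib_s = NICS_PER_NODE * NIC_GBIT / 8 / 1.073741824   # GiB/s per node
cluster_tib_s = node_gib_s * NODES / 1024                  # TiB/s cluster-wide

print(f"Per-node network ceiling : {node_gib_s:.1f} GiB/s")
print(f"Cluster network ceiling  : {cluster_tib_s:.2f} TiB/s")
# ~23.3 GiB/s per node and ~1.55 TiB/s cluster-wide, which puts the measured
# 1.025 TiB/s read result at roughly two thirds of the raw line rate.
```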

The final build comprised 630 OSDs spread across 17 racks, running Ceph Quincy v17.2.7 on Ubuntu 20.04.6. The network fabric — already optimized for mission-critical workloads — played a decisive role in enabling the record-breaking performance.


Three Bottlenecks and How They Were Crushed

Early performance tests fell short of expectations. Deep analysis revealed three major constraints (a brief host-level check sketch follows the list):

  1. CPU Latency from C-States
    Power-saving modes in the processors introduced latency penalties. Disabling C-states in the BIOS yielded an immediate 10–20% performance boost.
  2. IOMMU Contention
    The kernel was spending excessive time managing DMA mappings for NVMe devices. Turning off IOMMU released additional performance headroom.
  3. Unoptimized RocksDB Compilation
    Stock Ceph builds on Debian/Ubuntu did not include the correct compiler flags for RocksDB. Recompiling with optimized flags tripled compaction speeds and doubled 4K random write performance.
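The first two fixes are host-level settings that can be inspected before touching the BIOS or kernel command line. The sketch below is a read-only, hypothetical check, assuming a standard Linux sysfs/procfs layout; exact paths and C-state names vary by kernel and platform:

```python
# Quick host-level check for the C-state and IOMMU settings discussed above.
# Read-only: it inspects /proc/cmdline and the cpuidle sysfs tree and reports
# what it finds; the actual changes are made in the BIOS or on the kernel
# command line, not here.

import glob
from pathlib import Path

def check_cstates() -> None:
    # Each cpuidle state directory exposes a human-readable name; anything
    # deeper than C1 suggests deep C-states are still reachable.
    states = sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*/name"))
    names = [Path(p).read_text().strip() for p in states]
    print("cpuidle states on cpu0:", names or "none (cpuidle disabled)")

def check_iommu() -> None:
    print("kernel cmdline:", Path("/proc/cmdline").read_text().strip())
    # A populated /sys/class/iommu usually means the IOMMU is still active.
    iommu_dir = Path("/sys/class/iommu")
    devices = [d.name for d in iommu_dir.glob("*")] if iommu_dir.exists() else []
    print("IOMMU devices exposed:", devices or "none")

if __name__ == "__main__":
    check_cstates()
    check_iommu()
```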

Record-Setting Metrics

Once tuned, the cluster delivered:

  • 1.025 TiB/s sequential read throughput (4 MB block size, 3× replication)
  • 270 GiB/s sequential write throughput (3× replication)
  • 25.5 million IOPS for 4K random reads
  • With erasure coding (6+2), over 500 GiB/s sequential reads and 387 GiB/s writes

The key was balancing the number of clients with OSD scaling, optimizing placement groups (PGs), threads, and shard configurations to avoid “laggy” states under extreme load.
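Placement-group sizing is one of those knobs. As a rough illustration, the sketch below applies the commonly cited rule of thumb of about 100 PGs per OSD divided by the replica count, rounded to a power of two; in practice the pg_autoscaler usually handles this, and the value shown is not necessarily what this cluster used:

```python
# Rule-of-thumb PG count: (OSDs * target PGs per OSD) / replica size,
# rounded to the nearest power of two, as Ceph expects for pg_num.

import math

def suggested_pg_num(num_osds: int, replica_size: int,
                     target_pgs_per_osd: int = 100) -> int:
    raw = num_osds * target_pgs_per_osd / replica_size
    lower = 2 ** math.floor(math.log2(raw))
    upper = lower * 2
    return lower if raw - lower <= upper - raw else upper

if __name__ == "__main__":
    # 630 OSDs with 3x replication, as in the cluster described above.
    print(suggested_pg_num(630, 3))   # -> 16384
```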


Why This Matters for Ceph at Scale

For years, Ceph has been associated with flexibility and cost efficiency, but not necessarily with the top-tier performance seen in specialized storage appliances. This deployment proves that — in the right hands — Ceph can deliver unprecedented throughput and IOPS without sacrificing its open-source advantages.

It also highlights the importance of pairing the right hardware architecture with deep operational expertise, especially for large-scale AI, HPC, or cloud workloads where predictable high performance is critical.


Industry Perspective: Stackscale on Ceph and Enterprise Storage Strategy

David Carrero, co-founder of Stackscale (part of Grupo Aire), emphasizes that Ceph’s true potential lies in its adaptability to real-world enterprise requirements.

“At Stackscale, we enable our customers to deploy Ceph on dedicated infrastructure, either integrated with Proxmox environments or as part of fully customized high-performance architectures. In addition to Ceph, we also offer advanced network storage solutions based on NetApp technology, including a synchronous replication setup between two Madrid data centers with RTO=0 and RPO=0. This combination allows us to deliver the ideal balance of performance, resilience, and availability, tailored to each client’s needs.”

Carrero notes that well-implemented Ceph is a strategic enabler for organizations seeking technological independence, full control over their data, and optimized cost structures without sacrificing performance.


Key Technical Specs

Metric                      3× Replication    EC 6+2
Sequential Read (4 MB)      1.025 TiB/s       547 GiB/s
Sequential Write (4 MB)     270 GiB/s         387 GiB/s
4K Random Read              25.5 M IOPS       3.4 M IOPS
4K Random Write             4.9 M IOPS        936 K IOPS
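Dividing the replicated-pool figures by the 630 OSDs gives coarse per-device averages, often a handier number when sizing smaller clusters. A simple arithmetic sketch, using only the values from the table above:

```python
# Per-OSD averages derived from the 3x replication column of the table above.
# These are client-visible averages across 630 OSDs, before backend replication
# amplification, not per-drive limits.

NUM_OSDS = 630

read_gib_s  = 1.025 * 1024          # 1.025 TiB/s expressed in GiB/s
write_gib_s = 270.0
read_iops   = 25_500_000

print(f"Read  per OSD: {read_gib_s / NUM_OSDS:.2f} GiB/s")
print(f"Write per OSD: {write_gib_s / NUM_OSDS:.2f} GiB/s")
print(f"4K read IOPS per OSD: {read_iops / NUM_OSDS:,.0f}")
# Roughly 1.7 GiB/s and ~40,000 4K read IOPS per OSD on average.
```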

Beyond the Benchmark: The Road Ahead

While the numbers are impressive, engineers see room for improvement — particularly in scaling write performance further and increasing per-node IOPS ceilings. This could involve deeper changes to Ceph’s OSD internals and more intelligent data placement strategies.

For the enterprise market, the lesson is clear: open-source distributed storage is no longer just a cost-saving alternative. In the right configurations, it is a performance leader.


FAQs

1. What is Ceph and why is it significant?
Ceph is an open-source, software-defined storage platform providing block, object, and file services with high scalability and fault tolerance.

2. Why were AMD EPYC processors chosen?
Their high core counts, DDR5 bandwidth, and energy efficiency made them well suited to driving ten NVMe-backed OSDs per node with headroom to spare.

3. What is Stackscale’s synchronous NetApp storage solution?
A real-time replication service between two Madrid data centers ensuring zero downtime (RTO=0) and zero data loss (RPO=0).

4. Can Ceph be combined with other enterprise storage systems?
Yes. Stackscale supports hybrid designs, integrating Ceph with high-availability NetApp storage for optimal flexibility and resilience.

Via: ceph.io and Revista Cloud
