A high-performance Ceph cluster has achieved a sustained 1 TiB/s sequential read throughput — the highest publicly reported to date — demonstrating that open-source distributed storage can match or surpass proprietary, high-cost systems in raw speed and scalability.

The milestone was the result of a carefully engineered combination of cutting-edge hardware, high-bandwidth networking, and meticulous software tuning to overcome critical bottlenecks.


From Legacy HDD to All-NVMe: The Path to Performance

The project began in 2023 with the goal of replacing a legacy HDD-based Ceph cluster with a 10 PB all-NVMe architecture designed for maximum performance. Working in collaboration with Clyso, engineers deployed 68 Dell PowerEdge R6615 nodes powered by AMD EPYC 9454P processors (48 cores / 96 threads), 192 GiB DDR5 RAM, dual Mellanox ConnectX-6 100 GbE network interfaces, and 10 × 15.36 TB enterprise NVMe drives per node.
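As a sanity check on the headline figure, the network fabric alone bounds what any configuration can deliver. The minimal back-of-envelope sketch below assumes both 100 GbE ports on every node carry client traffic and ignores protocol and replication overhead, so it is an upper bound rather than a prediction:

```python
# Back-of-envelope estimate of the cluster-wide network ceiling, using only the
# node count and NIC speeds quoted above. Ignores protocol overhead, replication
# traffic, and switch topology, so treat it strictly as an upper bound.

NODES = 68                 # Dell PowerEdge R6615 nodes
NICS_PER_NODE = 2          # dual Mellanox ConnectX-6 ports
NIC_GBIT = 100             # 100 GbE per port

node_gib_s = NICS_PER_NODE * NIC_GBIT / 8 / 1.073741824   # GiB/s per node
cluster_tib_s = node_gib_s * NODES / 1024                  # TiB/s cluster-wide

print(f"Per-node network ceiling : {node_gib_s:.1f} GiB/s")
print(f"Cluster network ceiling  : {cluster_tib_s:.2f} TiB/s")
# ~23.3 GiB/s per node and ~1.55 TiB/s cluster-wide, which puts the measured
# 1.025 TiB/s read result at roughly two thirds of the raw line rate.
```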

The final build comprised 630 OSDs spread across 17 racks, running Ceph Quincy v17.2.7 on Ubuntu 20.04.6. The network fabric — already optimized for mission-critical workloads — played a decisive role in enabling the record-breaking performance.


Three Bottlenecks and How They Were Crushed

Early performance tests fell short of expectations. Deep analysis revealed three major constraints (a brief host-level check sketch follows the list):

  1. CPU Latency from C-States
    Power-saving modes in the processors introduced latency penalties. Disabling C-states in the BIOS yielded an immediate 10–20% performance boost.
  2. IOMMU Contention
    The kernel was spending excessive time managing DMA mappings for NVMe devices. Turning off IOMMU released additional performance headroom.
  3. Unoptimized RocksDB Compilation
    Stock Ceph builds on Debian/Ubuntu did not include the correct compiler flags for RocksDB. Recompiling with optimized flags tripled compaction speeds and doubled 4K random write performance.
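The first two fixes are host-level settings that can be inspected before touching the BIOS or kernel command line. The sketch below is a read-only, hypothetical check, assuming a standard Linux sysfs/procfs layout; exact paths and C-state names vary by kernel and platform:

```python
# Quick host-level check for the C-state and IOMMU settings discussed above.
# Read-only: it inspects /proc/cmdline and the cpuidle sysfs tree and reports
# what it finds; the actual changes are made in the BIOS or on the kernel
# command line, not here.

import glob
from pathlib import Path

def check_cstates() -> None:
    # Each cpuidle state directory exposes a human-readable name; anything
    # deeper than C1 suggests deep C-states are still reachable.
    states = sorted(glob.glob("/sys/devices/system/cpu/cpu0/cpuidle/state*/name"))
    names = [Path(p).read_text().strip() for p in states]
    print("cpuidle states on cpu0:", names or "none (cpuidle disabled)")

def check_iommu() -> None:
    print("kernel cmdline:", Path("/proc/cmdline").read_text().strip())
    # A populated /sys/class/iommu usually means the IOMMU is still active.
    iommu_dir = Path("/sys/class/iommu")
    devices = [d.name for d in iommu_dir.glob("*")] if iommu_dir.exists() else []
    print("IOMMU devices exposed:", devices or "none")

if __name__ == "__main__":
    check_cstates()
    check_iommu()
```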

Record-Setting Metrics

Once tuned, the cluster delivered:

  • 1.025 TiB/s sequential read throughput (4 MB block size, 3× replication)
  • 270 GiB/s sequential write throughput (3× replication)
  • 25.5 million IOPS for 4K random reads
  • With erasure coding (6+2), over 500 GiB/s sequential reads and 387 GiB/s writes

The key was balancing the number of clients with OSD scaling, optimizing placement groups (PGs), threads, and shard configurations to avoid “laggy” states under extreme load.
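Placement-group sizing is one of those knobs. As a rough illustration, the sketch below applies the commonly cited rule of thumb of about 100 PGs per OSD divided by the replica count, rounded to a power of two; in practice the pg_autoscaler usually handles this, and the value shown is not necessarily what this cluster used:

```python
# Rule-of-thumb PG count: (OSDs * target PGs per OSD) / replica size,
# rounded to the nearest power of two, as Ceph expects for pg_num.

import math

def suggested_pg_num(num_osds: int, replica_size: int,
                     target_pgs_per_osd: int = 100) -> int:
    raw = num_osds * target_pgs_per_osd / replica_size
    lower = 2 ** math.floor(math.log2(raw))
    upper = lower * 2
    return lower if raw - lower <= upper - raw else upper

if __name__ == "__main__":
    # 630 OSDs with 3x replication, as in the cluster described above.
    print(suggested_pg_num(630, 3))   # -> 16384
```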


Why This Matters for Ceph at Scale

For years, Ceph has been associated with flexibility and cost efficiency, but not necessarily with the top-tier performance seen in specialized storage appliances. This deployment proves that — in the right hands — Ceph can deliver unprecedented throughput and IOPS without sacrificing its open-source advantages.

It also highlights the importance of pairing the right hardware architecture with deep operational expertise, especially for large-scale AI, HPC, or cloud workloads where predictable high performance is critical.


Industry Perspective: Stackscale on Ceph and Enterprise Storage Strategy

David Carrero, co-founder of Stackscale (part of Grupo Aire), emphasizes that Ceph’s true potential lies in its adaptability to real-world enterprise requirements.

“At Stackscale, we enable our customers to deploy Ceph on dedicated infrastructure, either integrated with Proxmox environments or as part of fully customized high-performance architectures. In addition to Ceph, we also offer advanced network storage solutions based on NetApp technology, including a synchronous replication setup between two Madrid data centers with RTO=0 and RPO=0. This combination allows us to deliver the ideal balance of performance, resilience, and availability, tailored to each client’s needs.”

Carrero notes that well-implemented Ceph is a strategic enabler for organizations seeking technological independence, full control over their data, and optimized cost structures without sacrificing performance.


Key Technical Specs

Metric                      3× Replication    EC 6+2
Sequential Read (4 MB)      1.025 TiB/s       547 GiB/s
Sequential Write (4 MB)     270 GiB/s         387 GiB/s
4K Random Read              25.5 M IOPS       3.4 M IOPS
4K Random Write             4.9 M IOPS        936 K IOPS
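Dividing the replicated-pool figures by the 630 OSDs gives coarse per-device averages, often a handier number when sizing smaller clusters. A simple arithmetic sketch, using only the values from the table above:

```python
# Per-OSD averages derived from the 3x replication column of the table above.
# These are client-visible averages across 630 OSDs, before backend replication
# amplification, not per-drive limits.

NUM_OSDS = 630

read_gib_s  = 1.025 * 1024          # 1.025 TiB/s expressed in GiB/s
write_gib_s = 270.0
read_iops   = 25_500_000

print(f"Read  per OSD: {read_gib_s / NUM_OSDS:.2f} GiB/s")
print(f"Write per OSD: {write_gib_s / NUM_OSDS:.2f} GiB/s")
print(f"4K read IOPS per OSD: {read_iops / NUM_OSDS:,.0f}")
# Roughly 1.7 GiB/s and ~40,000 4K read IOPS per OSD on average.
```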

Beyond the Benchmark: The Road Ahead

While the numbers are impressive, engineers see room for improvement — particularly in scaling write performance further and increasing per-node IOPS ceilings. This could involve deeper changes to Ceph’s OSD internals and more intelligent data placement strategies.

For the enterprise market, the lesson is clear: open-source distributed storage is no longer just a cost-saving alternative. In the right configurations, it is a performance leader.


FAQs

1. What is Ceph and why is it significant?
Ceph is an open-source, software-defined storage platform providing block, object, and file services with high scalability and fault tolerance.

2. Why were AMD EPYC processors chosen?
Their high core counts, DDR5 bandwidth, and energy efficiency made them well suited to driving ten NVMe-backed OSDs per node with headroom to spare.

3. What is Stackscale’s synchronous NetApp storage solution?
A real-time replication service between two Madrid data centers ensuring zero downtime (RTO=0) and zero data loss (RPO=0).

4. Can Ceph be combined with other enterprise storage systems?
Yes. Stackscale supports hybrid designs, integrating Ceph with high-availability NetApp storage for optimal flexibility and resilience.

Via: ceph.io and Revista Cloud
