In a sector where every millisecond of latency and every megabyte per second counts, a recent Ceph performance analysis — based on extensive testing with the IO500 benchmark — has delivered revealing conclusions for system administrators and distributed storage architects. The study, conducted in a controlled environment with enterprise-grade hardware, identified which configurations and optimizations bring tangible improvements, which are irrelevant, and which changes can even be counterproductive.
Beyond the metrics, this work highlights the importance of validating any adjustment in a lab before moving it into production, avoiding the temptation to apply generic “recipes” without understanding the real impact on the specific workload.
Context: Ceph and the IO500 Benchmark
Ceph has established itself as one of the most versatile and robust distributed storage solutions on the market. Its ability to offer block, object, and file system storage (CephFS) makes it a common choice for high-performance environments, from corporate data centers to supercomputing.
To evaluate performance, the study used IO500, a benchmark recognized in the HPC (High-Performance Computing) community for measuring small and large I/O operations, both read and write, in a balanced manner. IO500 is widely used to rank storage systems on the IO500 list, published alongside the TOP500, and provides a comprehensive view of a system’s ability to handle real-world workloads.
Testing Methodology
The lab configured a Ceph cluster with storage nodes equipped with SSD/NVMe drives and high-speed networking. The baseline configuration included:
- High-performance CPU per node.
- 100 Gb/s internal network connectivity.
- Dedicated MDS (Metadata Server) nodes for CephFS.
- Between 12 and 24 OSDs per node depending on the scenario.
Tests were run multiple times, tweaking key parameters such as:
- Number of active OSDs.
- CPU affinity.
- Network MTU (maximum transmission unit).
- Number of active MDS nodes.
- Use of SATA/SAS SSDs versus NVMe drives.
- Variations in osd_memory_target and bluestore_cache_size.
Each change was measured to determine whether it provided improvements, was neutral, or degraded performance.
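By way of illustration, the sketch below shows how such a single-parameter sweep could be scripted as a dry run that only prints the corresponding ceph config set commands. The candidate values are examples chosen here, not the ones used in the study, and each applied change would be followed by a fresh IO500 run before moving on.

```python
# Dry-run sketch of a single-parameter sweep: print the "ceph config set"
# commands that would apply each candidate value. Values are illustrative.
CANDIDATES = {
    "osd_memory_target": [4 * 1024**3, 6 * 1024**3, 8 * 1024**3],  # 4, 6, 8 GiB
    "bluestore_cache_size": [0, 3 * 1024**3],  # 0 lets BlueStore use its per-media default
}

for option, values in CANDIDATES.items():
    for value in values:
        # One change at a time: apply the option, re-run the benchmark,
        # record the result, then move on to the next candidate.
        print(f"ceph config set osd {option} {value}")
```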
Key Findings: What Really Makes a Difference
One of the clearest results was the impact of CPU performance on metadata tests. IO500 includes metadata-intensive operations (creating, listing, deleting files), and in this area, the bottleneck was not physical storage but the processing capacity of the MDS nodes.
Another relevant conclusion was the direct relationship between the number of OSDs and overall performance. More active OSDs, up to an optimal point, allowed better parallelism, especially in sequential read and write operations.
On the networking side, increasing the MTU to high values (such as 9000) did not yield consistent improvements in all scenarios. This suggests that in certain Ceph environments, network optimization does not always translate into noticeable gains if other factors are the true bottleneck.
Effective Optimization: Changes That Worked
From all the tests, several optimizations proved to deliver reproducible improvements:
- Assign more CPU and RAM to MDS nodes: increasing resources for metadata servers drastically reduced latency in small-size operations.
- Increase the number of OSDs in a balanced way: a higher number of well-distributed, homogeneous OSDs boosted sequential and random I/O performance.
- Use NVMe for journaling and metadata: storing metadata on NVMe devices improved small operations, where storage medium latency is critical.
- Adjust osd_memory_target: matching the OSD cache memory to the available physical RAM optimized resource usage, avoiding swapping and keeping frequently used data readily accessible.
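As a rough illustration of that last point, the short sketch below derives a per-OSD memory target from a node's physical RAM, the number of OSDs it hosts, and a fraction reserved as headroom. The 512 GiB node and the 0.75 fraction are assumptions for the example, not figures from the study.

```python
GiB = 1024 ** 3

def per_osd_memory_target(node_ram_bytes: int, osds_per_node: int,
                          usable_fraction: float = 0.75) -> int:
    """Bytes of RAM to give each OSD, leaving headroom for the OS and
    other daemons. All inputs used here are illustrative assumptions."""
    return int(node_ram_bytes * usable_fraction / osds_per_node)

if __name__ == "__main__":
    target = per_osd_memory_target(node_ram_bytes=512 * GiB, osds_per_node=20)
    # 512 GiB * 0.75 / 20 OSDs is roughly 19.2 GiB per OSD
    print(f"ceph config set osd osd_memory_target {target}")
```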
Ineffective or Counterproductive Optimizations
The analysis also helped debunk some myths:
- Blindly increasing the MTU did not always deliver benefits, and in some cases (for example, when not every device in the network path supported jumbo frames) introduced additional latency due to packet fragmentation and reassembly.
- Excessively increasing client threads degraded performance by saturating the CPU and lengthening I/O queues.
- Overly aggressive cache settings caused latency spikes when the system had to abruptly free memory.
The Best Reproducible Result
After multiple iterations, the testing team achieved a configuration offering the best balance between small and large operation performance. This setup included:
- 2 active MDS nodes with increased CPU and RAM.
- 20 OSDs per node using NVMe storage.
- osd_memory_target optimized to 75% of the RAM available per OSD.
- Standard MTU, avoiding unnecessary changes in stable environments.
- Data and metadata stored on separate devices.
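Two of the points above map onto well-known Ceph commands, printed below as a dry run. The file system name (cephfs), the metadata pool name (cephfs_metadata), and the nvme device class are assumptions and would need to be adapted to the actual deployment.

```python
# Dry run: print the commands behind "2 active MDS nodes" and
# "data and metadata on separate devices". Names are assumptions.
commands = [
    # Run two active MDS daemons for the file system.
    "ceph fs set cephfs max_mds 2",
    # Replicated CRUSH rule restricted to OSDs carrying the nvme device
    # class (Ceph may auto-classify NVMe drives as ssd; adjust if so).
    "ceph osd crush rule create-replicated nvme-only default host nvme",
    # Pin the CephFS metadata pool to that rule.
    "ceph osd pool set cephfs_metadata crush_rule nvme-only",
]

for cmd in commands:
    print(cmd)
```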
With this configuration, IO500 metrics consistently outperformed baseline scenarios by over 30% in small operations and 20% in large transfers.
Practical Recommendations for Ceph Administrators
From these results, several guidelines emerge for those managing Ceph environments:
- Always measure before and after any change, using reliable benchmarks and representative workloads (see the comparison sketch after this list).
- Prioritize CPU and RAM for MDS when metadata operations are critical to the application.
- Scale OSDs wisely: more is not always better if the network or CPU cannot handle the load.
- Isolate metadata and journals on NVMe to reduce latency.
- Avoid massive changes without prior testing, especially in networking and cache parameters.
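For the first recommendation, the minimal sketch below shows one way to compare two benchmark summaries before and after a change. The metric names follow IO500 phases, but every number here is a placeholder rather than a result from the study.

```python
# Compare a baseline and a tuned run metric by metric (placeholder values).
baseline = {"ior_easy_write_GiB_s": 18.2, "mdtest_hard_stat_kIOPS": 95.0}
tuned = {"ior_easy_write_GiB_s": 22.1, "mdtest_hard_stat_kIOPS": 128.4}

for metric, before in baseline.items():
    after = tuned[metric]
    change = (after - before) / before * 100
    print(f"{metric}: {before:.1f} -> {after:.1f} ({change:+.1f} %)")
```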
The Importance of Continuous Validation
The study confirms that there are no magic configurations that fit all Ceph deployments. The optimal combination of hardware and parameters depends on workload nature, access patterns, and available resources.
In this sense, replicating a methodical approach — like the one used in these IO500 tests — can make the difference between a system that just works and one that fully leverages its hardware capabilities.
Frequently Asked Questions (FAQ)
1. What is IO500 and why is it used to test Ceph?
IO500 is a standard HPC benchmark that evaluates storage system performance by measuring both small and large operations in reads and writes. Its relevance lies in simulating mixed workloads similar to real-world production scenarios.
2. Does increasing MTU to 9000 always improve Ceph performance?
Not necessarily. While it theoretically reduces network overhead, in these tests it did not provide significant gains and sometimes introduced latency. It depends on the topology and overall configuration.
3. Is it better to use SSD or NVMe for Ceph?
NVMe drives offer lower latency and higher performance than traditional SSDs, particularly beneficial for metadata and journaling. However, cost and availability can influence the final decision.
4. How many OSDs per node are recommended?
There is no universal number. In these tests, 20 OSDs per node with balanced hardware performed well, but each environment should validate based on its load and resources.
via: croit.io