Operational Best Practices for Red Hat Ceph Storage


In this, the third in a series on the benefits of Red Hat Ceph Storage, I describe some of the key operational aspects of looking after your own Ceph Storage cluster.

My first post offered a general overview of Red Hat Ceph Storage, while in the second I provided some detail around the architectural choices behind our managed solutions.

Note: if you choose to consume OpenStack and Ceph Storage as a service, as part of Rackspace Private Cloud powered by Red Hat, Rackspace takes care of these tasks for you, which is one of the many benefits of consuming OpenStack as a service.

After deployment

Now that you’ve deployed your Ceph cluster and it contains valuable data, you need to manage it like any other system. While Ceph’s self-healing capabilities remove much of the system management headache associated with more traditional storage systems, we still need to monitor the cluster, investigate failures, and schedule maintenance activities.

A good starting point for checking a cluster’s health is “ceph -s”, which gives an overview of the cluster’s state. Start by checking for “HEALTH_OK”. If the cluster is not in that state, the output usually includes a short message that helps explain why. For example, are all the monitors (MONs) up? Does the “osdmap” show the correct number of OSDs, and are they all in the “up” and “in” states? Are any Placement Groups (PGs) in a state other than “active+clean”?
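These checks can also be scripted. The following is a minimal Python sketch, assuming the JSON produced by “ceph -s --format json” exposes “health.status”, the OSD counters under “osdmap”, and “pgmap.pgs_by_state”. Field names vary between Ceph releases, so the sample dictionary and the summarize_status helper are illustrative assumptions rather than a definitive parser.

```python
def summarize_status(status):
    """Return a list of human-readable problems found in a status dict."""
    problems = []

    health = status.get("health", {}).get("status", "UNKNOWN")
    if health != "HEALTH_OK":
        problems.append(f"cluster health is {health}")

    # Assumption: some releases nest the OSD counters one level deeper.
    osdmap = status.get("osdmap", {})
    osdmap = osdmap.get("osdmap", osdmap)
    total = osdmap.get("num_osds", 0)
    up = osdmap.get("num_up_osds", 0)
    osds_in = osdmap.get("num_in_osds", 0)
    if up < total or osds_in < total:
        problems.append(f"{total} OSDs: {up} up, {osds_in} in")

    # Report every PG state other than active+clean.
    for entry in status.get("pgmap", {}).get("pgs_by_state", []):
        if entry["state_name"] != "active+clean":
            problems.append(f"{entry['count']} PGs are {entry['state_name']}")
    return problems


# Hypothetical sample data; on a real cluster this dict would come from
# parsing the JSON output of the status command.
sample_status = {
    "health": {"status": "HEALTH_WARN"},
    "osdmap": {"num_osds": 12, "num_up_osds": 11, "num_in_osds": 12},
    "pgmap": {
        "pgs_by_state": [
            {"state_name": "active+clean", "count": 500},
            {"state_name": "active+undersized+degraded", "count": 12},
        ]
    },
}

for problem in summarize_status(sample_status):
    print(problem)
```

An empty list from summarize_status corresponds to the healthy case described above: “HEALTH_OK”, all OSDs “up” and “in”, and every PG “active+clean”.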

“ceph osd tree” provides further insight into individual OSD statuses, allowing you to identify those that are down, the servers on which they reside, and their position in the cluster hierarchy. For example, if you have numerous OSDs and servers down, that could point to a rack-scale event rather than a single disk or server failure.
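To illustrate how that grouping helps, here is a small Python sketch that maps down OSDs to their hosts. It assumes a tree structure similar to the JSON form of “ceph osd tree”, with a flat “nodes” list in which hosts reference their OSDs by “children” ids. The sample tree and the down_osds_by_host helper are hypothetical.

```python
from collections import defaultdict


def down_osds_by_host(tree):
    """Map each host name to the names of its OSDs that are reported down."""
    nodes = {node["id"]: node for node in tree["nodes"]}
    result = defaultdict(list)
    for node in tree["nodes"]:
        if node["type"] != "host":
            continue
        for child_id in node.get("children", []):
            child = nodes.get(child_id)
            if child and child["type"] == "osd" and child.get("status") == "down":
                result[node["name"]].append(child["name"])
    return dict(result)


# Hypothetical cluster: one rack, two hosts, four OSDs.
sample_tree = {
    "nodes": [
        {"id": -1, "name": "default", "type": "root", "children": [-2]},
        {"id": -2, "name": "rack1", "type": "rack", "children": [-3, -4]},
        {"id": -3, "name": "node1", "type": "host", "children": [0, 1]},
        {"id": -4, "name": "node2", "type": "host", "children": [2, 3]},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "down"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "down"},
        {"id": 2, "name": "osd.2", "type": "osd", "status": "up"},
        {"id": 3, "name": "osd.3", "type": "osd", "status": "down"},
    ]
}

print(down_osds_by_host(sample_tree))
```

If every OSD on several hosts in the same rack shows up in the result, a shared cause such as power or networking is a more likely explanation than independent disk failures.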


During normal operation of a Ceph cluster, if a node fails or if we add more capacity to the cluster, a rebalance is triggered. Depending on the scale of the failure or capacity addition, the amount of work the OSDs need to complete can vary. Rebalancing should never cause the cluster to go offline, but it can cause significant slowdowns by saturating system and network resources, which to some end-users can feel like a system outage.

The OSD operation in Ceph that resynchronizes data after a cluster change is called a backfill: the data held by primary and secondary OSDs is compared and, where required, copied to return the cluster to a normal state. The impact of a cluster change from a failure or capacity addition can be mitigated by reducing the number of simultaneous backfill operations. This reduces the back-end load and therefore the impact on production operations. It also increases the time taken for the rebalance to complete, but that is usually an acceptable trade-off.

For example, we can do this on the fly by issuing the following broadcast command to all running OSDs:

# ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'

Or we can set the same values in the [osd] section of the ceph.conf file so that they persist across OSD restarts:

[osd]
osd recovery max active = 1
osd max backfills = 1

As a rebalance progresses, we can speed things up by incrementally increasing these values while observing production operations for any slowdowns. If there is any adverse effect, we can reduce the values again.


We have observed across multiple deployments that it is generally better to err on the side of a greater number of Placement Groups in a cluster. With a lower number of PGs we have frequently observed hot spots, both in terms of IO activity and in terms of data being spread unevenly across the disks in the cluster. Using “ceph osd df” we can gain quick insight into how evenly the disks are being utilized.
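As a rough way to quantify how evenly the disks are used, the Python sketch below computes the spread of per-OSD utilization. It assumes input shaped like the JSON form of “ceph osd df”, with a “nodes” list carrying “name” and “utilization” (percent) fields. The sample numbers and the utilization_spread helper are hypothetical.

```python
from statistics import mean, pstdev


def utilization_spread(osd_df):
    """Summarize how evenly data is spread across OSDs (values in percent)."""
    utils = {node["name"]: node["utilization"] for node in osd_df["nodes"]}
    values = list(utils.values())
    fullest = max(utils, key=utils.get)
    emptiest = min(utils, key=utils.get)
    return {
        "mean": mean(values),
        "stdev": pstdev(values),
        "fullest": fullest,
        "emptiest": emptiest,
        "range": utils[fullest] - utils[emptiest],
    }


# Hypothetical utilization figures for four OSDs.
sample_osd_df = {
    "nodes": [
        {"name": "osd.0", "utilization": 40.0},
        {"name": "osd.1", "utilization": 50.0},
        {"name": "osd.2", "utilization": 60.0},
        {"name": "osd.3", "utilization": 70.0},
    ]
}

print(utilization_spread(sample_osd_df))
```

A large range or standard deviation suggests the kind of uneven data placement described above, and is a prompt to review the PG count for the affected pools.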

“ceph df”, like Linux’s “df”, provides information on the capacity of the cluster and shows how much capacity each storage pool is consuming. Remember that it is recommended to keep 15% of cluster capacity free so that Ceph has enough headroom to work around hardware failures without causing performance or availability issues, so plan ahead to ensure that your cluster never becomes full.
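The 15% guideline is simple arithmetic, shown here as a minimal Python sketch. The capacity figures are hypothetical, and the free_capacity_check helper is an illustration rather than part of any Ceph tooling. In practice the totals would come from the cluster-wide figures reported by “ceph df”.

```python
TIB = 2 ** 40


def free_capacity_check(total_bytes, used_bytes, min_free_fraction=0.15):
    """Return the free fraction and whether it meets the recommended minimum."""
    free_fraction = (total_bytes - used_bytes) / total_bytes
    return free_fraction, free_fraction >= min_free_fraction


# Hypothetical cluster: 100 TiB raw capacity, 88 TiB already used.
free_fraction, ok = free_capacity_check(100 * TIB, 88 * TIB)
print(f"free: {free_fraction:.0%}, meets 15% guideline: {ok}")
```

In this example only 12% is free, so the cluster is already inside the recommended headroom and capacity should be added before it fills further.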

In Ceph we can configure two types of storage pool: Replicated or Erasure Coded. While the latter offers greater usable capacity from our raw disks, there is a significant CPU overhead required to break data into chunks and calculate the additional parity chunks. Similarly, when data is requested, the cluster needs to recombine these chunks to return the original data to the requestor. So while it might seem like the perfect cost-saving approach, yielding a lower cost per GB stored, it is strongly discouraged for IO-intensive RBD workloads. For lower IO intensity object stores (accessed with either the Swift or S3 API), it is viable.
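The capacity difference between the two pool types can be worked through with a short Python sketch. The figures are assumptions chosen for illustration: 300 TB of raw capacity, a replicated pool keeping 3 copies, and an erasure coded pool with 4 data chunks (k) and 2 parity chunks (m). The calculation ignores the free-space headroom discussed earlier.

```python
def replicated_usable(raw_tb, size=3):
    """Usable capacity when every object is stored `size` times."""
    return raw_tb / size


def erasure_coded_usable(raw_tb, k, m):
    """Usable capacity when each object is split into k data + m parity chunks."""
    return raw_tb * k / (k + m)


raw_tb = 300  # hypothetical raw capacity in TB
print("replicated (3 copies):", replicated_usable(raw_tb), "TB usable")
print("erasure coded (k=4, m=2):", erasure_coded_usable(raw_tb, k=4, m=2), "TB usable")
```

With these assumed values the erasure coded pool yields twice the usable capacity of the replicated pool (200 TB against 100 TB), which is the cost saving that has to be weighed against the extra CPU work on every read and write.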

A Ceph cluster is a powerful, flexible, and scalable storage system. There are, however, a large number of optimizations that can be made, and they don’t fit every use case. Some customers do not want to take on that complexity and prefer to hand it to an expert operator, such as Rackspace.


Special thanks to Scott Gilbert, OpenStack Private Cloud Engineer, Rackspace for his input into this blog post.

Philip Williams joined Rackspace in 2011, after working for Yahoo! and EMC in storage-related roles. He has been responsible for designing solutions and infrastructure to support many different industry sectors, including Rackspace's multi-tenant environments. More recently he has been responsible for the integration of enterprise hardware and open source software-defined storage with Rackspace's OpenStack-based Private Cloud offering.