A strong theme in my recent blog posts, including my previous post on the value of Red Hat Ceph Storage, is that data is a business’ most precious asset.
[Also check out: Don’t Wait for an Outage: Plan Now to Protect Your Data]
Protecting data was front and center at Rackspace when we worked with Red Hat to create our Red Hat Ceph Storage reference architecture. When designing any solution, it is easy to get carried away and select only individual components considered best in class. However, that can cause costs to spiral, as would building a bespoke snowflake solution for every deployment.
Being pragmatic and balancing capacity, performance, operational complexity, support burden, and cost can be tough with competing priorities, but it is critical to building a repeatable, scalable system that is highly optimized yet remains affordable.
The Rackspace reference architecture for Ceph Storage provides two types of storage nodes — one designed for performance and another for capacity. We deploy a minimum of five of either node type to create a cluster. To gain the greatest density, we like to use servers that support 24 x 2.5” disks. While it could be argued that, in terms of performance, the node itself becomes a bottleneck, we find this solution provides a great low-latency offering at a reasonable density.
Similarly, for our capacity offering we use the same chassis, populating it with a small number of SSDs for performance and filling the remaining slots with NL-SAS HDDs for high capacity. This provides dense capacity while also delivering good throughput, as a large number of spindles remains available. Typically, we run the Ceph monitor daemons either on the management nodes of the cloud or, in larger deployments, on some of the storage nodes themselves. We have found that dedicating physical servers to monitor duties is typically not necessary, except in the largest of clusters.
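As a rough illustration of how raw disk slots translate into usable space, the sketch below estimates usable capacity for a small capacity-node cluster. The disk counts and sizes are illustrative assumptions, not the actual bill of materials of the reference architecture:

```python
# Rough usable-capacity estimate for a capacity-oriented Ceph cluster.
# Disk counts and sizes here are assumptions for illustration only.

def usable_capacity_tb(nodes, hdds_per_node, hdd_tb, replicas=3):
    """Raw HDD capacity divided by the replication factor."""
    raw_tb = nodes * hdds_per_node * hdd_tb
    return raw_tb / replicas

# Five nodes, each with 20 x 4 TB NL-SAS HDDs (assuming the remaining
# four slots of a 24-slot chassis hold SSDs): 400 TB raw, ~133 TB
# usable with 3x replication.
print(usable_capacity_tb(5, 20, 4))
```

This is deliberately simplified — it ignores filesystem overhead and the free space Ceph needs to rebalance after a failure — but it shows why replica count dominates any capacity-planning conversation.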
How many nodes?
I’m often asked if it’s possible to deploy a Ceph cluster of just three nodes, usually with the assumption that five nodes is either too much capacity or too expensive. While a three-node cluster is technically feasible, I pose the following analogy (as a Brit, I tend to think in terms of cups of tea and biscuits, but feel free to substitute a beverage and snack of your own choosing): with five nodes, if a single node fails, you can relax, have a cup of tea and a biscuit, and then fix the failed storage node. With just three nodes, you’re forced to drop that tea and biscuit to fix the failed node immediately.
In a three-node cluster, when a single node fails, your data is still safe, but the failure of one more node puts you at risk of a read-only cluster — and angry application owners. While the risk of another node failure is low, it is definitely possible, and not a risk I personally would be willing to take. The larger the cluster, the less impact failures have.
Think of each storage node as a percentage of your total performance or capacity; when you lose a node, you lose that percentage of those capabilities. In a three-node cluster, each node represents 33 percent of your resources; in a five-node cluster, 20 percent; and in a 100-node cluster, just one percent.
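The arithmetic above can be captured in a couple of lines — a trivial sketch, but a handy one to keep around when sizing clusters:

```python
def node_failure_impact(cluster_size):
    """Percentage of cluster resources lost when a single node fails."""
    return 100.0 / cluster_size

# The larger the cluster, the smaller the blast radius of one failure.
for n in (3, 5, 100):
    print(f"{n}-node cluster: one failure removes "
          f"{node_failure_impact(n):.1f}% of resources")
```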
Larger clusters are where Red Hat Ceph Storage really shines. Its auto-healing features maintain the integrity of your data, essentially masking large and small hardware failures, which in turn allows you to enjoy a spot of tea (or your beverage of choice). This lets you replace failed hardware in scheduled operational windows at your convenience, with no impact to the rest of the storage cluster and no integrity risk to your data.
The right flash storage
In any Ceph cluster, whether based on our reference architecture or your own, appropriate flash storage in each Ceph storage node is imperative to the performance of the cluster — either an SSD in a traditional SAS or SATA disk package, or flash on a PCIe card.
At a minimum, Ceph journals should be placed on SSDs. Ceph is very write-heavy on partitions assigned to journal duty, as the journals capture all writes destined for an OSD, providing a playback buffer should the daemon or storage node crash before a write is successfully acknowledged by the underlying persistent media. Any flash device used for this purpose must come from a manufacturer’s “write intensive” product line; otherwise, these devices can wear out over time, as flash storage has a finite write endurance, measured in Drive Writes Per Day (DWPD).
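To make the DWPD point concrete, here is a back-of-the-envelope service-life estimate. The drive sizes, ratings, and daily write volume are assumed example figures, and real wear also depends on write amplification, which this sketch ignores:

```python
def estimated_ssd_life_years(capacity_gb, dwpd, daily_writes_gb,
                             warranty_years=5):
    """Very rough SSD service-life estimate from its DWPD rating.

    DWPD is quoted over a warranty period (five years assumed here).
    """
    rated_total_writes_gb = capacity_gb * dwpd * 365 * warranty_years
    return rated_total_writes_gb / (daily_writes_gb * 365)

# A hypothetical 400 GB write-intensive drive (10 DWPD) absorbing
# 2 TB/day of journal traffic lasts roughly a decade; a 0.3 DWPD
# read-intensive drive of the same size would wear out in months.
print(estimated_ssd_life_years(400, 10, 2000))
print(estimated_ssd_life_years(400, 0.3, 2000))
```

The gap between the two results is exactly why the “write intensive” product line matters for journal duty.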
Build in extra overhead
Every computer system is bound by the resources available to it. In a storage system, a lack of CPU or RAM is greatly magnified, as any contention can lead to higher latency and lower throughput for many end users at once.
With that in mind, it’s always wise to build in extra overhead, not only to handle peak periods but for any time system resources are in demand. For example, during the rebalancing of an OSD, CPU and RAM requirements can balloon to more than double those of normal operations. With extra overhead built in, these processes do not affect production operations, and the cluster returns to a healthy state more quickly.
Rebalancing happens whenever a cluster’s characteristics change — a disk fails, a node is added to the cluster, and so on. This allows Ceph not only to make the most efficient use of the available hardware dynamically, but also to ensure that cluster performance scales close to linearly as nodes are added. The trick is to throttle the process sufficiently so it doesn’t adversely affect production operations. The one exception is perhaps adding nodes during off hours, when you are confident the rebalance activity will subside before normal activity returns.
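Ceph exposes throttles for this in its configuration. The fragment below shows the relevant recovery and backfill options in a `ceph.conf`; the values are illustrative, not recommendations — defaults vary between Ceph releases, so test against your own workload before changing them:

```ini
[osd]
# Concurrent backfill operations allowed per OSD
osd max backfills = 1
# Concurrent active recovery operations per OSD
osd recovery max active = 1
# Deprioritize recovery I/O relative to client I/O
osd recovery op priority = 1
osd client op priority = 63
```

Tightening these slows healing but keeps client latency steady; loosening them does the reverse, which is why an off-hours node addition can afford a more aggressive rebalance.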
Networking matters
Networking in Ceph is extremely important. Put very simply, a Ceph cluster is a bunch of commodity servers connected together by a network. We use multiple active-active Ethernet bonds in our Ceph cluster networks.
The first of those bonds is used for compute node to storage communication (the front-end network), and a secondary active-active bond for cluster communication (the back-end network).
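As a sketch of what one of those bonds might look like on a RHEL-family storage node, here is an illustrative interface file. The bonding mode, interface names, and addresses are assumptions — active-active bonding is commonly done with LACP (802.3ad), but your switch configuration dictates the right mode:

```ini
# /etc/sysconfig/network-scripts/ifcfg-bond0 -- illustrative example
# of a front-end storage bond (addresses and options are assumptions)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100 lacp_rate=fast"
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.0.2.10
NETMASK=255.255.255.0
```

A second bond, configured the same way on separate NIC ports, would carry the back-end cluster traffic.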
It is important to segregate these two types of traffic and provide plentiful bandwidth for each. Eventually it will make commercial sense to use 25, 40, or 50GbE at the storage node level, especially as the cost of 40GbE NICs and switches rapidly becomes more attractive and more 25/50GbE products come to market. At the current time, however, multiple bonded 10GbE links provide the greatest flexibility and the best cost per Gbps.
Plan ahead for growth
Finally, with any storage system it is always wise to plan ahead for growth. Once you grow beyond one or two cabinets of equipment, you can use the physical placement of storage nodes to increase redundancy against cabinet-scale disasters.
For example, in a 3-Replica configuration, you should consider three cabinets as three different failure domains, and store a replica in each one of those. The only caveat is ensuring that the network layer provides enough inter-cabinet bandwidth to not cause bottlenecks in the cluster network. We ensure plentiful bandwidth in configurations of this size by building out dedicated aggregation layers for our deployments, and this is an area where 40GbE definitely makes sense to minimize the number of physical links required from each cabinet.
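This cabinet-level placement is expressed in Ceph’s CRUSH map. The rule below, in decompiled crushmap syntax, places each replica in a different rack; the rule name is an assumption, and your CRUSH hierarchy must actually define rack buckets for it to work:

```
# Illustrative CRUSH rule: one replica per rack (cabinet).
rule replicated_rack {
    ruleset 1
    type replicated
    min_size 2
    max_size 3
    step take default
    step chooseleaf firstn 0 type rack
    step emit
}
```

With a rule like this in place, losing an entire cabinet costs you only one of the three replicas, which is the redundancy payoff described above.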
Red Hat Ceph Storage is a software-defined storage solution with extreme flexibility that can accommodate a multitude of workload scenarios, but that flexibility can introduce complexity. Rackspace can help remove that complexity with fully managed Ceph clusters as part of our fully managed Rackspace Private Cloud powered by Red Hat.