Object Storage Tomorrow: Erasure Coding



My last post talked about the benefits of Object Storage today. My next few posts will talk about the future direction of Object Storage technology, and the upcoming features that you should be aware of. Today, I will talk about Erasure Coding. Once again, this post will be Swift-centric, but many of these concepts will likely be embraced by other platforms as well.

The What of Erasure Coding

Erasure coding (EC) is an established technology: it is the principle behind parity-based RAID. EC transforms an object into data segments plus parity data, so that the object can be rebuilt even if some segments are lost.

EC schemes are generally described by the number of data segments plus the number of parity segments. For example, a 5-disk RAID5 effectively has 4 data disks and 1 parity disk. This is written as 4+1, indicating, depending on which way you want to look at it, that any 4 of the 5 disks are enough to reconstruct all of the data, or that you can lose a single disk without losing any data.
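To make the 4+1 idea concrete, here is a minimal sketch in Python that uses plain XOR parity, the same idea RAID5 relies on. The segment contents and the helper names (`xor_bytes`, `make_parity`, `rebuild_missing`) are made up for illustration; real object storage systems use more general codes.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings together."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(segments: list) -> bytes:
    """Build one parity segment as the XOR of all data segments."""
    return reduce(xor_bytes, segments)


def rebuild_missing(surviving: list) -> bytes:
    """Rebuild the one missing segment by XOR-ing every surviving
    segment (data and parity) together."""
    return reduce(xor_bytes, surviving)


# Four equal-length data segments (the "4" in 4+1); contents are arbitrary.
data = [b"obje", b"ct-s", b"tora", b"ge!!"]
parity = make_parity(data)  # the "+1"

# Simulate losing the third data segment, then rebuild it from the rest.
surviving = data[:2] + data[3:] + [parity]
recovered = rebuild_missing(surviving)
```

With a single parity segment, any one lost segment can be rebuilt this way; losing two at once is not recoverable, which is where additional parity segments come in.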

If we were to add another parity disk (RAID6 or ADG depending on your storage vendor), we would have a 4+2 model. The models generally discussed for Object Storage are 10+4 or 10+5, although there is no reason why other models could not be used (e.g. 10+2, 100+15).
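The trade-off between these models is simple arithmetic. The sketch below (the function name `ec_profile` is hypothetical) computes the raw storage overhead and the number of lost segments each model can tolerate, assuming an ideal code in which any set of data-segment-count survivors is enough to rebuild the object.

```python
def ec_profile(data_segments: int, parity_segments: int) -> dict:
    """Summarize a data+parity model, assuming any `data_segments`
    surviving segments are sufficient to rebuild the object."""
    total = data_segments + parity_segments
    return {
        "model": f"{data_segments}+{parity_segments}",
        # Raw bytes stored per byte of user data.
        "overhead": total / data_segments,
        # Segments that can be lost with no data loss.
        "tolerated_failures": parity_segments,
    }


# The models mentioned above.
profiles = [ec_profile(k, m) for k, m in [(4, 1), (4, 2), (10, 4), (10, 5)]]
for p in profiles:
    print(f"{p['model']}: {p['overhead']:.2f}x overhead, "
          f"survives {p['tolerated_failures']} lost segments")
```

Note that 4+2 and 10+5 have the same overhead, but 10+5 survives more than twice as many failures, at the cost of spreading each object across 15 devices instead of 6.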

For a bit more detail on the concept, see this Wikipedia article.

The Why of Erasure Coding

The biggest benefit of EC is that the 3x storage overhead of Swift's replication model can be reduced to a lower number (probably 1.2x to 1.5x depending on durability requirements). The downside is that object storage, retrieval, and error correction (in the event of bit-rot, drive failure, etc.) are significantly more computationally intense, making EC ideally suited for objects that are accessed less frequently (“warm” storage).

For many large-scale storage use cases, the vast majority of data stored is not accessed on a regular basis. When this is the case, there is minimal downside to using erasure coding for cold data, and the potential for a massive reduction in total cost of ownership (e.g. in an environment where 90 percent of the data is cold, and a 1.2x overhead is acceptable for that data, raw capacity requirements drop by over 50 percent compared to 3x replication, with a correspondingly large reduction in TCO). This greatly enhances the archival value of the Swift platform, as it allows you to achieve tape-like costs with “warm” storage performance characteristics.
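The 90 percent example can be checked with a few lines of Python. This is a back-of-the-envelope sketch: the function name `blended_overhead` is hypothetical, and it counts raw capacity only, so the actual TCO effect will depend on how much of your cost is driven by disks rather than CPU, network, and operations.

```python
def blended_overhead(cold_fraction: float,
                     cold_overhead: float,
                     hot_overhead: float = 3.0) -> float:
    """Average raw-storage overhead when cold data is erasure coded
    and hot data keeps 3x replication."""
    hot_fraction = 1.0 - cold_fraction
    return cold_fraction * cold_overhead + hot_fraction * hot_overhead


# 90 percent of the data is cold and stored at 1.2x; the rest stays at 3x.
blended = blended_overhead(cold_fraction=0.9, cold_overhead=1.2)

# Raw capacity saved relative to replicating everything 3x.
saving = 1.0 - blended / 3.0

print(f"blended overhead: {blended:.2f}x, capacity saving: {saving:.0%}")
```

In other words, 0.9 × 1.2 + 0.1 × 3 = 1.38x on average, which is a 54 percent reduction in raw capacity compared to 3x everywhere.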

Another potential benefit is greater durability for EC data. A good example is to compare EC and 3x replication at the same 3x overhead. EC data could be structured in, e.g., 30 “slices,” any 10 of which could be used to recreate the object. This would match the 3x overhead of Swift’s replication model, but while 3x replicas would result in data loss with as few as 3 drive failures (ignoring Swift’s built-in data protection features for a minute), the EC object would require 21 drive failures before the data would be lost. At that point the odds of a datacenter-level catastrophe would likely be significantly greater than the odds of data loss through hardware failure.
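To put a rough number on that difference, the sketch below compares the chance of losing one object under both layouts. It rests on strong simplifying assumptions that are mine, not Swift's: every drive holding a piece of the object fails independently with the same probability before any repair happens, and the 2 percent figure is purely illustrative. The function name `loss_probability` is hypothetical.

```python
from math import comb


def loss_probability(total_pieces: int, pieces_needed: int,
                     p_fail: float) -> float:
    """Probability that fewer than `pieces_needed` of `total_pieces`
    survive, with each piece failing independently with `p_fail`."""
    min_failures_for_loss = total_pieces - pieces_needed + 1
    return sum(
        comb(total_pieces, j) * p_fail**j * (1 - p_fail)**(total_pieces - j)
        for j in range(min_failures_for_loss, total_pieces + 1)
    )


P_FAIL = 0.02  # illustrative per-drive failure probability (an assumption)

# 3x replication: 3 pieces, any 1 is enough, so all 3 must fail.
replication_loss = loss_probability(3, 1, P_FAIL)

# 30 slices, any 10 are enough, so 21 or more must fail.
ec_loss = loss_probability(30, 10, P_FAIL)

print(f"3x replication: {replication_loss:.3e}")
print(f"10-of-30 EC:    {ec_loss:.3e}")
```

Both layouts consume 3x raw capacity, but under these assumptions the EC layout is many orders of magnitude less likely to lose the object.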

The Why Not of Erasure Coding

There are already several EC-based solutions available on the market today (e.g. Cleversafe and Scality); however, a pure EC solution carries a dramatic performance penalty compared to a replica-based solution. The algorithms available today can encode very efficiently, but retrieving the object is an O(n) operation, so will be considerably slower and more CPU-intense than a replication-based solution. Additionally, since each replica is a full copy of the object in the replica-based solution, object retrieval can happen at 3x the rate of the EC system, where only a single, costly-to-retrieve copy is stored. The easy resolution is to take the best of both models: use replication-based data protection for “hot” data and EC for “cold” data, which is what Swift’s approach seeks to accomplish.

Another major hurdle is that, due to the relatively large number of “slices” per object, EC requires a substantially larger number of devices to achieve a desired level of fault tolerance, so it is not suited for small clusters.

Finally, EC is not efficient for encoding small objects, so for datasets involving primarily less than 1 million files, EC is not appropriate (this could be worked around by aggregating a number of files into a single archive, but at a further performance penalty, and greater management overhead).

The When of Erasure Coding

This feature is expected to reach production-ready status within a few months of the OpenStack Juno design summit in May 2014. More details are available in this blog post by Swift PTL John Dickinson.

About the Author

This is a post written and contributed by Jonathan Kelly.

Jonathan Kelly joined Rackspace in 2004 as a Datacenter Operations Engineer. He has since held a variety of engineering and architectural roles, and now works as a Solutions Architect for the Rackspace Private Cloud group. Jonathan has a passion for cloud computing, and emerging technologies in general. He can also deadlift over 400 lbs.

  • Wim Provoost

    Hi Jonathan,

    great post on the future of storage and the role of erasure coding.

    “The algorithms available today can encode very efficiently, but retrieving the object is an O( n ) operation, so will be considerably slower and more CPU-intense than a replication-based solution.”

    I think that’s a bit too simple. Retrieving an unencoded large (ie larger than the MTU) object via a network is O( n ) too. Calculating the locations will need to happen both in 3 way replication and with erasure coding. If the object you are fetching is larger than the MTU you will need to make multiple retrievals in both cases. There are arguments pro replication f.e. small file but CPU cost on retrieval is not one of them.

    Anyway, if you do a simple 8 3 Reed Solomon scheme, then you have 8 data parts and 3 others (so basically your coding matrix has the 8×8 unit matrix as the upper part and 3 messier rows). Your original data is in those 8 parts. Retrieving the original object can boil down to retrieving the fragments in the right order (1..8) and concatenating them.

    Wim Provoost

    Product Manager Open vStorage
