This is my first blog post in a multi-part deep dive series that will explore one of my favorite components of OpenStack — Swift. My main focus is to place my years of Swift experience into these posts. Swift is well documented if you know exactly what you’re looking for — but if you’re at a loss, this series will help!
- Part 1 will provide an overview of Swift data placement and how to best design your Swift cluster
- Part 2 will show you the basics of Swift-recon and some deep dives on troubleshooting
- Part 3 will focus on troubleshooting, identifying objects and reverse-engineering objects and files on disk
These posts assume the reader has a basic understanding of Swift fundamentals, and they add some nuggets you may not know.
The 20,000-foot view
Swift is the open source object storage component of OpenStack. It can be used with other OpenStack projects such as Nova or as a standalone product. It uses open source programs such as Python and rsync to carry out day-to-day tasks, scan files for corruption and move files into a safe place in the event of node downtime or drive failure.
There are a few basic parts to Swift: proxy, account/container and object. You can combine account/container/object on a single node (sharing the same storage) OR you can break them out on different nodes to increase speed and data durability. Breaking out services is smart. It increases fault tolerance and allows more failures to take place without disrupting or degrading service.
Swift’s data placement is driven by three ring files, one each for the account, container and object services. These rings contain the IPs of the account/container/object nodes, the drives used for storage, and the partition information that determines which nodes are responsible for what. What’s the Ring, you might ask? Taken from the definitive source on the Ring, the Ring refers to three files that are shared among every Swift node (storage and proxy nodes):
There is actually one ring per type of data manipulated by Swift: objects, containers and accounts. These files determine on which physical devices (hard disks) each object will be stored (and also each container and account).
The number of devices on which an object is stored depends on the number of replicas (copies) specified for the Swift cluster. You can configure the number of replicas you wish to keep on your cluster when the rings are created or later, via storage policies. However, storage policies will not affect account and container replicas. As you scale out your Swift cluster, the ring will automagically move the existing data on the cluster to the new devices just by adding them to the ring and adjusting the weight over time.
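To make the ring idea concrete, here is a minimal sketch of how an object path can be hashed to a partition, and how a partition can be looked up in a replica-to-device table. It is a simplification and not Swift's actual code: the partition power, the toy table and the function names are assumptions for illustration, and a real cluster also mixes a per-cluster secret into the hash.

```python
import hashlib
import struct

PART_POWER = 10  # assumption: 2**10 = 1024 partitions in this toy ring


def partition_for(account, container, obj, part_power=PART_POWER):
    """Map an object path to a partition using the top bits of its MD5."""
    path = f"/{account}/{container}/{obj}".encode("utf-8")
    digest = hashlib.md5(path).digest()
    # Read the first 4 bytes as a big-endian unsigned int,
    # then keep only the top `part_power` bits.
    top32 = struct.unpack_from(">I", digest)[0]
    return top32 >> (32 - part_power)


def devices_for(partition, replica2part2dev):
    """Return the device id holding each replica of a partition."""
    return [row[partition] for row in replica2part2dev]


# Toy table: 3 replicas spread over 6 hypothetical devices (ids 0..5).
toy_table = [
    [(part + replica * 2) % 6 for part in range(2 ** PART_POWER)]
    for replica in range(3)
]

part = partition_for("AUTH_demo", "photos", "cat.jpg")
print(part, devices_for(part, toy_table))
```

The point of the sketch is that every node holding the same ring files computes the same answer, so no central lookup service is needed to find where an object lives.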
Facts about Swift I wish I knew when I got started
Any customer wants you to maximize transactions per second. After all, no one talks a great deal about how well a sports car is built, just how fast it can go from 0 to 100 mph. Swift is no different: it’s all about client speed and making transactions flow as fast as possible.
Proxy tier maths to return a 201 Created after a write/upload is as follows:
- Client starts upload.
- Proxy opens connections to all primary nodes; if any primaries fail, connections are made to handoff nodes.
- When Replica / 2 + 1 writes have succeeded (using integer division), a 201 Created is returned.
Looking at an example with 3 replicas:
- A client writes a file to a container
- Client will not get a 201 Created returned unless 3/2 + 1 = 2 copies (using integer division) are written to disk
- You can safely assume two copies are flushed to stable storage; the “third” copy will be filled in by the replication passes. In normal operation, all three copies are stored and synced to disk. If only two get written, the missing copy will be filled in offline by a background replication daemon. If only one (or zero) gets written, a 5XX type response is given and the client will have to retry.
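The arithmetic above can be sketched in a few lines of Python. This follows the Replica / 2 + 1 formula as stated in this post rather than quoting Swift's source; the function names are hypothetical, and 503 is only a stand-in for the generic 5XX response.

```python
def write_quorum(replicas):
    """Number of successful writes needed, per the post's formula."""
    return replicas // 2 + 1  # integer division


def write_status(replicas, successful_writes):
    """Return 201 when quorum is met, otherwise a 5XX (503 as a stand-in)."""
    if successful_writes >= write_quorum(replicas):
        return 201
    return 503


print(write_quorum(3))      # 2
print(write_status(3, 3))   # 201
print(write_status(3, 2))   # 201
print(write_status(3, 1))   # 503
```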
These assumptions should be the foundation for how you build your cluster’s layout. When a proxy server encounters an error writing to a primary location, a handoff node will be chosen. When this occurs, we rely on replication to get the data to the correct primary once the primary becomes available again. There are a multitude of reasons why a primary would be offline, including a downtime event, a networking issue or a drive that is too busy to accept the request at the moment. On a subsequent GET/HEAD to the object, the object need only be present on one primary (or one handoff) for a successful response.
Data placement in Swift
Swift is unique in that data placement is done with these principles in mind: Region -> Zone -> Disk. We are bypassing the “node” since the failure domain in this case is the disk. Since multi-region is out of scope for this post, let’s focus on zones. Zones can be a very confusing concept for some. A zone is abstract: it can be a data center, a data hall, a power feed, a rack, and the list goes on. Let’s explore how I have come to terms with zoning production workloads; the intent is to be logical for the data center administrators and the engineers running the infrastructure.
Here is my logic, which has gotten me by so far:
- Let’s say you have object A.
- You write it to the Swift cluster: three replicas, three cabinets and three zones. The proxy requires Replica / 2 + 1 successful writes, so at least two copies get written to disk even when accounting for failures.
- If all nodes are up and running, you now know you have at least two copies in two cabinets. The third copy will be filled in by replication if it was missed.
Fine and dandy right?
Let’s say it is third shift and a TOR (Top Of Rack) switch goes out on one cabinet. If you take the zoning idea I just mentioned then you KNOW with 100 percent accuracy that you have two copies still living in the other two cabinets, so don’t hit the panic button. Let’s say you lose three nodes, in three zones, z0, z1, z2. Well, you KNOW there is data “missing” and/or “unavailable” on the ring — hit the panic button!
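The two scenarios above can be checked with a small sketch. The node names and the placement of object A are hypothetical; the only assumption carried over from the text is one replica per zone, with one zone per cabinet.

```python
def surviving_copies(replica_nodes, down_nodes):
    """Return the replica locations that are still reachable."""
    return [node for node in replica_nodes if node not in down_nodes]


# Hypothetical placement of object A: one replica per zone (cabinet).
object_a = ["z0-node1", "z1-node3", "z2-node2"]

# Case 1: the TOR switch in cabinet z1 fails, taking every z1 node offline.
z1_outage = {"z1-node1", "z1-node2", "z1-node3"}
after_tor_failure = surviving_copies(object_a, z1_outage)

# Case 2: one node is lost in each of the three zones.
three_zone_outage = {"z0-node1", "z1-node3", "z2-node2"}
after_three_losses = surviving_copies(object_a, three_zone_outage)

print(len(after_tor_failure))   # 2
print(len(after_three_losses))  # 0
```

Losing a whole cabinet can never remove more than one replica of any object, while losing nodes across all three zones can remove every replica of the objects those nodes happen to share.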
Doing any server work is simple with this cabinet zoning. Here at Rackspace, we typically have a large-scale deployment with multiple cabinets. When some re-work is needed on the hardware, my instructions to our DC folks are to perform work on one cabinet at a time. This gives 100 percent certainty that multiple copies still survive and that clients don’t experience data loss or unavailable objects while nodes get cycled in and out of rotation.
Here’s another fun trick: to cut down on TOR switch to TOR switch traffic, you should install proxy servers in the same cabinet as the account /container / object servers and use read affinity and scope per cabinet. This will minimize cabinet cross talk and yield faster performance overall for most workloads.
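As a rough illustration of what read affinity buys you, the sketch below orders candidate nodes so that those in the proxy's own zone (cabinet) are tried first. This is a conceptual model only, not Swift's configuration syntax or its internal sorting code; the node dictionaries and IPs are made up.

```python
def sort_by_read_affinity(nodes, proxy_zone):
    """Order candidate nodes so those in the proxy's own zone come first."""
    # False sorts before True, and sorted() is stable,
    # so nodes outside the proxy's zone keep their original order.
    return sorted(nodes, key=lambda node: node["zone"] != proxy_zone)


nodes = [
    {"ip": "10.0.0.11", "zone": 0},
    {"ip": "10.0.1.11", "zone": 1},
    {"ip": "10.0.2.11", "zone": 2},
]

ordered = sort_by_read_affinity(nodes, proxy_zone=1)
print([n["ip"] for n in ordered])  # ['10.0.1.11', '10.0.0.11', '10.0.2.11']
```

A read served by the first node in that list never has to cross from one TOR switch to another.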
It’s important to understand that Swift benefits from hardware configured specifically for each tier’s requirements:
- Proxy servers will be heavy on CPU and networking since one object comes in and multiple copies get written to back-end object storage synchronously. Depending on the size of the ring you might need to increase the memory in the proxy tier.
- Account / container / object servers will require raw disk speed. Account / container servers would benefit from SSD or NVMe media, since the account / container services are SQLite databases and will sometimes require a high degree of throughput. Putting account / container SSDs together in a RAID array can achieve higher IO and should be used in high performance edge cases that demand it. This is the only acceptable place for hardware RAID in a Swift environment besides using RAID for OS drive durability.
- Object servers should be configured with HBAs or RAID cards using pass-through for object storage. XFS best practices also dictate that any HBA/RAID cache be disabled; leaving caching enabled could lead to corruption if there’s a failure while data is in cache.
Software and networking considerations
As you’d expect, tuning the network for Swift can be a challenge too. Some best practices we’ve discovered include:
- Account/container/object/proxy servers would benefit from separate networks: one for management and incoming writes, and another for replication. We have seen great performance gains from using LACP L3+4 bonding on both networks for increased throughput and network resiliency.
- If you’re setting up Swift for specific workload types, running bonnie++ against different XFS options such as agcount and logbsize can squeeze additional performance out of your disks. As always, test with the specific workload after synthetic benchmarking to make sure the results are as anticipated.
- Sysctl tunables for TCP connections deserve attention. If you’re running nf_conntrack, the defaults will need to be raised; as a starting point I would suggest 500k per account / container / object / proxy server, with monitoring in place to alert you if you start hitting those limits in production.
- TCP reuse should be enabled to cut down on any TCP handshake overhead that will occur in normal production.
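For the conntrack monitoring mentioned above, a minimal sketch of a check might look like the following. It assumes the limit being raised is nf_conntrack_max and reads the standard Linux counters under /proc/sys/net/netfilter; the 80 percent threshold and the function names are arbitrary choices for illustration, not a recommendation from this post.

```python
from pathlib import Path

CONNTRACK_DIR = Path("/proc/sys/net/netfilter")


def read_conntrack(base=CONNTRACK_DIR):
    """Return (count, max) from procfs, or None if conntrack is not loaded."""
    try:
        count = int((base / "nf_conntrack_count").read_text())
        maximum = int((base / "nf_conntrack_max").read_text())
    except (OSError, ValueError):
        return None
    return count, maximum


def should_alert(count, maximum, threshold=0.8):
    """True when tracked connections reach the given fraction of the limit."""
    return maximum > 0 and count / maximum >= threshold


stats = read_conntrack()
if stats is None:
    print("nf_conntrack counters not available on this host")
else:
    count, maximum = stats
    print(f"conntrack: {count}/{maximum}, alert={should_alert(count, maximum)}")
```

In practice you would feed the same two numbers into whatever monitoring system you already run rather than printing them.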
Faults on disk
As far as faults go, there is no magic with Swift and how it functions.
Swift uses open source binaries like rsync to move data around. If you have a failed drive, the data is not rebuilt until that drive is replaced or removed from the ring. This is key since other object stores will rebuild failed copies until that drive becomes available again or is removed from the ring.
If a faulty device or server is removed from the ring, its partitions are moved to a new or existing drive after a rebalance has occurred. If multiple drives fail and the cluster keeps operating this way over time, you increase the likelihood that those drives hold copies of the same data, and until you try to read that data back you may not know it is gone forever.
I hope you enjoyed this overview of Swift data placement and how to best design your in-house Swift cluster holistically, from cabinet design to service placement. If you set up your cluster according to these guidelines, with year-over-year growth in mind, Swift will run flawlessly for years to come with little to no effort besides periodically replacing faulty drives.
In the upcoming posts, I will cover day-to-day telemetry that can be pulled from your Swift cluster and how to read and react to different situations that you may encounter in your day-to-day operations of the cluster. I will also shed light on hard drive UREs/Read failures, running audits on Swift to keep it honest, how to best cope with silent corruption and keeping your data safe.