As anyone who’s ever tried to move a technology from development to production knows, operations and scaling are two of the most difficult elements to do well.
Nowhere is that more true than with the OpenStack cloud platform. Deploying an OpenStack cloud with tens of servers to support a few hundred users is a far cry from operating and scaling a cloud with thousands of servers supporting many thousands of users.
As one of the few cloud operators to reach that scale, Rackspace embraces our responsibility within the OpenStack community to give back and share what we’ve learned.
This post is a summary of those lessons, ideas and contributed code we’ve shared with the community related to operating at scale. It also includes capabilities that may become part of our private cloud offerings. Many thanks to Matt Van Winkle from our public cloud operations team for all the great information.
Nova cells is a construct Rackspace created to solve specific issues in our public cloud. Cells are now part of the OpenStack project and widely used by other companies, including CERN and GoDaddy. Cells are how OpenStack partitions Nova compute nodes into discrete groups — aka cells — each running its own Nova compute services, including its own Nova database and message queue broker. All the compute cells roll up to a global cell that hosts the global Nova API service.
While cells are typically discussed as a way to scale OpenStack, Rackspace actually created cells to address several use cases:
- Scaling – Cells have enabled us to scale our public cloud to many tens of thousands of nodes and 350,000+ cores while maintaining acceptable performance. Without the ability to partition our compute nodes, a single Nova database and RabbitMQ message broker would be overwhelmed and inhibit our ability to grow. Leveraging cells, we’ve been able to maintain a standard of ~100 hosts per cell.
- Reducing Failure Domains – One reason we limit a cell to ~100 hosts is to limit the impact of networking issues such as broadcast storms, which can create cascading issues that would go unchecked without cells as a boundary. We also use cells to limit the impact of failures to any single Nova database or RabbitMQ instance. The Rackspace Public Cloud spans multiple geographic regions and each region has multiple cells. Failures to the database or message broker in any given cell only impact that cell and not an entire region.
- Supporting Multiple Compute Flavors – The Rackspace Public Cloud has always used different hardware types and compute flavors. For example, when compute nodes with solid state drives were first introduced, users could launch general purpose instances on servers with SATA drives or new performance instances on servers with the new SSD drives. We use cells to group the older SATA servers together and the new SSD servers together. (Note that a Nova feature called host aggregates can serve a similar purpose, and in some cases it is combined with cells to partition compute nodes.)
- Supporting Multiple Hardware – We leverage live migration in our public cloud for operational tasks such as maintenance and troubleshooting. Live migration works best when instances are migrated across compute nodes with the same CPU type. Since Rackspace sources hardware from multiple vendors, we use cells to group servers from the same vendors together.
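As a rough illustration of how this partitioning is expressed, cells (in the original "cells v1" form) were switched on per cell in nova.conf, with child compute cells and a top-level API cell. The cell names below are invented and the option layout reflects the cells v1 era, so treat this as a sketch of the pattern rather than our exact configuration:

```ini
# nova.conf on a child (compute) cell -- runs its own DB and RabbitMQ
[cells]
enable = true
name = cell07
cell_type = compute

# nova.conf on the top-level cell -- hosts the global Nova API
# (a separate file on separate control-plane nodes)
# [cells]
# enable = true
# name = api
# cell_type = api
```

Each child cell points at its own database and message broker, which is what creates the failure-domain boundaries described above.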
One of the community projects focused on the deployment and operations of OpenStack is TripleO. A precursor to TripleO is iNova, which is what Rackspace uses to run our public cloud. iNova is a set of servers provisioned to run an OpenStack “under-cloud,” which acts as the control plane for an OpenStack “over-cloud.” This “over-cloud” is the actual Rackspace Public Cloud on which users provision instances.
Virtual machines are provisioned on a set of seed servers in each Rackspace Public Cloud region, and these VMs become the virtualized control plane for the iNova OpenStack under-cloud for that particular region. The control plane is then used to provision instances on a second set of servers that function as the iNova compute nodes. These instances then become the OpenStack control plane for our over-cloud — aka the Rackspace Public Cloud — and manage the compute nodes in the public cloud.
The implementation of a virtualized OpenStack control plane provides several benefits but also brings several challenges:
- It’s much easier to dynamically deploy, tear down and redeploy OpenStack services when they are running in VMs.
- We can react to issues more quickly. For example, if an unexpected spike in RabbitMQ traffic occurs due to some error in our cloud, we can quickly respond by spinning up multiple global cell workers to handle the spike until the issue is remediated.
- Since our control planes run as Nova compute instances, any bugs that affect our user instances will likely affect our control planes.
- We are increasing the complexity by adding an additional OpenStack cloud to every public cloud region.
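The "react quickly" benefit above boils down to capacity math: given a message backlog and a drain-time target, decide how many extra workers to provision. The helper below is a hypothetical sketch of that decision, not Rackspace's actual tooling; the function name, throughput figure, and caps are all illustrative:

```python
# Hypothetical sketch: decide how many extra global cell workers to
# launch when a RabbitMQ backlog spikes. All names and thresholds are
# illustrative assumptions, not Rackspace's real automation.

import math

def extra_workers_needed(queue_depth: int,
                         msgs_per_worker_per_min: int = 500,
                         drain_target_minutes: int = 10,
                         current_workers: int = 4,
                         max_workers: int = 32) -> int:
    """Return how many additional workers to provision so the
    backlog drains within the target window."""
    capacity_needed = queue_depth / drain_target_minutes   # msgs/min required
    workers_needed = math.ceil(capacity_needed / msgs_per_worker_per_min)
    extra = max(0, workers_needed - current_workers)
    # Never exceed the fleet-wide cap.
    return min(extra, max_workers - current_workers)

print(extra_workers_needed(queue_depth=40000))  # → 4
```

Because the control plane itself runs on VMs (or, in our private cloud, containers), acting on this number is just another provisioning call.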
Containers, a trend that has surfaced since the creation of iNova, could bring similar benefits without some of the challenges. We are considering moving the control plane to containers in the public cloud — a huge undertaking. Our Private Cloud, however, already runs OpenStack services in containers.
Virtualized Compute Nodes
One of the unique approaches Rackspace takes in operating OpenStack is our use of virtualized compute nodes for managing our hypervisor nodes. In a typical OpenStack deployment with KVM as the hypervisor, Nova compute services are installed directly on the hypervisor nodes, so that every hypervisor is also a compute node.
In our public cloud, we use XenServer instead of KVM as our hypervisor technology, in part because our pre-OpenStack public cloud was based on Xen. One benefit is that the XAPI interface does a good job of remotely managing the Xen hypervisor. So we run Nova compute services in a VM provisioned on the hypervisor node and use that VM to remotely manage the hypervisor node.
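In nova.conf terms, the arrangement looks roughly like the XenAPI driver configuration below. Option names and section layout varied across OpenStack releases, and the address and credentials here are placeholders, so treat this as a sketch:

```ini
[DEFAULT]
compute_driver = xenapi.XenAPIDriver

[xenserver]
# The nova-compute VM reaches the hypervisor it manages over XAPI,
# rather than running directly on the hypervisor itself.
connection_url = https://169.254.0.1
connection_username = root
connection_password = <secret>
```

The key point is the indirection: nova-compute talks to the hypervisor over a network API, which is what makes the isolation benefits below possible.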
While this is an admittedly unusual approach, it provides several operational benefits:
- By isolating the compute node in a VM from the hypervisor node, we can limit the impact of a misbehaving compute node. If we have to reboot a compute node or even rebuild it, we can do so without having to take down the hypervisor node or the instances running on that node.
- Separating the compute node in a VM from the hypervisor node also provides an additional layer of security, since our support techs can log on to a compute node without needing full access to the hypervisor node itself.
Techniques such as iNova and virtualized compute nodes are always being evaluated and refined, which means we are currently considering these changes:
- Using containers instead of VMs for our virtualized compute nodes
- Writing code that could further abstract the compute node from the hypervisor node so a single compute node can manage multiple hypervisors and/or have multiple compute nodes manage multiple hypervisors in a redundant setup.
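The second idea above can be made concrete with a small sketch. Everything here is hypothetical — the class names, the XAPI details, and the round-robin placement are invented for illustration — but it shows the shape of the abstraction: each hypervisor is managed by more than one compute node, so any single compute node can be rebuilt without losing management of its hypervisors.

```python
# Hypothetical sketch of one compute node managing multiple hypervisors,
# with redundant assignment. All names are illustrative.

from dataclasses import dataclass, field

@dataclass
class HypervisorEndpoint:
    address: str
    healthy: bool = True

@dataclass
class ComputeNode:
    """A single Nova compute service managing several hypervisors."""
    name: str
    hypervisors: list = field(default_factory=list)

    def healthy_targets(self):
        return [h for h in self.hypervisors if h.healthy]

def assign_redundantly(nodes, hypervisors, replicas=2):
    """Give each hypervisor `replicas` managing compute nodes,
    round-robin, so losing one compute node never orphans a hypervisor."""
    for i, hv in enumerate(hypervisors):
        for r in range(replicas):
            nodes[(i + r) % len(nodes)].hypervisors.append(hv)

nodes = [ComputeNode(f"compute{n}") for n in range(3)]
hvs = [HypervisorEndpoint(f"10.0.0.{n}") for n in range(6)]
assign_redundantly(nodes, hvs)
print([len(n.hypervisors) for n in nodes])  # → [4, 4, 4]
```

With two replicas, every hypervisor appears in exactly two compute nodes' lists, which is the redundant setup the bullet describes.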
One of the key lessons of distributed computing at scale is that failures are inevitable. Because of that, much of our focus has been on maintaining the availability of our cloud services even when individual components fail. We know this type of fleet management cannot be done manually, so Rackspace has, over the years, created tools to automate many operational tasks.
Those tools include Resolver and Auditor, used to discover and remediate issues in our public cloud and to make it as self-healing as possible. Resolver was written to automate repeated tasks such as rebooting certain services. It accepts inputs such as alerts, RabbitMQ messages and manual commands.
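At its core, a Resolver-style tool reduces every input — an alert, a RabbitMQ message, a manual command — to a task name plus a target, and dispatches that pair to a remediation handler. The sketch below is hypothetical (the task names and handlers are invented, and this is not Rackspace's actual code), but it captures the dispatch pattern:

```python
# Hypothetical sketch of Resolver-style dispatch: inputs from any source
# reduce to (task, target) pairs mapped to remediation actions.

ACTIONS = {}

def remediation(task):
    """Register a remediation handler for a task name."""
    def register(fn):
        ACTIONS[task] = fn
        return fn
    return register

@remediation("restart-service")
def restart_service(target):
    return f"restarted {target}"

@remediation("reboot-node")
def reboot_node(target):
    return f"rebooted {target}"

def resolve(task, target):
    handler = ACTIONS.get(task)
    if handler is None:
        # Unknown issues fall through to a human operator.
        return f"no handler for {task}; escalating {target} to a human"
    return handler(target)

print(resolve("restart-service", "nova-conductor@cell03"))
# → restarted nova-conductor@cell03
```

The registry approach matters at scale: adding a new automated remediation is just adding a handler, without touching the alerting or queue-consuming code.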
Auditor was created to automatically and continuously monitor the fleet of servers in the Rackspace Public Cloud to validate that they comply with a given set of rules. Alerts are created for servers that are flagged as out of compliance. In a growing number of cases, Auditor sends a message to Resolver, which will then take the appropriate action. Two examples include:
- Any nodes that are found by Auditor to be running the wrong code are flagged and submitted to Resolver for automated upgrades to the appropriate code.
- If Auditor finds hypervisor nodes that have to be rebooted for certain known issues, Resolver will live migrate instances off those nodes, reboot the nodes, then live migrate the instances back.
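The two examples above follow one pattern: a rule inspects a node record and, if the node is out of compliance, emits the task Resolver should run. The sketch below is a hypothetical illustration of that Auditor loop — the rule names, field names, and version string are all invented:

```python
# Hypothetical sketch of an Auditor-style compliance pass: each rule
# returns a remediation task name when a node is out of compliance.
# Field names and the version string are illustrative assumptions.

EXPECTED_VERSION = "2016.2.3"

def check_code_version(node):
    if node["version"] != EXPECTED_VERSION:
        return "upgrade-code"          # handed off to Resolver

def check_pending_reboot(node):
    if node.get("needs_reboot"):
        return "migrate-and-reboot"    # live migrate, reboot, migrate back

RULES = [check_code_version, check_pending_reboot]

def audit(fleet):
    """Return (node_name, task) pairs for every non-compliant node."""
    findings = []
    for node in fleet:
        for rule in RULES:
            task = rule(node)
            if task:
                findings.append((node["name"], task))
    return findings

fleet = [
    {"name": "hv01", "version": "2016.2.3"},
    {"name": "hv02", "version": "2016.2.1"},
    {"name": "hv03", "version": "2016.2.3", "needs_reboot": True},
]
print(audit(fleet))
# → [('hv02', 'upgrade-code'), ('hv03', 'migrate-and-reboot')]
```

Run continuously against the whole fleet and wired into a Resolver-style dispatcher, this is what makes the cloud self-healing rather than ticket-driven.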
This ability to automate and heal OpenStack clouds is a critical part of operating at scale. Efforts to do this in the community include the Stackanetes project, which enables the OpenStack control plane to run in Docker containers managed by Kubernetes. Kubernetes would then handle automated tasks such as restarting services. Yet while the Stackanetes project is encouraging, it is new and unproven, and its scalability still falls short of what is required to run a large-scale OpenStack cloud.
To move fleet management forward, Rackspace has open sourced some of our tools through a project called Craton. The Craton project allows us to share tools we’ve built with the community. We invite the community to look at Craton and work with us to make fleet management even more useful.
In the months ahead, Rackspace will be doing even more to share what we have learned and the tools we’ve created to help make OpenStack easier to operate at scale. This will benefit the OpenStack community in general, and more specifically Rackspace customers who are consuming our public cloud and/or our private cloud.