Upgrades. Few IT activities are as important yet heartburn-inducing as upgrading a major system.
Upgrades often mean service disruption, both expected and unexpected, which can hurt the business. The cause is frequently unreliable software, a lack of operational expertise, or both.
I’ve worked with and for enterprise customers who had policies of not upgrading until a particular piece of software or driver had been out for at least two years. I know customers with policies of never upgrading software unless they have no choice. Whatever the reasons, these enterprises have often adopted a risk avoidance stance on upgrades, summed up as “if it isn’t broke, don’t fix it.”
OpenStack has by no means been immune to the upgrade conundrum. The community has prided itself on being able to develop software quickly, as evidenced by the project’s cadence of two major releases per year. But upgrading each time has often been painful, with many earlier releases requiring a complete reinstall as the “upgrade” process. While there have been significant improvements, upgrading OpenStack is still not a task to take on lightly, especially without significant operational expertise.
Rackspace has always been committed to making operational IT tasks, like upgrades, as painless as possible for customers. Our goal is that upgrade events should be something customers can forget about because Rackspace has them covered.
In particular, we’ve worked with Red Hat to combine the advances Red Hat OpenStack Platform has made around software reliability and non-disruptive upgrades with Rackspace’s operational expertise. The end result is that Rackspace Private Cloud powered by Red Hat, aka RPC-R, is an innovative yet reliable OpenStack platform for enterprises to consume.
We rely heavily on the work Red Hat has put into improving the OpenStack upgrade process through the TripleO project, which leverages a small OpenStack “undercloud” to manage a multi-node OpenStack “overcloud.” The overcloud is the OpenStack deployment users access to provision resources and run workloads.
All RPC-R deployments rely on the undercloud, which consists of a director node and a logging node. The director node uses IPMI, PXE, Puppet and OpenStack Heat to provision bare metal and then deploy and configure the overcloud. The director is then used to manage the lifecycle of the overcloud, which includes expanding the overcloud, package updates and both minor and major version upgrades.
Red Hat has done an amazing job of making OpenStack upgrades easier than ever, and now Rackspace has leveraged our years of experience and expertise managing OpenStack to fine-tune and automate the process further. The difference between a successful upgrade and a failed one is often not just software, but the operations behind the upgrade. For RPC-R customers, we take additional steps to prepare their environments for the process.
The technical details in this blog post have been written with invaluable contributions from Manuel Rodriguez and Allie Barnes. They and the rest of the RPC-R support team at Rackspace are literally the brains behind this operation.
A typical RPC-R upgrade will follow this process:
Every RPC-R major upgrade is planned out by a member of the Rackspace support team and then quality checked by another member. Once the maintenance plan is internally approved, Rackspace works with customers to ensure the proposed plan meets their expectations. Once the customer signs off, we schedule a maintenance window for the upgrade.
- Temporarily suppress monitoring alerts to prevent unnecessary alarm notifications.
- Take a backup of the Galera database.
- Check that all OpenStack services and relevant OS services are running and operational.
- Get a list of all VMs and volumes running on the overcloud.
- Ensure all endpoints are responsive.
- Update repositories from previous to current version.
- Upgrade the RHEL OS on the director node.
- Upgrade the undercloud, which will trigger an update of the director node configuration and populate any new required settings that may not be present from the last version (this operation will not delete any data from the director or overcloud nodes).
- Once the undercloud upgrade is complete, the current overcloud images are backed up.
- The overcloud images on the director are replaced with the new ones.
- From the director, the overcloud is upgraded using the commands outlined in the Red Hat OpenStack Platform upgrade guide, which essentially use Heat templates to trigger Puppet manifests, along with additional upgrade environment scripts Rackspace has written to further automate the process.
- Any object storage nodes will be updated first.
- Controller nodes will then be upgraded with a script that will take care of upgrading the high availability tools on the controllers (such as Pacemaker).
- Rolling upgrades are performed on the compute nodes by migrating off guest instances, disabling the nova-compute service, running the compute node update script, re-enabling the nova-compute service, and then repeating the workflow for each compute node.
- Ceph nodes will be upgraded last with the same script run on all non-controller nodes.
- To finalize the upgrade, one final command is run with a major upgrade playbook and all of the individual environment files used in the initial deployment command.
- OpenStack Tempest tests are run to validate the upgrade.
- Once the overcloud is upgraded and Tempest tests are completed, Rackspace will confirm that normal operations have resumed as expected in the overcloud.
- When RPC-R Support has confirmed the environment is in a healthy operational state, alarm suppression is canceled, which re-enables monitoring services.
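The overcloud portion of the steps above follows a fixed role order, with compute nodes handled one at a time. As a rough illustration only, the ordering can be sketched as a small plan generator in Python. This is not the tooling Rackspace or Red Hat actually uses: the role names, action labels and the `build_overcloud_plan` function are hypothetical, and the sketch assumes nodes within a non-compute role are simply listed one by one, which the text does not specify.

```python
from typing import Dict, List, Tuple

# A step is (action, node). Both values are illustrative labels,
# not real commands from the Red Hat upgrade guide.
Step = Tuple[str, str]

# Role order as described in the steps above: object storage first,
# then controllers, then compute nodes, and Ceph nodes last.
ROLE_ORDER = ["object-storage", "controller", "compute", "ceph"]

# Per-node workflow for a rolling compute upgrade, as described above.
COMPUTE_STEPS = [
    "migrate-instances-off",
    "disable-nova-compute",
    "run-update-script",
    "enable-nova-compute",
]


def build_overcloud_plan(nodes_by_role: Dict[str, List[str]]) -> List[Step]:
    """Return the ordered list of steps for an overcloud upgrade.

    Roles missing from the inventory are skipped, so a deployment
    without object storage or Ceph nodes still yields a valid plan.
    """
    plan: List[Step] = []
    for role in ROLE_ORDER:
        for node in nodes_by_role.get(role, []):
            if role == "compute":
                # Finish the full workflow on one compute node
                # before touching the next one.
                plan.extend((action, node) for action in COMPUTE_STEPS)
            else:
                plan.append((f"upgrade-{role}", node))
    return plan


if __name__ == "__main__":
    # Hypothetical inventory; the host names are made up.
    inventory = {
        "controller": ["ctrl-0", "ctrl-1", "ctrl-2"],
        "compute": ["cmp-0", "cmp-1"],
        "ceph": ["ceph-0"],
    }
    for action, node in build_overcloud_plan(inventory):
        print(f"{action:24} {node}")
```

Generating the plan as plain data before anything runs mirrors the review step described earlier: a second engineer can read and check the ordered list before the maintenance window opens.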
Making upgrades simple enough that customers can forget how complicated they are takes reliable software, automation and sound operational practices. With RPC-R, Rackspace and Red Hat are delivering a private cloud service that customers can consume as they would a public cloud service, without the operational heartburn.