Pre-cloud and pre-DevOps, things were much simpler. Sys Admins had a clearly defined role. If your boss was upset that the application was slow, you’d log into the server, check top and memory consumption, narrow it down per process, maybe run an strace, and continue troubleshooting until you hopefully discovered the cause.
There were layers that Ops supported and layers that were developer territory. In Ops, we used to say, “We support everything up to and including the OS layer. App layer? You’re on your own.”
It came down to: “You have your job, I have mine.”
This was the world we lived in until the cloud turned things upside down.
Yesterday, at Rackspace, 100 Sys Admins were working in 100 terminals in 100 separate environments. Today, our team motto is “If you’re logging into servers, you’re doing it wrong.” Environments are built from the ground up in Chef using wrapper cookbooks that are assembled from reusable core cookbooks. We’ve traded in our terminal SSH sessions for GitHub.
I’ve been working as a Linux Admin and Ops Engineer since 2000. From 2000 until coming to Rackspace in 2008, I was flying solo, with a pager tied directly to my monitoring systems.
When I made it to Rackspace in 2008, I was plunged into a thriving Linux community and I learned more in 12 months than I had in the previous eight years combined.
As my learning accelerated, I became more and more frustrated with the technical limitations of my role. For one thing, I was typing the same commands over and over again, in the same order, every time I faced a similar problem.
Some friends and I started working on a system called Quicksnips that used dmenu, xclip, and xdotool to read from a collection of one-liners and scripts that we stored in git. This enabled us to more quickly perform the same Ops tasks over and over again, and share learnings.
This got us part of the way there, but when it came time for me to apply a kernel upgrade to more than 10,000 servers and reboot them to fix a vulnerability, a great deal of manual work was involved. Not to mention the stress of kicking off a job that would trigger thousands of upgrades at a time.
Then came the cloud, DevOps, and a whole suite of new automation tools.
At Rackspace, we felt the impact of this seismic event in many ways. After all, we made the transition from a pure service company to a company that released 18 products in 18 months. I still remember the first time Product Developers outnumbered the Sys Admins and Ops Engineers in our Austin office.
Suddenly, the familiar Linux tools that had been in use for decades were not enough.
Sys Admins and Ops Engineers learned how to code. Developers learned how to deploy API-driven infrastructure. Both groups met in the middle and started helping each other out, and Product, Ops, and Support merged.
Then, at Rackspace, we decided to take the skills we learned from these merged teams and put them to use for customers.
We asked ourselves: What would it look like to build out a support offering where, instead of logging into servers to make changes, environments were managed entirely through Chef? This question led to the creation of a new DevOps Automation Service where each customer was its own Chef organization.
We chose Chef because of its maturity, the ecosystem, and the expertise we had in-house.
When faced with the challenge of supporting many customer environments with one team of engineers, we decided to adopt a model of using small reusable cookbooks, which we call “core cookbooks.”
We forked the core cookbooks from the community cookbooks in order to ensure tight standards were in place. This was important for us, since we have around 60 Rackers supporting customer environments 24/7, and they need to be able to jump in and understand a customer environment quickly.
We piece together the core cookbooks into wrapper cookbooks, which contain the core cookbooks in addition to customer-specific logic. We call these wrapper cookbooks “Rolebooks” and they have the added advantage of being tracked in source control.
We adopted a continuous integration (CI) pipeline for testing our core cookbooks, which includes Jenkins as the CI server, Rake for workflows, RuboCop as a linter, Foodcritic for Chef syntax, ChefSpec for unit tests, and Test Kitchen with Serverspec for functional server tests.
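To make the ordering of those checks concrete, here is a minimal Python sketch of a fail-fast pipeline runner. It is an illustration only, not our actual Jenkins or Rake configuration: the stage names, the exact command lines, and the `run_pipeline` helper are all assumptions, chosen so that cheap lint checks run before slower functional tests.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

# Hypothetical stage list: the tools come from the pipeline described above,
# but the command lines and stage names are assumptions for illustration.
STAGES: List[Tuple[str, List[str]]] = [
    ("ruby lint", ["rubocop"]),
    ("chef lint", ["foodcritic", "."]),
    ("unit tests", ["rspec"]),  # ChefSpec examples run under RSpec
    ("functional tests", ["kitchen", "test"]),
]


def run_command(command: Sequence[str]) -> int:
    """Run one command in the current directory and return its exit code."""
    return subprocess.run(list(command)).returncode


def run_pipeline(
    stages: Sequence[Tuple[str, Sequence[str]]] = STAGES,
    runner: Callable[[Sequence[str]], int] = run_command,
) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first one that fails.

    Returns (all_passed, names_of_stages_that_ran).
    """
    executed: List[str] = []
    for name, command in stages:
        executed.append(name)
        if runner(command) != 0:
            return False, executed
    return True, executed
```

Calling `run_pipeline()` from a cookbook directory would run the real tools if they are installed. Passing a different `runner` makes it easy to dry-run the ordering without touching any of them.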
We also built out our preferred monitoring toolkit using Rackspace Cloud Monitoring for alerting, New Relic for performance measurement, and StatsD/Graphite for application metrics.
Finally, all of the cookbooks, and any other automation we build out for customers, are stored in private repositories on GitHub. We create GitHub teams for each customer, and invite their Dev and Ops employees to these teams.
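For readers who have not used StatsD, the sketch below shows roughly what emitting an application metric looks like. It relies only on the plain-text StatsD line format (`name:value|type`) sent over UDP. The metric names, the `format_metric` and `send_metric` helpers, and the host and port defaults are assumptions for illustration, not part of our toolkit.

```python
import socket


def format_metric(name: str, value: int, metric_type: str = "c") -> bytes:
    """Build one StatsD line: "c" is a counter, "ms" a timer, "g" a gauge."""
    return f"{name}:{value}|{metric_type}".encode("ascii")


def send_metric(
    name: str,
    value: int,
    metric_type: str = "c",
    host: str = "127.0.0.1",  # assumed: a StatsD daemon on the local machine
    port: int = 8125,         # StatsD's conventional default UDP port
) -> None:
    """Send a single metric as a fire-and-forget UDP datagram."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_metric(name, value, metric_type), (host, port))
```

An application might call `send_metric("deploys.completed", 1)` after each deploy. Because the transport is UDP, the call does not wait for the metrics server, so an unavailable server does not slow the application down.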
Along the way, we encountered some challenges. If you find yourself in the business of writing Chef code full time for many customers and many organizations, here are some things to keep in mind:
Rackspace specialists are gearing up to meet all of you DevOps enthusiasts next week at ChefConf 2014 in San Francisco (April 15 to April 17). Racker Ryan Richard (@rackninja) and I (@mattjbarlow) will be talking about managing many customers, many chefs and tons of cookbooks on Wednesday, April 16 at 3:15 p.m. Come join us! And here’s the full list of what Rackspace specialists have planned at ChefConf 2014.
Also, stop by the Rackspace booth to talk to us about DevOps or just to say hello. Don’t forget to grab one of the awesome DevOps t-shirts we are giving away.