The Control Group CTO: We Needed Performance and Scalability

Today’s guest post is by Shiem Edelbrock, Chief Technology Officer of The Control Group, a San Diego-based tech company whose projects include Instant Checkmate, Reverse Phone and NextGen Leads.

As the CTO of The Control Group, I manage our flagship product, Instant Checkmate, which allows users to perform background checks on just about anybody. We chose ElasticSearch early on to ensure we could provide real-time search results to our users. ElasticSearch is incredibly fast and easy to use, but it can be very resource-intensive.

We needed a managed host that could grow with us, so we went with Rackspace. I knew their dedicated virtual model could give us guaranteed performance. And at reasonable traffic loads, Instant Checkmate could scale up and down as needed. However, nobody could have foreseen the tsunami of traffic it would soon be asked to handle.

Our marketing department sold Instant Checkmate so well that we soon had far more traffic than our infrastructure could support. The way our app was designed, we couldn’t scale up, so we had to scale out, which meant adding more infrastructure in real time. That’s when we ran into the disadvantage of dedicated virtual: limited elasticity. It was certainly powerful, but we needed to add servers within seconds of a traffic spike, not weeks.

Dedicated virtual couldn’t scale fast enough, so we decided to refactor chunks of our app into smaller SOA micro-services and deploy them to the multi-tenant public cloud. The Rackspace cloud provided incredibly powerful tools such as elastic block storage, auto-scaling and object storage, which allowed us to be increasingly dynamic as we moved to the cloud.

That worked well … for a while.

The public cloud is a multi-tenant environment. Even a public cloud infrastructure as advanced as the one Rackspace provides is still susceptible to “noisy neighbor” syndrome. These slightly varying levels of host performance usually go unnoticed, but a resource-intensive application like ElasticSearch doesn’t like to compete for I/O, especially at scale. When we reached 25,000 shards, the threads on the master nodes just started dying on us. It was a constant battle to keep the cluster healthy.

All of this put us in a Catch-22. Dedicated virtual gave us the performance, but couldn’t scale fast enough. On the other hand, the public cloud could scale, but couldn’t give us the guaranteed performance we needed. I was so frustrated I considered dropping Rackspace — but then they proved why they’re the leader in this space: Fanatical Support combined with a brand new Rackspace product offering.

Rackspace assigned us a brilliant support team that gave us solution ideas we hadn’t considered before. They suggested a machine learning algorithm with predictive modeling that sits behind the Rackspace auto-scaling API, allowing us to create a self-aware system that could provision new infrastructure within seconds of a traffic spike, as well as programmatically heal any failures.
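To make the idea of predictive scaling concrete, here is a minimal sketch in Python. It is not the model Rackspace proposed, which this post does not describe. It swaps in Holt's linear exponential smoothing as a stand-in forecaster, and every name and number in it (the function names, the per-server capacity, the headroom, the server limits) is a hypothetical assumption. It only computes a target server count and deliberately makes no calls to any auto-scaling API.

```python
import math
from typing import Sequence


def forecast_next(samples: Sequence[float], alpha: float = 0.5, beta: float = 0.3) -> float:
    """One-step-ahead traffic forecast using Holt's linear smoothing.

    `samples` are recent requests-per-second readings, oldest first.
    `alpha` smooths the level and `beta` smooths the trend (illustrative values).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples to estimate a trend")

    # Initialise the level from the first sample and the trend from the first step.
    level = samples[0]
    trend = samples[1] - samples[0]

    for x in samples[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend

    # Traffic cannot be negative, so clamp the forecast at zero.
    return max(0.0, level + trend)


def desired_servers(
    forecast_rps: float,
    capacity_per_server: float,  # assumed requests/sec one server can handle
    headroom: float = 0.3,       # assumed safety margin above the forecast
    min_servers: int = 2,
    max_servers: int = 50,
) -> int:
    """Translate a traffic forecast into a clamped target server count."""
    needed = math.ceil(forecast_rps * (1 + headroom) / capacity_per_server)
    return max(min_servers, min(max_servers, needed))


if __name__ == "__main__":
    recent = [100, 200, 300, 400]  # made-up readings showing a steady ramp
    predicted = forecast_next(recent)
    print(desired_servers(predicted, capacity_per_server=100))
```

The point of forecasting rather than reacting is that the target count is computed from where traffic is heading, so new servers can be requested before the spike fully arrives. The minimum and maximum bounds keep a bad forecast from scaling the fleet to zero or without limit.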

Then they brought out the trump card: OnMetal by Rackspace.

OnMetal’s single-tenant bare-metal servers totally bridged the gap for us. We get the guaranteed performance of dedicated virtual with the fast scalability of the public cloud. And because OnMetal uses the same OpenStack API as the Rackspace public cloud, we had it up and running almost immediately. The existing tools we had so painstakingly built to manage our resources on the public cloud worked seamlessly out of the box. Finally, we could provision new server instances within seconds of a spike with guaranteed performance while avoiding noisy neighbors.

Long story short, we took their advice, implemented their machine learning algorithm and let Rackspace handle the plumbing behind the scenes. Before OnMetal, our average page load time across sites was around 1 second, which is pretty slow. Now, all of our sites load in under 500 milliseconds. Our data center utilization is also far better. Before OnMetal, we were hovering around 20 percent. Now, we’re at 65 percent, and we’ll probably get up to 70 percent when all is said and done.

With OnMetal, we got scalability and guaranteed performance. That’s when I truly understood why you pay a premium for Rackspace managed infrastructure. It’s not the commodity hardware that Rackspace offers; it’s the service that goes along with it.

And because they’re managing the infrastructure, my team can focus on what it does best — developing apps that deliver real business value.
