How to Synchronize Terminated AWS Instances with Chef, New Relic and Other SaaS



As we launch Fanatical Support for AWS, we’re publishing a series of posts from some of our top AWS-certified experts to help you understand our offering and get started. Here, DevOps engineers Jim Rosser and Martin Smith explain in detail some of the ways Fanatical Support for AWS can assist with scaling up and down for spikes in website traffic.

As cloud services for monitoring, configuration management and backup become ubiquitous, they often require a software agent to run on your servers, and that agent usually "registers" each server back to the corresponding service.

Many architects immediately recognize that there is no easy way to uninstall or de-register these agents — there’s no simple way to specify a script to run at instance termination like launch configurations do at instance creation. Often, this terminate step is required to decommission or suppress monitoring alerts, purge backups or stop configuration management batch processes altogether.

Some simple options for handling orphaned agents — from Chef or New Relic, for example — should be considered first. One option is to simply not run the agent. For Chef, this could mean something as simple as using Chef Solo (agentless), AWS OpsWorks’ automatic de-registration or choosing a tool like Ansible that doesn’t require an agent.

Another option might be to install an init (shutdown) script on every instance that runs when the instance is about to terminate. However, the instance must terminate within 30 seconds or it will be destroyed regardless, which can present timing issues, and such a script may not clean up all of the Chef data either (e.g. chef-vault items).

Finally, you could simply rely on third-party services like Chef or New Relic to clean up the data about the terminated instance (or clean it up yourself if you run your own Chef server or monitoring service). However, if you're using real-time monitoring, you'll be creating many false-positive alerts, and if orphaned agents cost you money, leaving those registrations around means missing an opportunity to control cost.

As with many AWS products, Auto Scale events can trigger a notification in the Simple Notification Service (SNS). AWS has also introduced Auto Scale lifecycle hooks, which allow your application to control an instance's transition from the "Terminating:Wait" state to "Terminating:Proceed."

These two mechanisms are elegant, but they will likely incur additional cost and require a considerable amount of additional scaffolding to be built by the user. That could mean extra code and instances to handle SNS/SQS messages or lifecycle hooks, as well as a process for cleaning up any missed events.
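To give a sense of that scaffolding, here is a minimal sketch, assuming aws-sdk-go and hypothetical hook and group names, of the call a cleanup worker would make to release an instance that a termination lifecycle hook is holding in Terminating:Wait:

```go
// Minimal sketch: completing a termination lifecycle hook with aws-sdk-go.
// The group and hook names below are hypothetical placeholders.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Once decommission work is done, tell Auto Scale to proceed with the
	// termination; until this call (or the hook's timeout), the instance
	// stays in the Terminating:Wait state.
	_, err := svc.CompleteLifecycleAction(&autoscaling.CompleteLifecycleActionInput{
		AutoScalingGroupName:  aws.String("web-asg"),             // hypothetical group name
		LifecycleHookName:     aws.String("decommission-hook"),   // hypothetical hook name
		InstanceId:            aws.String("i-0123456789abcdef0"), // from the hook notification
		LifecycleActionResult: aws.String("CONTINUE"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```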

Finally, your best option might be AWS Lambda.

According to Amazon, Lambda is “a compute service that runs your code in response to events and automatically manages the compute resources for you, making it easy to build applications that respond quickly to new information.”

Lambda obviates the need for you to manage additional infrastructure in AWS, while still giving you the flexibility to respond to SNS messages about terminated instances.

We propose using SNS messages submitted by Auto Scale to trigger Lambda, which in turn will perform any decommission procedures after the instance has been terminated. You could also add another round trip of SNS messages, using Auto Scale Lifecycle Hooks, to prevent instance termination until Lambda’s tasks are complete. We opted to terminate first and do the cleanup afterwards for this example.

In our search for a solution, one thing we quickly realized is that code shipped to Lambda can only be written in NodeJS or Java. Not every developer is familiar with these languages, and depending on which service or API you're communicating with, an SDK, or even an established best practice, may not exist for NodeJS or Java.

For instance, we found that there are multiple libraries for communicating with New Relic's REST API, of varying quality and support. Because our team isn't maintaining any NodeJS or Java applications right now, this seemed like a big hurdle until we saw that many folks are using Go with Lambda.

This feels like a natural fit for Lambda, as Go was designed with concurrent execution in mind. Its binaries and dependencies are self-contained, so we can upload a single package to Lambda containing a NodeJS wrapper, the Go binary and our code. We even found projects that maintain the NodeJS wrapper, so one might automatically build, package and upload the whole thing to AWS Lambda. We haven't tried that just yet.
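To make that flow concrete, here is a minimal sketch of the Go side, assuming the NodeJS wrapper pipes the raw Lambda event JSON to the binary on stdin; the SNS envelope and Auto Scale notification field names below follow the standard payloads, but verify them against your own messages:

```go
// Minimal sketch: extract the terminated instance ID from the Auto Scale
// notification that SNS delivers to Lambda. Assumes the event JSON arrives
// on stdin via the NodeJS wrapper.
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"os"
)

// lambdaEvent models just the parts of the SNS-to-Lambda envelope we need.
type lambdaEvent struct {
	Records []struct {
		Sns struct {
			Message string `json:"Message"`
		} `json:"Sns"`
	} `json:"Records"`
}

// autoScaleNotification models the JSON embedded in the SNS message body.
type autoScaleNotification struct {
	Event         string `json:"Event"`
	EC2InstanceID string `json:"EC2InstanceId"`
}

func main() {
	var event lambdaEvent
	if err := json.NewDecoder(os.Stdin).Decode(&event); err != nil {
		log.Fatalf("decoding Lambda event: %v", err)
	}
	for _, record := range event.Records {
		var n autoScaleNotification
		if err := json.Unmarshal([]byte(record.Sns.Message), &n); err != nil {
			log.Printf("skipping unparsable SNS message: %v", err)
			continue
		}
		if n.Event != "autoscaling:EC2_INSTANCE_TERMINATE" {
			continue // ignore launch and test notifications
		}
		fmt.Printf("decommissioning %s\n", n.EC2InstanceID)
		// ...perform decommission procedures (New Relic cleanup, etc.) here...
	}
}
```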

We’re especially concerned about un-registering server agents when instances terminate, as they can become alerts, disturb on-call staff, impact SLA calculations and much more. While the New Relic Server Agent automatically registers itself when data starts flowing from new instances, the tradeoff is that you, “cannot delete an entity whose color-coded health status is red, yellow, or green.”

What's more, it may take a few minutes between when an agent stops reporting and when the New Relic entity can be deleted; we learned from a recent Rackspace Office Hours Hangout that this is known at New Relic as "the gray server problem." So we set out to write some code in Go, triggered by an "instance terminated" SNS message, that can delete a server entity from New Relic.
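Here is a minimal sketch of that delete call, assuming New Relic's v2 REST API (DELETE /v2/servers/{id}.json with an X-Api-Key header) and assuming you already know the numeric server entity ID, which, as we explain next, turns out to be the hard part:

```go
// Minimal sketch: delete a New Relic server entity via the v2 REST API.
// Assumes an admin API key in the NEWRELIC_API_KEY environment variable
// and a known server entity ID.
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"
	"strconv"
)

func deleteServerEntity(serverID int) error {
	url := fmt.Sprintf("https://api.newrelic.com/v2/servers/%d.json", serverID)
	req, err := http.NewRequest(http.MethodDelete, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("X-Api-Key", os.Getenv("NEWRELIC_API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		// New Relic refuses to delete entities whose health status is still
		// red, yellow or green, so retrying after a delay is usually needed.
		return fmt.Errorf("New Relic returned %s for server %d", resp.Status, serverID)
	}
	return nil
}

func main() {
	if len(os.Args) < 2 {
		log.Fatal("usage: delete-server <new-relic-server-id>")
	}
	id, err := strconv.Atoi(os.Args[1])
	if err != nil {
		log.Fatalf("invalid server ID: %v", err)
	}
	if err := deleteServerEntity(id); err != nil {
		log.Fatal(err)
	}
}
```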

We quickly discovered that we actually don’t have any way to match terminated AWS instance IDs and New Relic server entity IDs. This means that when our Go function runs in Lambda, we have no way of knowing what New Relic Server entity to actually delete. And because the instance is terminated, we can’t rely on using the instance metadata service from within the instance itself, either.

To solve this, we opted to have a post-creation hook in Lambda that stores some basic metadata about new servers in a DynamoDB table. This lets us look up the hostname registered in New Relic from the instance ID provided in the SNS message after the server terminates.
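Here is a minimal sketch of both halves of that mapping using aws-sdk-go; the table name and attribute names are hypothetical stand-ins for whatever your post-creation hook actually writes:

```go
// Minimal sketch: record instance metadata at launch and read it back at
// termination. The "instance-metadata" table and its attributes are
// hypothetical; the table is keyed on InstanceId.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

const tableName = "instance-metadata" // hypothetical table name

// recordInstance is called by the post-creation hook when an instance launches.
func recordInstance(svc *dynamodb.DynamoDB, instanceID, hostname string) error {
	_, err := svc.PutItem(&dynamodb.PutItemInput{
		TableName: aws.String(tableName),
		Item: map[string]*dynamodb.AttributeValue{
			"InstanceId": {S: aws.String(instanceID)},
			"Hostname":   {S: aws.String(hostname)},
		},
	})
	return err
}

// hostnameForInstance is called by the termination handler with the
// EC2InstanceId taken from the SNS message.
func hostnameForInstance(svc *dynamodb.DynamoDB, instanceID string) (string, error) {
	out, err := svc.GetItem(&dynamodb.GetItemInput{
		TableName: aws.String(tableName),
		Key: map[string]*dynamodb.AttributeValue{
			"InstanceId": {S: aws.String(instanceID)},
		},
	})
	if err != nil {
		return "", err
	}
	attr, ok := out.Item["Hostname"]
	if !ok || attr.S == nil {
		return "", fmt.Errorf("no hostname recorded for %s", instanceID)
	}
	return *attr.S, nil
}

func main() {
	svc := dynamodb.New(session.Must(session.NewSession()))
	if err := recordInstance(svc, "i-0123456789abcdef0", "web01.example.com"); err != nil {
		panic(err)
	}
	host, err := hostnameForInstance(svc, "i-0123456789abcdef0")
	if err != nil {
		panic(err)
	}
	fmt.Println(host)
}
```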

We chose this method because it allows multiple services to make use of the metadata; for example, the same data could be used to remove nodes from a configuration management service such as Chef or Salt. Another option would be configuring the New Relic agent to use the AWS instance ID as the hostname, which would let you forgo the DynamoDB table, but it would also mean your New Relic server hostnames couldn't be anything other than instance IDs.

In conclusion, Auto Scale offers many cost and capacity benefits, but care must be taken to clean up all of the additional resources that are provisioned when Auto Scale creates new instances. You might avoid agents entirely, use inside-the-instance shutdown hooks, prevent termination using Auto Scale lifecycle hooks or rely on third parties to clean up their own data, but the best solution, one that allows all resources to be de-provisioned as quickly as possible, is likely a combination of SNS topic messages and Lambda. While there are some issues to contend with in maintaining instance metadata beyond termination, we suspect this solution is the least costly and most robust.

Visit http://www.rackspace.com/aws for more information about Fanatical Support for AWS and how it can help your business. And download our free white paper Best Practices for Fanatical Support for AWS, which offers further detail regarding implementation options for AWS, including identity management, auditing and billing.