Welcome to part 3 of this blog series, “Preparing for the big event”. It’s been a while since posts one and two where we explored Auto Scaling, enhancing database performance, caching strategies and CDNs. Black Friday is almost upon us, so it’s an appropriate time to revisit this topic. In this post, we’re going to get a little more technical and explore ‘right-sizing’ your infrastructure to ensure that your cost to performance ratio is optimal. We’ll also look at some enhancements you can make to your EC2 instances to squeeze as much performance out of them as possible.
Right-sizing is the exercise of evaluating each component of your application’s infrastructure and determining whether the chosen instance size or resource is sufficient for the demand placed on it. This can be done from a cost optimisation perspective, a performance optimisation perspective, or a balance of the two. As we’re discussing maximising the performance of your application, we’ll focus on optimising each component for performance.
Earlier in this series we identified application and infrastructure bottlenecks through load testing and monitoring. Hopefully by now you’ve gone through those exercises and made improvements where possible. This is a necessary step before any right-sizing exercise. If you haven’t attempted to improve the performance at the application level first, right-sizing can result in spending money unnecessarily, and bottlenecking issues you thought had been fixed resurfacing over time. It is always best to solve a bottleneck at the application level first, as attempting to out-scale your issue with system resources (CPU, RAM, etc.) will only be a temporary solution.
Before beginning your right-sizing activity, you’ll ideally have at least a month’s worth of CloudWatch data on which to base decisions.
Virtual machines can be used for a myriad of purposes and workloads. This broad spectrum of uses means there’s an equally broad range of usage patterns to evaluate. Some workloads will heavily utilise CPU resources, whereas others will require significant amounts of RAM. The guiding principle here is to identify bottlenecks on CPU, memory, disk and network, then make decisions based on those metrics and the performance characteristics of your application.
For most workloads, an average of 65% CPU Utilisation across a month would be considered acceptable. If your particular workload exceeds this over a sustained period of time, it may make sense to consider using an instance type that provides more CPU resources. You may even want to consider using a compute optimised instance if the particular workload in question is heavily dependent on CPU resources. Compute optimised instances generally have newer processors available and provide more CPU resources per hourly charge than general purpose instances.
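As a rough sketch of how this guideline could be applied, the snippet below assumes you’ve already exported hourly CPUUtilization averages from CloudWatch into a plain list. The function name, the 24-sample “sustained” window and the sample data are illustrative assumptions; only the 65% figure comes from the guideline above.

```python
from statistics import mean


def cpu_review(samples, threshold=65.0, sustained=24):
    """Summarise CPU datapoints (percentages) against a threshold.

    Reports the overall average and the longest run of consecutive
    samples that stayed above the threshold.
    """
    if not samples:
        raise ValueError("no datapoints supplied")

    longest_run = run = 0
    for value in samples:
        # Extend the current run while above the threshold, else reset it.
        run = run + 1 if value > threshold else 0
        longest_run = max(longest_run, run)

    average = mean(samples)
    return {
        "average": average,
        "longest_run_above": longest_run,
        "consider_resizing": average > threshold or longest_run >= sustained,
    }


# Hypothetical data: a quiet day followed by a busy day of hourly averages.
hourly = [30.0] * 24 + [80.0] * 24
print(cpu_review(hourly))
```

Checking for a sustained run as well as the average matters because a busy day can hide behind a comfortable monthly average, as it does in the sample data here.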
While memory usage is not monitored in CloudWatch by default, it’s relatively easy to install the CloudWatch agent onto your servers so that memory usage can be monitored [1].
For the average application, memory usage should not average above 80% over the course of a month. There are exceptions, however: some database engines pre-allocate memory to the database, resulting in near 100% usage all the time.
For applications that require large amounts of memory, consider using a memory-optimised instance type. As a general rule, make sure your servers have a buffer of at least 20% available memory.
While Amazon does not publish the exact network performance of every instance type, observing the NetworkIn and NetworkOut metrics in CloudWatch can be valuable in determining whether your chosen instance size is being limited by network bandwidth. Look for sustained peak usage. If a metric repeatedly reaches the same peak but never exceeds it, that may be a sign the instance has hit its bandwidth ceiling and that an instance size with more network throughput is needed.
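One way to spot that kind of ceiling in exported datapoints is sketched below. It is a heuristic, not an AWS-provided check: the function name, the 2% tolerance and the 20% share are assumptions you would tune for your own traffic.

```python
def looks_bandwidth_capped(samples, tolerance=0.02, min_share=0.2):
    """Return True if many datapoints sit at (or very near) the same peak.

    samples    -- NetworkIn or NetworkOut values (bytes per period)
    tolerance  -- how close to the peak counts as "at the peak" (2% here)
    min_share  -- fraction of datapoints that must be at the peak
    """
    if not samples:
        return False
    peak = max(samples)
    if peak <= 0:
        return False
    at_peak = sum(1 for s in samples if s >= peak * (1 - tolerance))
    return at_peak / len(samples) >= min_share


# Hypothetical traffic: one spiky series, one that flattens at a ceiling.
spiky = [100, 400, 250, 900, 300, 150, 500, 200, 350, 120]
flat_top = [100, 400, 995, 1000, 998, 1000, 997, 300, 999, 150]

print(looks_bandwidth_capped(spiky))     # False
print(looks_bandwidth_capped(flat_top))  # True
```

A spiky series touches its peak once, whereas a capped series keeps returning to the same value, which is the pattern to look for on the graph.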
ElastiCache comes in two varieties, Redis and Memcached, which behave slightly differently. Optimising these two caching engines is a big topic that could fill a whole article on its own. For the purposes of this post, we’ll look at the CloudWatch metrics and address them in general terms.
Depending on the engine being used, CPU utilisation can indicate different things. Redis, for instance, is a single-threaded application, which needs to be taken into account when observing CPU utilisation. To determine the true value, multiply the CPU utilisation value by the number of cores in your instance. If the result is above 80%, it could indicate insufficient CPU [2].
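The multiplication described above can be sketched as follows. The function names and the sample node are hypothetical, and capping the result at 100% is an added assumption to keep the figure readable.

```python
def redis_effective_cpu(host_cpu_percent, cores):
    """Scale host-level CPU utilisation to the single core Redis runs on."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    # One fully busy core on an N-core host shows up as roughly 100/N percent,
    # so multiply back up by N (capped at 100 as an assumption).
    return min(host_cpu_percent * cores, 100.0)


def redis_cpu_constrained(host_cpu_percent, cores, threshold=80.0):
    """True if the scaled figure exceeds the 80% guideline."""
    return redis_effective_cpu(host_cpu_percent, cores) > threshold


# Hypothetical node: 4 cores reporting 22% CPU at the host level.
print(redis_effective_cpu(22.0, 4))    # 88.0
print(redis_cpu_constrained(22.0, 4))  # True
```

In this example a seemingly idle 22% at the host level translates to 88% of the one core Redis can actually use.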
This is the amount of memory available to the host. Redis will release memory back when not in use; Memcached will not. If this value is consistently low, it could indicate that an instance type with more memory is required. Alternatively, your application’s implementation of ElastiCache may not be optimal. For example, you may not be setting a TTL on records to allow periodic flushing of stale data.
The amount of Swap used by the host – This should always be 0. If you see significant Swap usage, this is an indication that the host has insufficient memory allocated and an instance with more memory should be chosen.
The number of keys that have been evicted due to the MaxMemory limit. If this is above 0, it could indicate that the host has insufficient memory and an instance with more memory should be chosen. Keep in mind, your application should be setting a TTL or expiring records from the cache to prevent evictions.
The number of client connections. This value should remain consistent. If it doesn’t, it could indicate an issue with your implementation of ElastiCache within the application.
Selecting an ElastiCache cluster size
AWS provides a number of options when selecting an instance type for ElastiCache. However, it’s worth pointing out that some of the instance types available are not well suited to ElastiCache. ElastiCache is an in-memory data store designed to provide low-millisecond latency to your applications. With that in mind, memory optimised instances such as the R4 and R5 class are best suited for production workloads.
Instances such as the T2, M3 and M5 class are best suited to non-production or low-demand environments.
Many database performance issues can be traced back to either the database engine configuration or the queries being run against it. Tackling database performance should be done with a holistic view that takes all factors into consideration. With that said, tuning your database engine or improving SQL queries is outside the scope of this article.
On most databases, CPU will vary greatly from 5% up to 100% depending on the traffic to your site, any backups being run, or reporting and business analytics tools in use. Assuming that the high CPU is coming from your application’s typical web traffic and that slow queries [3] have been ruled out, consider resizing your RDS instance if CPUUtilization is above 60-70% during busy periods and your application’s performance is being impacted.
Where high CPU can be attributed to reports, business intelligence or complex queries, consider adding a read-replica. A read-replica will allow resource intensive queries and reporting functions to be run against the read-replica without impacting the application’s performance.
This metric will vary depending upon the database engine in use. Typically SQL Server will consume more memory than an engine such as MySQL, depending on the configuration. Have a buffer of 30% freeable memory as a guideline.
ReadIOPS and WriteIOPS indicate the number of disk I/O operations per second. All database engines will place some demand on the disk I/O as new records are committed to disk and records are looked up. However, some applications are very disk intensive on the database. Evaluating ReadIOPS, WriteIOPS and their correlating ReadLatency and WriteLatency metrics will give an indication if the provisioned disk is not performant enough. Later in this article we will take a closer look at EBS performance and steps you can take to remedy an underperforming EBS volume.
Instance type selection
AWS instance types are split into five categories: General Purpose, Compute Optimised, Memory Optimised, Accelerated Computing and Storage Optimised. Each category contains a number of instance classes, tuned for various workload types. Knowing your workload’s requirements and characteristics will allow you to make intelligent decisions about which instance type is the best fit for your needs based on cost and performance. If you don’t have a good understanding of your workload’s requirements or lack the historical data to make a decision, the General Purpose instance types are the best place to start, as they balance compute, memory, disk and network. However, once you have good historical data to go on, selecting an instance type in one of the other four categories will, in some cases, yield better performance along with cost optimisation.
A note about T2 and T3 instances
T2 and T3 instances have traditionally been seen as a budget instance class, well suited to non-production or less resource-intensive workloads. T2 instances provide a baseline level of CPU performance, with the ability to burst above that baseline through the use of CPU credits [4].
This CPU consumption model has meant that T2 and T3 instances have not been suitable for workloads that consistently require CPU performance above the baseline. However, re:Invent 2017 saw the introduction of T2 Unlimited [5], which lets a T2 instance sustain CPU above its baseline for a small additional per-hour charge, applied for every hour that the CPU is above the baseline once surplus CPU credits have been consumed. Combined with Intel Xeon CPUs capable of clock speeds up to 3.3GHz, this makes a compelling case for the T2 instance when compared with the M4 series. However, it’s important to note that in this scenario, a t2.large may end up being more expensive to run than an m4.large. Understanding how T2 CPU credits and the T2 Unlimited feature work before implementing them in your production environment is critical, both to ensure the performance your application requires and to avoid bill shock. In addition, T2 instances cannot be EBS Optimised and have no dedicated network bandwidth.
Additionally, if T2 or T3 instances are used in an Auto Scaling group without the T2 Unlimited feature enabled, it’s possible to encounter a scenario where the group cannot scale out because the CPU is throttled below the scale-out threshold.
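The scaling trap described above can be illustrated with a deliberately simplified model. The function names, the 20% baseline and the 70% alarm threshold are hypothetical, and real credit accounting is more involved than a single balance check.

```python
def observed_cpu(demand_percent, baseline_percent, credit_balance):
    """CPU the instance can actually report for a given demand.

    Simplified model: with credits left the instance can burst to meet
    demand; with none (and no unlimited mode) it is held at its baseline.
    """
    if credit_balance > 0:
        return min(demand_percent, 100.0)
    return min(demand_percent, baseline_percent)


def scale_out_can_trigger(baseline_percent, threshold_percent, credit_balance):
    """True if a fully loaded instance could still breach the alarm."""
    return observed_cpu(100.0, baseline_percent, credit_balance) > threshold_percent


# Hypothetical group: 20% baseline, scale-out alarm at 70% CPU.
print(scale_out_can_trigger(20.0, 70.0, credit_balance=50))  # True
print(scale_out_can_trigger(20.0, 70.0, credit_balance=0))   # False
```

Once credits run out, the reported CPU can never reach the alarm threshold, so the group stays at its current size even though demand is higher than ever.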
Enhanced networking enables Single Root I/O Virtualization(SR-IOV) on an instance. SR-IOV provides higher performance networking on a virtualized instance by partially bypassing the virtualization layer and accessing the network card of the host system directly. This allows line-rate performance as well as removing reliance on the host’s CPU cycles to virtualize the networking. The end result is higher network performance, lower latency and lower jitter (variability in network performance).
If you’re using a recent Amazon Machine Image (AMI), chances are that Enhanced Networking is already enabled and working. However, it’s worth checking that it’s turned on, even if your application doesn’t have a specific requirement for it, as it costs nothing to enable [6-7].
If your application writes to or reads from disk frequently, for example, in the case of databases or logging infrastructure such as an ELK/Splunk stack, disk performance will be particularly important in optimising your infrastructure. In the old days before General Purpose SSD (GP2) volumes, the only real option was to create a number of EBS volumes and stripe them together in a RAID configuration. Fortunately, this is no longer necessary.
When evaluating EBS performance, the following metrics will help to determine if your EBS setup is performant enough:
- BurstBalance (if your volume is of the GP2 type)
Calculating required IOPS
To determine the required IOPS for a given workload, we’ll first assume that your workload is already in AWS and you have CloudWatch data to work with.
The formula for figuring out how many IOPS to provision is:
IOPS Usage = (VolumeWriteOps + VolumeReadOps) / Time in seconds
You can find VolumeWriteOps and VolumeReadOps in CloudWatch, making sure you select the Sum statistic rather than Average. Set the Period to 15 minutes and graph an Absolute (as opposed to Relative) time range of 15 minutes. Add the Sum values for VolumeWriteOps and VolumeReadOps together, then divide the total by the number of seconds in your time window, in this case 900 seconds (15 minutes).
For a 15-minute (900-second) window, our formula would be:
IOPS Usage = (VolumeWriteOps + VolumeReadOps) / 900 seconds
As an example, suppose we have a 300 GiB volume with the following values:
VolumeReadOps = 2,690,211
VolumeWriteOps = 9,789
IOPS Usage = (2,690,211 + 9,789) / 900 seconds = 3,000 total IOPS
In this scenario, if our volume were a General Purpose SSD (gp2) volume, it would have a baseline of 900 IOPS, burstable to 3,000 IOPS. Our measured IOPS is 3,000, which is the maximum burstable rate. Looking at CloudWatch again, we would expect the value for BurstBalance to be dropping rapidly. Once it reaches zero, this volume will be throttled to its 900 IOPS baseline. It’s also safe to say the actual required IOPS is greater than 3,000, but the application is being constrained by the maximum burstable rate.
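The worked example above is simple enough to script if you want to repeat it across several volumes. This is a minimal sketch; the function name is made up, and the inputs are the Sum values you read off CloudWatch.

```python
def iops_usage(read_ops_sum, write_ops_sum, period_seconds):
    """Average IOPS over a window, from the Sum of read and write ops."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    return (read_ops_sum + write_ops_sum) / period_seconds


# Values from the example: a 15-minute window on a 300 GiB volume.
print(iops_usage(2_690_211, 9_789, 15 * 60))  # 3000.0
```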
If you determine EBS performance is insufficient, you have a number of options:
Provision more storage
EBS Performance is tied to the size of the volume provisioned. For a General Purpose SSD volume type, you get 3 IOPS per GiB of volume size. Additionally, just like T2 CPU credits discussed earlier, EBS volumes can burst beyond their baseline performance by using I/O credits. I/O credits are earned when a volume is performing beneath its baseline performance threshold. Provisioning a volume with more storage will result in a higher baseline performance threshold.
As in our example above, a 300 GiB gp2 volume will give you a baseline of 900 IOPS, burstable to 3,000 IOPS. However, the volume was hitting the maximum of 3,000 IOPS. This is a very clear indication that the volume should be re-created either as a larger volume (to provide more IOPS) or as a Provisioned IOPS volume.
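To get a feel for how volume size affects baseline and burst behaviour, here is a small sketch. The 3 IOPS per GiB and 3,000 IOPS burst figures come from the text above; the 100 IOPS floor and the 5.4 million credit bucket are assumptions taken from the AWS gp2 documentation and should be checked against current limits. The per-volume maximum baseline is ignored for simplicity.

```python
GP2_IOPS_PER_GIB = 3           # from the text above
GP2_BURST_IOPS = 3_000         # from the text above
GP2_MIN_BASELINE_IOPS = 100    # assumption: minimum baseline for small volumes
GP2_CREDIT_BUCKET = 5_400_000  # assumption: size of a full I/O credit bucket


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return max(GP2_MIN_BASELINE_IOPS, size_gib * GP2_IOPS_PER_GIB)


def gp2_burst_seconds(size_gib):
    """Seconds a full credit bucket lasts at the maximum burst rate."""
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return None  # baseline already meets the burst rate, nothing to burst to
    # Credits drain at the burst rate but refill at the baseline rate.
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


for size in (100, 300):
    print(size, gp2_baseline_iops(size), gp2_burst_seconds(size))
```

Under these assumptions, tripling the volume from 100 GiB to 300 GiB triples the baseline but extends a full-rate burst by well under half, so a workload that sits at 3,000 IOPS will still exhaust its credits.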
Utilise a Provisioned IOPS volume (io1)
A Provisioned IOPS (io1) volume allows you to create a volume and specify a consistent IOPS rate, unlike the burstable performance of gp2 volumes. Provisioned IOPS volumes also have double the maximum IOPS per volume (up to 20,000), making them a better choice for I/O-intensive workloads. However, keep in mind that sizing a General Purpose volume for your performance requirements, rather than for the storage you need, may sometimes be cheaper than a Provisioned IOPS volume with less storage and the required IOPS. A cost comparison should be done before committing to a particular EBS volume type.
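A cost comparison along these lines could be sketched as follows. All prices in the snippet are placeholders for illustration only, and the function names and workload figures are hypothetical; substitute the current pricing for your region before drawing any conclusions.

```python
import math


def gp2_size_for_iops(required_iops, iops_per_gib=3):
    """Smallest gp2 size (GiB) whose baseline meets the required IOPS."""
    return math.ceil(required_iops / iops_per_gib)


def gp2_monthly_cost(size_gib, price_per_gib):
    return size_gib * price_per_gib


def io1_monthly_cost(size_gib, provisioned_iops, price_per_gib, price_per_iops):
    return size_gib * price_per_gib + provisioned_iops * price_per_iops


# Placeholder prices (per GiB-month and per provisioned IOPS-month).
GP2_PRICE_PER_GIB = 0.10
IO1_PRICE_PER_GIB = 0.125
IO1_PRICE_PER_IOPS = 0.065

# Hypothetical workload: 200 GiB of data needing a steady 3,000 IOPS.
storage_needed_gib = 200
required_iops = 3_000

# gp2 must be over-provisioned on size to reach the IOPS target.
gp2_size = max(storage_needed_gib, gp2_size_for_iops(required_iops))
gp2_cost = gp2_monthly_cost(gp2_size, GP2_PRICE_PER_GIB)
io1_cost = io1_monthly_cost(
    storage_needed_gib, required_iops, IO1_PRICE_PER_GIB, IO1_PRICE_PER_IOPS
)

print(f"gp2: {gp2_size} GiB at {gp2_cost:.2f} per month")
print(f"io1: {storage_needed_gib} GiB at {io1_cost:.2f} per month")
```

With these placeholder numbers the oversized gp2 volume comes out cheaper, but the balance shifts as the required IOPS grows relative to the storage you need, which is why the comparison is worth running for each workload.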
Provision an EBS Optimised instance
As EBS volumes are essentially network attached storage, they share network bandwidth between the network layer and storage layer. This can cause bandwidth exhaustion and either storage or network performance issues when demand on either the network or the storage of an instance is high. An EBS Optimised instance separates out the network and volume traffic in order to prevent contention and provide maximum performance.
Note: Depending on the instance type, EBS Optimisation may be enabled by default, whereas some instance types do not offer EBS Optimisation at all. Refer to the AWS documentation for further information [8].
Use instance store volumes
Instance store volumes are locally attached, ephemeral disks. The ephemeral nature of an instance store volume means that all data on the disk will be lost in the event the instance is stopped or terminated. This means you must make regular backups of any important data stored there. However, if your application has high disk IO requirements for non-critical data, for example, buffers, on-disk caches or swap memory, using one or more instance store volumes can be a good option, as they provide higher performance than a network attached EBS volume.
However, it should be noted that instance store volumes can only be used if they are specified when the instance is first launched, and they are only available on certain instance types. Also, keep in mind that you will need to bootstrap or script the formatting and mounting of the instance store volume, as each time the machine launches you will be presented with an unmounted volume with no filesystem or data.
To get the best disk performance possible, consider combining Provisioned IOPS volumes with EBS optimised instances. Alternatively, consider using instance store volumes for specific tasks with high IO requirements on non-critical data.
In this blog post, we’ve covered right-sizing your infrastructure, enabling enhanced networking and selecting the best storage option for your workload. In the fourth and final instalment, we will look at some final steps you can take to prepare your application for a big event including how to get Amazon directly involved.