Often, a big media event can take your operations team by surprise. Sometimes, if you’re lucky, the marketing department will give you a few days’ notice. They might assume that because the application hums along nicely today, it should be able to handle drastically increased traffic without any major planning or preparation.
Or, perhaps they’ve been told that, because it’s hosted “in the cloud”, the website can scale infinitely to match any onslaught of traffic. The bottom line is, architecting, scaling and planning are often oversimplified or misunderstood by non-technical stakeholders.
In this blog series, I will explore the various actions an organization should take when preparing for a big event, whether it’s expected, such as a marketing campaign or an online sale, or unexpected, like your application going viral.
Amazon Web Services provides a platform that enables companies of all sizes to deliver digital content and platforms at scale. Gone are the days of purchasing servers and racks in a facility to physically host them, hoping you have enough capacity to serve peak demands.
A company can now architect its applications to scale up and back down again, using only the resources needed for a particular time frame. This, however, is not the end of the story. Intelligent architecting for elastic scaling to handle loads of any size is critical to ensuring that your company does not become a victim of its own success. Leveraging the full capabilities of the cloud takes forward planning and testing.
The importance of properly load testing your application cannot be overstated, whether the goal is to gauge preparedness for a specific event or simply to understand how your application performs under load and what its current limitations are.
Load testing, in general, should not be seen as a one-time event, but as a continuous practice throughout your software development lifecycle. Treating it this way will provide valuable insights and data that will help drive decisions and improve the performance of your application.
The method of load testing is also important. If load testing is an afterthought or done on a limited budget, you may be tempted to use a fairly rudimentary tool that simply sends thousands of simultaneous hits to your front page. But flooding your servers with unrealistic traffic and watching your application crash is not going to be of any real benefit. Whether you decide to undertake the load testing yourself or use a third party, several factors need to be considered:
- Have a key goal in mind
  - Is there a specific traffic goal you want to ensure your infrastructure can handle (page loads or unique clients per second), or are you just searching for the limits of your platform?
  - What does the performance of each component of the system (e.g. web servers, application servers, caches, databases) look like at this load?
  - Is the user experience at this load acceptable? For example, if the page takes 10 seconds to load, will you lose customers?
- Use a service that will send hits from multiple IPs and clients
  - This will enable you to observe how the load balancer handles traffic. A service that sends all requests from a single IP will not effectively simulate real traffic, as the load balancer may end up sending all requests originating from that IP to the same back-end instance (for example, when session stickiness or source-IP hashing is in use).
  - Simulating multiple clients can show differences in client-side performance when the site is under load, for your mobile users separately from your desktop users, or for visitors using Chrome as opposed to Edge.
- Use a service that will simulate the user journey
  - A good load tester will be able to simulate the user journey through your web application, reflecting the actual traffic patterns of your users (and thus testing your entire platform) and giving you real-world data to analyze.
- Determine what kind of load test you are going to perform. Different performance tests allow you to draw different conclusions about your architecture. For example:
  - Stress test
    - A stress test will continually ramp up traffic to a site beyond what it is designed to handle. This can be helpful in determining the upper limits of your application and observing its behavior as it begins to fail under load.
    - This will help you design your site to fail gracefully.
  - Soak/endurance test
    - Usually involves running a controlled load at or around business-as-usual levels and observing the effects over a sustained period of time. Soak tests are typically used to identify issues that are not immediately visible from a stress or spike test, such as memory leaks, log rotation problems or disk space issues.
  - Spike test
    - Applies a rapid increase of load to a system that may already be under load and measures the ability of the application to recover from sudden short bursts. This type of test can be useful in determining how an application handles a rapid influx of traffic that quickly dies down again, such as a radio advertisement driving traffic to your site.
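The three test types above differ mainly in the shape of the load they apply over time. As a rough illustration only, the sketch below generates a target requests-per-second value for each step of a test. The function names, parameters and numbers are hypothetical and not tied to any particular load testing tool; most tools let you express a similar schedule in their own configuration.

```python
from typing import List


def stress_profile(start_rps: int, step_rps: int, steps: int) -> List[int]:
    """Ramp load upward in fixed increments, past the expected capacity."""
    return [start_rps + step_rps * i for i in range(steps)]


def soak_profile(baseline_rps: int, steps: int) -> List[int]:
    """Hold a steady, business-as-usual load for a sustained period."""
    return [baseline_rps] * steps


def spike_profile(baseline_rps: int, peak_rps: int, steps: int,
                  spike_start: int, spike_length: int) -> List[int]:
    """Hold a baseline, jump to a peak for a short burst, then drop back."""
    spike_end = spike_start + spike_length
    return [
        peak_rps if spike_start <= i < spike_end else baseline_rps
        for i in range(steps)
    ]


if __name__ == "__main__":
    # Illustrative numbers only; pick values based on your own traffic goals.
    print("stress:", stress_profile(start_rps=100, step_rps=100, steps=5))
    print("soak:  ", soak_profile(baseline_rps=200, steps=5))
    print("spike: ", spike_profile(200, 1000, steps=6,
                                   spike_start=2, spike_length=2))
```

Each list entry represents one time step (for example, one minute), so the same schedule can drive whichever tool or service you choose.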
Without metrics, your performance testing will be largely meaningless. Implementing adequate monitoring will benefit you operationally. It will allow you to gather data on the performance of your systems, assist in diagnosing and responding to issues as well as inform your design and architecture decisions over time.
Monitoring can be broadly split into two categories: application performance monitoring and system/infrastructure monitoring.
Application performance monitoring
Tools such as AWS X-Ray or New Relic will help you drill into the performance of your application at a transactional, per-function level and perform in-depth analysis on the behavior of problematic transactions or components.
Once you’ve identified the transactions or components within the system that are not performant, improvements can be made. This might be a SQL query that can be optimized, a function within your application that needs refactoring or a third-party API that suffers from latency.
Once you’ve made improvements to your application and released those changes to your staging (or production) environment, you should measure the impact this has had on performance. New Relic has a feature that can compare performance between software releases. If you’re using other tools for application performance monitoring, you’ll want to record baseline performance before a release, then perform the same test afterward so you can measure this.
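If your tooling does not compare releases for you, the before-and-after comparison can be done by hand. The following is a minimal sketch, assuming you have exported response times (in milliseconds) from two runs of the same load test; the function names and the choice of median and 95th percentile are illustrative assumptions, not a prescribed method.

```python
import statistics
from typing import Dict, List


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return the median and 95th-percentile latency for one test run."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99,
    # so index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": statistics.median(latencies_ms), "p95": cuts[94]}


def percent_change(before: float, after: float) -> float:
    """Relative change in percent; negative means the release got faster."""
    return (after - before) * 100.0 / before


def compare_runs(baseline_ms: List[float],
                 candidate_ms: List[float]) -> Dict[str, float]:
    """Compare the same metrics across a baseline and a post-release run."""
    base = summarize(baseline_ms)
    cand = summarize(candidate_ms)
    return {name: percent_change(base[name], cand[name]) for name in base}


if __name__ == "__main__":
    # Made-up sample data purely for demonstration.
    baseline = [120.0, 130.0, 125.0, 400.0, 118.0, 122.0, 135.0, 128.0]
    candidate = [100.0, 105.0, 98.0, 250.0, 102.0, 99.0, 110.0, 101.0]
    for metric, change in compare_runs(baseline, candidate).items():
        print(f"{metric}: {change:+.1f}%")
```

Looking at a high percentile as well as the median matters because a release can improve typical requests while making the slowest ones worse.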
On top of all this, many APMs are able to report on the user experience with metrics on rendering time of your page. An application is not performant if the end-user experience is poor.
Infrastructure monitoring with tools such as Amazon CloudWatch will help you understand how the infrastructure-level components are handling the load on your platform. CloudWatch can also be configured to alarm on metrics, giving you the ability to use automation for self-healing or to alert you directly to issues. However, for now, we’ll concentrate on CloudWatch in the context of gathering performance data. There are a number of key metrics that should be tracked for every component of your system:
- CPU utilization
  - What is consuming CPU time? Is this utilization expected, or is it a process that has gone rogue?
  - Is your application single-threaded or multi-threaded?
  - Do you have multiple workloads on a single server? Can or should they be split out to separate servers?
  - If this is a database server, correlating your SQL queries with CPU cost may highlight expensive table joins or scans that could be improved.
- Memory utilization
  - Note: at the time of writing, CloudWatch does not natively collect memory metrics from your instances. However, memory usage can be published quite easily as a custom CloudWatch metric, using either monitoring scripts or the CloudWatch agent (which can be rolled out with Systems Manager State Manager).
  - What is consuming memory? Perhaps this is normal and expected, or perhaps you have a memory leak that needs fixing.
  - Correlating memory utilization with page file usage and disk IO may reveal a misconfigured operating system.
- Disk performance
  - If disk IO is higher than expected, evaluate what is causing this. For example, does the server have insufficient memory and is therefore using swap excessively?
  - If this is a database server, consider tuning the database engine to make better use of resources.
  - If your application makes heavy use of disk IO for scratch or temporary files, consider using instance store volumes. Because instance store volumes are locally attached, they provide better performance in most cases than EBS. However, instance store volumes are ephemeral: data survives a reboot but is lost when the instance is stopped, hibernated or terminated, or if the underlying disk fails. Make sure you’re persisting important data elsewhere or are prepared to lose it. (I’ll dive further into EBS performance and options later in this blog series.)
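Because memory has to be published as a custom metric, it may help to see roughly what that involves. The sketch below assumes a Linux instance with `/proc/meminfo` (including the `MemAvailable` field found on reasonably recent kernels) and the `boto3` library with credentials already configured. The namespace `Custom/System` and metric name `MemoryUtilization` are hypothetical choices for illustration; in practice an agent would normally do this for you.

```python
from typing import Dict, List


def memory_used_percent(meminfo_text: str) -> float:
    """Compute used-memory percentage from the text of /proc/meminfo."""
    values: Dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # sizes are reported in kB
    total = values["MemTotal"]
    available = values["MemAvailable"]
    return (total - available) * 100.0 / total


def build_metric_data(instance_id: str, used_percent: float) -> List[dict]:
    """Build the MetricData payload for CloudWatch PutMetricData."""
    return [{
        "MetricName": "MemoryUtilization",  # hypothetical metric name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Value": used_percent,
        "Unit": "Percent",
    }]


def publish(instance_id: str, namespace: str = "Custom/System") -> None:
    """Read current memory usage and send it to CloudWatch."""
    import boto3  # imported here so the helpers above work without it

    with open("/proc/meminfo") as f:
        used = memory_used_percent(f.read())
    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace,  # hypothetical namespace
        MetricData=build_metric_data(instance_id, used),
    )
```

Calling `publish("i-0123456789abcdef0")` (a placeholder instance ID) on a schedule, for example from cron once a minute, would produce a memory series you can graph alongside the built-in CPU and disk metrics.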
Monitoring the CPU, disk and memory of the individual servers that compose your cloud infrastructure will also help you to make informed decisions about instance sizing. Correlating this information with the data from your APM solution may inform you that your web server’s configuration could be adjusted.
Finally, if you’ve made all the enhancements you can make to the underlying application and configuration, you’ll then be able to make data-driven decisions around instance sizes, scaling policies and alerting thresholds. The end result will be a more performant application that uses its available resources more efficiently, and optimizes the ‘cost per click’ of your application.
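One way to put a number on that efficiency is to divide what the fleet costs per hour by the traffic it sustains. The sketch below is simple arithmetic under stated assumptions: the hourly price, instance counts and request rate are made-up figures, not real AWS pricing, and the function name is hypothetical.

```python
def cost_per_thousand_requests(hourly_price: float, instance_count: int,
                               sustained_rps: float) -> float:
    """Infrastructure cost of serving 1,000 requests at a sustained rate."""
    requests_per_hour = sustained_rps * 3600
    fleet_cost_per_hour = hourly_price * instance_count
    return fleet_cost_per_hour / requests_per_hour * 1000


if __name__ == "__main__":
    # Hypothetical figures: the same traffic served before and after tuning.
    before = cost_per_thousand_requests(0.10, instance_count=4,
                                        sustained_rps=200)
    after = cost_per_thousand_requests(0.10, instance_count=3,
                                       sustained_rps=200)
    print(f"before tuning: ${before:.6f} per 1,000 requests")
    print(f"after tuning:  ${after:.6f} per 1,000 requests")
```

Tracking this figure across load tests gives you a single value to compare when you change instance sizes or scaling policies.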
So far, we’ve looked at just two aspects of preparing your web application for a big event: load testing and monitoring. In part two, we’ll take a look at changes you can make to the infrastructure that will help you to scale beyond current limits and maintain availability of your application.
Want to find out more about optimizing your web applications using AWS? Visit Rackspace to learn about the ways we’re helping businesses with their AWS architecture every day.