Chris Buckley and Steve Robins contributed to this report.
AWS developed its Well-Architected Framework to help cloud architects build secure, high-performing, resilient and efficient infrastructure for applications. The framework also offers customers a consistent way to evaluate those architectures.
Rackspace was recently selected as one of just 34 companies worldwide by AWS to be part of its Well-Architected Partner Program. As a longtime premier consulting partner, Rackspace has delivered numerous well-architected reviews, or WARs, for our joint customers. This post shares some of the most common findings revealed during our reviews, with suggestions for how to approach these challenges.
For example, we believe it’s critical for business leaders to take part in these reviews along with technical experts, for two main reasons: first, your architecture and design should be driven by business goals; and second, it gives business leaders a deeper understanding of the cost and complexity trade-offs that are sometimes necessary.
High availability versus cost and complexity
Nowhere are those tradeoffs more apparent than with the level of availability, or the percentage of time an application is available. If architected according to the principles of high availability, an infrastructure can suffer multiple failures while still being able to service incoming requests.
While most customers would love to have 100 percent uptime, it is very difficult and expensive to achieve. Depending on how big the infrastructure is, even the difference between 99.99 percent and 99.9 percent availability can be sizable. We ask customers:
- How long can the business endure system downtime?
- How long should the system take to recover from an incident?
- How much data could be lost and unrecoverable in the worst-case scenario?
These questions can be tough to answer. But it is vital to understand the realities of IT systems, plan accordingly and design for failures.
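To make the availability trade-off concrete, the downtime budget implied by a given target can be computed directly. A quick illustrative sketch (the targets shown are common examples, not recommendations):

```python
# Allowed downtime per year and per 30-day month for common availability targets.

def allowed_downtime_minutes(availability_pct: float, period_hours: float) -> float:
    """Return the downtime budget in minutes for the given availability target."""
    return (1 - availability_pct / 100) * period_hours * 60

HOURS_PER_YEAR = 365 * 24
HOURS_PER_MONTH = 30 * 24

for target in (99.0, 99.9, 99.99, 99.999):
    yearly = allowed_downtime_minutes(target, HOURS_PER_YEAR)
    monthly = allowed_downtime_minutes(target, HOURS_PER_MONTH)
    print(f"{target}%: {yearly:8.1f} min/year, {monthly:7.2f} min/month")
```

The step from 99.9 to 99.99 percent shrinks the yearly downtime budget from roughly 8.8 hours to under an hour, which is why each extra "nine" costs so much more to architect for.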
Basing infrastructure needs on accurate metrics
We find far too many organizations making infrastructure decisions based on “guesstimates” rather than accurate metrics.
While your initial infrastructure design may serve you well at first, situations change and businesses grow. Resources can eventually become either under-provisioned, meaning user experience suffers, or over-provisioned, meaning you may have lost the opportunity to reinvest that money in achieving a higher SLA.
But changes to infrastructure configuration must be based on metrics. Here are several metrics every organization should be collecting and reviewing frequently:
- Resource uptime
- Recovery time objectives
- Recovery point objectives
- Resource utilisation
- Infrastructure costs and projections
- Load testing results
- Application performance monitoring (APM) metrics (e.g. average request time, transaction completion time, database query time)
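Once metrics like these are being collected, even simple automated checks beat guesstimates. Below is a minimal sketch of how utilisation data might feed a right-sizing decision; the 20 and 80 percent thresholds are illustrative assumptions, not AWS guidance:

```python
# Illustrative right-sizing check driven by collected utilisation samples.
# Thresholds (20% / 80%) are assumptions for the sake of the example.
from statistics import mean

def provisioning_status(cpu_samples: list[float],
                        low: float = 20.0, high: float = 80.0) -> str:
    """Classify a resource as over-, under-, or right-provisioned
    based on its average CPU utilisation (in percent)."""
    avg = mean(cpu_samples)
    if avg < low:
        return "over-provisioned"
    if avg > high:
        return "under-provisioned"
    return "right-sized"

print(provisioning_status([5, 8, 12, 7]))     # consistently idle
print(provisioning_status([85, 92, 88, 90]))  # sustained high load
```

In practice the samples would come from your monitoring platform and the decision would trigger a review rather than an automatic resize, but the principle is the same: let data, not guesswork, drive provisioning.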
Extending DevOps with DevSecOps
Too many companies have yet to integrate security into their DevOps process.
With increased DevOps adoption, engineers now automate many day-to-day operational tasks, and IT security should be one of them. Incident response can involve pre-deployed tools or access, changes to network permission controls, or updates to patching policies. DevSecOps aims to integrate security operations tasks into pre-existing DevOps streams. That can include:
- Automated auditing of access logs and API calls with alerting
- Automated and streamlined patching management
- Automated Infrastructure as Code vulnerability testing in deployment pipelines
- Automated failure injection to validate the resilience of the platform (as part of a broader disaster recovery plan)
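As a simple sketch of the first item, an automated audit pass can scan API call logs for events worth alerting on. The record shape here mimics CloudTrail-style entries, and the watch list of event names is an illustrative assumption you would tailor to your own policy:

```python
# Sketch of an automated audit over CloudTrail-style log records.
# The record fields and the watch list are illustrative assumptions.

SENSITIVE_EVENTS = {"StopLogging", "DeleteTrail", "AuthorizeSecurityGroupIngress"}

def flag_sensitive_calls(records: list[dict]) -> list[dict]:
    """Return the records whose API call name appears on the watch list."""
    return [r for r in records if r.get("eventName") in SENSITIVE_EVENTS]

logs = [
    {"eventName": "DescribeInstances", "userIdentity": "alice"},
    {"eventName": "StopLogging", "userIdentity": "bob"},
]
for alert in flag_sensitive_calls(logs):
    print(f"ALERT: {alert['userIdentity']} called {alert['eventName']}")
```

In a real pipeline this check would run continuously against delivered log files and feed an alerting system, rather than printing to stdout.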
Creating a data classification scheme
We often come across users who treat all data they store in AWS as equally important.
Treating all data equally either makes the infrastructure overly complex and expensive, or risks non-compliance with regulatory requirements. The questions below can help customers get the infrastructure design and operating model right. Just as when determining the SLA of an application, it is vitally important for the business owner of the application to be involved in answering these questions.
- What kind of data is stored in my AWS environment?
- What are the regulatory requirements for this data?
- Is the protection, access control and encryption of the data sufficient?
- How much data is stored, and how much will be stored in the future?
- How long should I keep the data for?
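The answers to these questions typically crystallise into a small classification scheme that maps each class of data to its required controls. A minimal sketch follows; the class names, retention periods and controls are assumptions for illustration only:

```python
# Illustrative data classification scheme mapping each class to controls.
# Class names, retention periods and controls are illustrative assumptions.

CLASSIFICATION = {
    "public":       {"encrypt": False, "retention_days": 365},
    "internal":     {"encrypt": True,  "retention_days": 730},
    "confidential": {"encrypt": True,  "retention_days": 2555},  # ~7 years
}

def controls_for(data_class: str) -> dict:
    """Look up the controls for a data class; unknown classes default
    to the strictest tier rather than failing open."""
    return CLASSIFICATION.get(data_class, CLASSIFICATION["confidential"])

print(controls_for("internal"))
print(controls_for("unknown"))  # falls back to the confidential tier
```

Defaulting unknown classes to the strictest tier is a deliberate design choice: it is cheaper to loosen controls after review than to discover unprotected regulated data.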
Implementing effective cost governance
Cost governance is critical, especially as an organization’s AWS environment becomes more complex. As an AWS partner, Rackspace offers comprehensive tools for cost governance. We suggest the following fundamentals for cost optimisation:
- Principle of least-privilege access: By granting only the permissions a user needs to do their job and no more, you can prevent staff from accidentally creating resources they are not authorised to create.
- Tagging strategy: It is nearly impossible to track all resources manually; a consistent tagging strategy lets you attribute costs to owners, projects and environments. Tags are also the foundation for automated resource management.
- Garbage collection and clean-up: Cleaning up unused resources should be done in an automated fashion by leveraging tags, and on a regular basis. If you don’t have automated tools in place, start doing it manually today, and build the tooling over time.
- Appropriate pricing models: Leverage reserved instances for steady-state workloads, or consider utilising the spot market for fault-tolerant and batch jobs. This is especially important for production workloads, where reserved capacity can yield significant savings.
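The tagging and clean-up items above can be combined: once a tagging standard exists, finding clean-up candidates is a simple filter. A minimal sketch, where the resource dicts mimic what an inventory API might return and the required tag keys are an assumption:

```python
# Sketch of tag-driven clean-up: find resources missing required cost tags.
# Resource shape and required tag keys are illustrative assumptions.

REQUIRED_TAGS = {"owner", "cost-centre", "environment"}

def untagged_resources(resources: list[dict]) -> list[str]:
    """Return the IDs of resources missing any required tag key."""
    return [
        r["id"] for r in resources
        if not REQUIRED_TAGS <= set(r.get("tags", {}))
    ]

fleet = [
    {"id": "i-0abc", "tags": {"owner": "web-team", "cost-centre": "42",
                              "environment": "prod"}},
    {"id": "i-0def", "tags": {"owner": "web-team"}},
]
print(untagged_resources(fleet))  # candidates for review or clean-up
```

A scheduled job running a check like this, fed from your real inventory, is the kind of small automation the next paragraph describes: it turns cost governance from a quarterly scramble into a continuous process.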
More advanced cost governance is always reliant on automation. Automated cost management is an important pillar for any DevOps-enabled organisation, especially those operating in the cloud with its flexible pricing structures.
We hope these common findings help you better understand the challenges of efficient cloud operations. With a well-architected review, your team gets a deeper understanding of how to create the optimal AWS environment for your unique business needs. And not only can Rackspace do the review, our experts can also help implement our findings. Learn more about how our experts can manage AWS for you.
Chris Buckley manages the Managed Public Cloud Onboarding Engineering team at Rackspace in Australia. He has been working daily with AWS for the past five years, for three of those at some of the most highly regarded AWS partners in Australia. A globally recognized AWS evangelist, he served from 2016 to 2017 as an AWS Australian Partner Cloud Warrior, and since 2017 as an AWS APN Global Ambassador.
Steve Robins is a Lead Infrastructure Engineer at Rackspace in Australia who assists AWS customers in achieving best-practice design, operational efficiency, and cost optimisation and governance. Steve has 20 years’ experience supporting and optimising IT and telco system infrastructure and has worked on AWS technologies since 2014, gaining SA Associate, SA Professional and SysOps Associate certifications during that time.