Laying Out a Standardised Infrastructure for AWS


One of the huge benefits of AWS is the variety of architectural approaches available to address specific requirements. At the same time, this long list of options can also present challenges.

With years of experience managing cloud environments, and through a partnership with AWS, Rackspace has developed opinionated standards and best practices for architecting solutions in the AWS ecosystem. This includes leveraging standardised CloudFormation templates, processes and supporting scripts that:

  • create scalable, resilient, redundant and secure architectures that meet a wide variety of requirements,
  • contain “required” guardrails to ensure Rackspace can operate at scale (snowballs vs. snowflakes), and
  • are flexible and able to accommodate customers’ unique application requirements.

The following diagram portrays the resulting simple infrastructure that adheres to the standards and best practices.

The diagram includes:

  • A single CloudFormation template
  • A single VPC
  • Availability Zones (AZ) Options
    • Two-AZ deployments are the standard
    • Three-AZ deployments to address specific application requirements
  • Subnets
    • Public Tier – could be accessible from the Internet
    • Private Tier – could access the Internet via a NAT environment
    • Subnets in each Tier will have the same netmask
  • Highly Available Outbound NAT (HA-NAT) – for EC2 instances in the Private Subnets
  • Security Groups – primary method to isolate and secure workloads

Rackspace also leverages the Compass service, an automated system that allows customers to harness the expertise of thousands of Rackspace employees when managing their AWS environments. More information is provided below.

CloudFormation templates

Rackspace will help customers create the necessary CloudFormation templates and stacks to ensure the business objectives of their environments are met. However, a standardised CloudFormation template, BaseNetwork, is used to create the initial network and all of its necessary components.

Here is an example of the CloudFormation Base VPC, Network and HA NAT instance scaffolding:

Virtual Private Cloud (VPC) and AWS accounts

For most Fanatical Support for AWS customers, Rackspace will recommend the deployment of a single VPC per account to provide operational simplicity while meeting stringent security requirements. Segregation will be accomplished by creating public, private, and if necessary, protected subnets, and by relying on carefully created security groups that only allow the required granular access.

Rackspace’s recommendation is to create separate AWS accounts for separation of environments (e.g. production, test, development, etc.), and not a separate VPC in the same AWS account.

This is because a second VPC in the same AWS account does not provide additional security benefits, and could complicate the operational processes (e.g. running into EC2 limits for production because a developer launched 10 test EC2 instances earlier that same day in the same account).

Customers will assign a single CIDR block to a VPC. The allowed block size is between a /28 netmask and /16 netmask. In other words, the VPC can contain from 16 to 65,536 IP addresses. Customers cannot change the size of a VPC after it has been created. If a customer’s VPC is too small to meet their needs, they will need to create a new, larger VPC, and then migrate their instances to the new VPC.
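Because the block size cannot be changed later, it is worth checking a proposed CIDR before the VPC is created. The following is a minimal sketch using only Python's standard `ipaddress` module; the function name `vpc_cidr_size` is hypothetical, and the /16 to /28 limits are the ones stated above.

```python
import ipaddress

# Netmask limits for a VPC CIDR block, as described above (/16 largest, /28 smallest).
MIN_PREFIX = 16
MAX_PREFIX = 28


def vpc_cidr_size(cidr: str) -> int:
    """Return the number of IP addresses in a proposed VPC CIDR block.

    Raises ValueError if the block is malformed or outside the /16 to /28 range.
    """
    # strict=True rejects blocks with host bits set, e.g. "10.0.0.1/16".
    network = ipaddress.ip_network(cidr, strict=True)
    if not MIN_PREFIX <= network.prefixlen <= MAX_PREFIX:
        raise ValueError(
            f"{cidr}: VPC netmask must be between /{MIN_PREFIX} and /{MAX_PREFIX}"
        )
    return network.num_addresses


if __name__ == "__main__":
    for candidate in ("10.0.0.0/16", "10.0.0.0/28", "10.0.0.0/8"):
        try:
            print(candidate, vpc_cidr_size(candidate))
        except ValueError as err:
            print(err)
```

Run as a script, this reports 65,536 addresses for the /16, 16 for the /28, and rejects the /8.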

Customers should choose their CIDR blocks carefully to fit the requirements of the application (e.g. avoiding overlap with on-premises networks that the VPC must connect to). As a rule of thumb, most AWS customers allocate roughly twice as many IP addresses to private subnets as to public subnets.

Availability Zones (AZ)

Each region contains multiple distinct locations called Availability Zones, or AZs. Each AZ is engineered to be isolated from failures in other zones, and to provide inexpensive, low-latency network connectivity to other AZs in the same region.

An AZ itself can be considered as one or more data centers connected together over low-latency/high-speed links. By launching instances in separate AZs, customers can protect their applications from the failure of a single location. It’s worth noting that each AWS region provides a minimum of two AZs.

Rackspace AZ Recommendations

Rackspace typically recommends a two-AZ deployment, which provides availability and redundancy while reducing complexity, operational overhead and cost. There are situations where a third AZ may be required to address specific application-centric requirements.

Example 1: MongoDB’s election and quorum constraints require three AZs for a three-node replica set to survive the failure of a single AZ. With only two AZs, one AZ must host two of the three members (e.g. the primary and a secondary). If that AZ fails, the single remaining member cannot form a majority, so no primary can be elected and the cluster stops accepting writes. With one member in each of three AZs, any single AZ failure still leaves two of three members, a majority, so a primary can be elected and the MongoDB cluster remains operational.
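The quorum argument in Example 1 can be checked with a small model. This is a simplified sketch that only counts voting members per AZ (it ignores arbiters, member priorities and hidden members), and the AZ names are placeholders.

```python
from typing import Dict


def survives_any_single_az_failure(members_per_az: Dict[str, int]) -> bool:
    """Return True if a replica set keeps a voting majority after losing any one AZ.

    members_per_az maps an AZ name to the number of voting members placed in it.
    A primary can only be elected while a strict majority of voting members is up.
    """
    total = sum(members_per_az.values())
    majority = total // 2 + 1
    # Check every possible single-AZ loss.
    return all(total - lost >= majority for lost in members_per_az.values())


# Three members across two AZs: one AZ must hold two of them.
two_az = {"az-a": 2, "az-b": 1}
# Three members spread one per AZ across three AZs.
three_az = {"az-a": 1, "az-b": 1, "az-c": 1}

print(survives_any_single_az_failure(two_az))    # False: losing az-a leaves 1 of 3
print(survives_any_single_az_failure(three_az))  # True: any loss leaves 2 of 3
```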

Example 2: Applications with strict load and availability requirements that cannot be met by relying on Auto Scaling groups must be over-provisioned. Adding a third AZ could reduce costs by lowering the amount of over-provisioning needed.

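For example, suppose strict application load and availability requirements dictate that 12 servers be up at all times, even if one AZ fails (assuming Auto Scaling cannot add capacity fast enough during an AZ failure). With two AZs, each AZ must run all 12 servers (24 in total) so that either one can carry the full load alone. With three AZs, each AZ needs only 6 (18 in total), because the two surviving AZs together still provide 12. Adding more AZs to the architecture therefore reduces cost, but could potentially add complexity.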
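The arithmetic behind Example 2 generalises to any number of AZs: after one AZ fails, the remaining AZs must still carry the full load. A minimal sketch, assuming servers are spread evenly across AZs and nothing is replaced during the outage (the function name is hypothetical):

```python
import math
from typing import Tuple


def servers_required(min_servers: int, num_azs: int) -> Tuple[int, int]:
    """Return (servers per AZ, total servers) so min_servers stay up after one AZ fails.

    Assumes an even spread across AZs and no replacement capacity during the outage.
    """
    if num_azs < 2:
        raise ValueError("surviving an AZ failure needs at least two AZs")
    # The num_azs - 1 surviving AZs must together provide min_servers.
    per_az = math.ceil(min_servers / (num_azs - 1))
    return per_az, per_az * num_azs


for azs in (2, 3, 4):
    per_az, total = servers_required(12, azs)
    print(f"{azs} AZs: {per_az} per AZ, {total} total")
```

For the 12-server requirement this gives 24 servers with two AZs, 18 with three and 16 with four, which shows the diminishing savings that have to be weighed against the added complexity of each extra AZ.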

Example 3: AWS’s new Aurora RDS offering requires three AZs.

Subnets

For most deployments, Rackspace recommends having two tiers of subnets, public and private.

  • EC2 instances in public subnets have public IP addresses, and these subnets route Internet traffic through an AWS Internet Gateway (IGW), giving the instances the capability (if required) to access, or be accessed from, the Internet.
  • EC2 instances in private subnets only have private IP addresses and cannot be accessed from the Internet. They can still access the Internet via a NAT server in the public subnets (further info in the NAT section below).

If the customer requires systems that cannot access the Internet, a third tier of subnets (protected) would be deployed without a NAT server or the associated route table entries.

  • EC2 instances in protected subnets only have private IP addresses; they cannot be accessed from the Internet, nor can they access the Internet.
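The difference between the three tiers comes down to each subnet's route table. The sketch below models the tables as plain data; the VPC range and the `igw-EXAMPLE` and `nat-instance-EXAMPLE` targets are hypothetical placeholders, and the model ignores security groups, which must also allow the traffic.

```python
from typing import Dict, List, Tuple

VPC_CIDR = "10.0.0.0/16"  # hypothetical VPC range

# Each route is (destination CIDR, target). "local" is the implicit intra-VPC route.
ROUTE_TABLES: Dict[str, List[Tuple[str, str]]] = {
    "public": [(VPC_CIDR, "local"), ("0.0.0.0/0", "igw-EXAMPLE")],
    "private": [(VPC_CIDR, "local"), ("0.0.0.0/0", "nat-instance-EXAMPLE")],
    "protected": [(VPC_CIDR, "local")],  # no default route at all
}


def internet_access(tier: str) -> str:
    """Describe a tier's Internet access based on its default (0.0.0.0/0) route."""
    routes = dict(ROUTE_TABLES[tier])
    default = routes.get("0.0.0.0/0")
    if default is None:
        return "none"
    if default.startswith("igw-"):
        # Inbound access also requires a public IP on the instance.
        return "inbound and outbound"
    return "outbound only (via NAT)"


for tier in ROUTE_TABLES:
    print(tier, "->", internet_access(tier))
```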

Assuming a typical two-AZ deployment, four subnets would be required (two public and two private) to accommodate redundancy in application deployments.

If a third AZ is required (e.g. for MongoDB servers in the private subnets), then six subnets would be required (three public and three private). Extending both tiers to the third AZ keeps the deployment simple and avoids a situation where the private tier spans three AZs while the public tier spans only two.

It’s important to note that within each tier, all the subnets would have the same netmask to simplify the operational processes (e.g. /24 for all public subnets and /23 for all private subnets).
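Putting these rules together (one subnet per tier per AZ, a single netmask per tier, and roughly twice as much space for private subnets), a subnet plan can be generated rather than laid out by hand. This is a minimal IPv4-only sketch; the function name, the 10.0.0.0/16 range and the AZ labels are hypothetical.

```python
import ipaddress
from typing import Dict, List


def plan_subnets(
    vpc_cidr: str, azs: List[str], tier_prefixes: Dict[str, int]
) -> Dict[str, Dict[str, str]]:
    """Allocate one IPv4 subnet per tier per AZ; all subnets in a tier share one netmask.

    Larger blocks (smaller prefix lengths) are allocated first to avoid alignment gaps.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    next_free = int(vpc.network_address)
    plan: Dict[str, Dict[str, str]] = {}
    for tier, prefix in sorted(tier_prefixes.items(), key=lambda item: item[1]):
        size = 2 ** (32 - prefix)
        plan[tier] = {}
        for az in azs:
            # Round up to a multiple of the block size so the network address is valid.
            start = ((next_free + size - 1) // size) * size
            subnet = ipaddress.ip_network((start, prefix))
            if not subnet.subnet_of(vpc):
                raise ValueError(f"{vpc_cidr} is too small for this subnet plan")
            plan[tier][az] = str(subnet)
            next_free = start + size
    return plan


print(plan_subnets("10.0.0.0/16", ["az-1", "az-2"], {"public": 24, "private": 23}))
```

With the example netmasks above (/24 public, /23 private), the two private subnets become 10.0.0.0/23 and 10.0.2.0/23 and the two public subnets 10.0.4.0/24 and 10.0.5.0/24. Passing three AZs produces the six subnets described for a three-AZ deployment.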

Unlike traditional network segmentation approaches that require separate subnets (VLANs) for the web, batch, app and data tiers, AWS security groups allow a deployment to use just the public and private subnets, with specific security groups applied to each tier (further info in the security section below). A deployment would therefore look like this:

  • NAT Servers – Public Subnets
  • VPN Servers – Public Subnets
  • Web-tier instances without ELB – Public Subnets
  • Web-tier instances with ELB – Private Subnets
  • Batch-tier instances – Private Subnets
  • App-tier instances – Private Subnets
  • Data-tier instances – Private Subnets
  • PCI workloads – Protected Subnets
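Because tiers share subnets, the isolation between them comes from security group rules that reference other security groups rather than IP ranges. The sketch below builds ingress rules in the `IpPermissions` format accepted by boto3's `authorize_security_group_ingress`; the security group IDs and ports (8080 for the app tier, 3306 for the data tier) are hypothetical.

```python
from typing import Dict, List


def ingress_from_group(port: int, source_group_id: str) -> List[Dict]:
    """Build an IpPermissions list allowing TCP `port` only from members of another security group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "UserIdGroupPairs": [{"GroupId": source_group_id}],
        }
    ]


# Hypothetical security group IDs and ports for a three-tier application.
TIER_RULES = {
    # App tier accepts traffic only from instances in the web-tier security group.
    "app": ingress_from_group(8080, "sg-web-EXAMPLE"),
    # Data tier accepts traffic only from instances in the app-tier security group.
    "data": ingress_from_group(3306, "sg-app-EXAMPLE"),
}

# With boto3 installed and credentials configured, a rule set could be applied with:
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(GroupId="sg-app-EXAMPLE", IpPermissions=TIER_RULES["app"])
```

Referencing a group rather than a CIDR means the rule keeps working as instances in the web or app tier are replaced or scaled, regardless of which subnet or AZ they land in.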

NAT recommendations

As mentioned above, NAT servers are required for private instances (instances in a private subnet) to access the Internet. In the recommended “two AZ deployment,” Rackspace recommends leveraging two NAT instances, one for each AZ, so that if an AZ were to fail, the other AZ systems would still be able to leverage a NAT server. If three AZs were to be deployed, three HA-NAT instances would also be deployed.

The HA-NAT instances are created via the CloudFormation Template that:

  • Uses the Amazon NAT AMI for the NAT servers
  • Creates an EC2 IAM role that allows the NAT instances to update the route tables of the private subnets
  • Creates two or three Auto Scaling groups and launch configurations (one per public subnet) to:
    • Launch one NAT server in each public subnet, with each group's minimum and maximum size set to one
    • Supply EC2 user data that runs an Amazon-sanctioned script

The script performs the following functions:

  • Reads instance metadata for relevant information (e.g. the instance ID)
  • Updates the private subnet’s route table so that the NAT server becomes its default gateway
  • Creates the necessary security group

Note: the external script is stored in a Rackspace S3 bucket.

In the event of a NAT instance failure, Auto Scaling launches a new NAT instance, and the script run from its user data updates the corresponding route table with the new instance/EIP ID.
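The route-update step can be sketched in Python. This is not the Amazon-sanctioned script referenced above, but a hypothetical minimal re-implementation. It assumes boto3 is available on the NAT instance, that the instance role permits `ec2:ModifyInstanceAttribute`, `ec2:DescribeRouteTables`, `ec2:CreateRoute` and `ec2:ReplaceRoute`, and that the private route table ID is supplied (e.g. via user data). The helper names are illustrative; the EC2 client methods are standard boto3 calls. Disabling the source/destination check is included because a NAT instance must forward traffic it did not originate.

```python
import urllib.request

METADATA_URL = "http://169.254.169.254/latest"


def instance_id_from_metadata(timeout: float = 2.0) -> str:
    """Read this instance's ID from the EC2 instance metadata service (IMDSv2)."""
    token_request = urllib.request.Request(
        f"{METADATA_URL}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
    )
    with urllib.request.urlopen(token_request, timeout=timeout) as resp:
        token = resp.read().decode()
    id_request = urllib.request.Request(
        f"{METADATA_URL}/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(id_request, timeout=timeout) as resp:
        return resp.read().decode()


def take_over_default_route(ec2, instance_id: str, route_table_id: str) -> None:
    """Make this NAT instance the default gateway for a private subnet's route table.

    `ec2` is a boto3 EC2 client (or any object with the same methods).
    """
    # A NAT instance forwards other hosts' traffic, so source/destination checks must be off.
    ec2.modify_instance_attribute(InstanceId=instance_id, SourceDestCheck={"Value": False})
    tables = ec2.describe_route_tables(RouteTableIds=[route_table_id])["RouteTables"]
    has_default = any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0" for route in tables[0]["Routes"]
    )
    route_args = dict(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        InstanceId=instance_id,
    )
    if has_default:
        ec2.replace_route(**route_args)  # repoint a route left behind by a failed predecessor
    else:
        ec2.create_route(**route_args)  # first boot: no default route exists yet
```

On a replacement NAT instance this might be invoked as `take_over_default_route(boto3.client("ec2"), instance_id_from_metadata(), "rtb-EXAMPLE")`, where `rtb-EXAMPLE` is a placeholder route table ID.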

Rackspace Security Model

As a general best practice, Rackspace advises customers to use security groups as their primary method of securing workloads within AWS. While Network ACLs (NACLs) are typically more familiar to networking engineers, they often introduce complexity into AWS architectures.

Security groups provide more granular control, are stateful (return traffic for an allowed connection is permitted automatically) and apply at the instance level. NACLs, by contrast, are stateless and apply at the subnet level: when using NACLs as well as security groups, you must consider all traffic in a stateless context, specifying both inbound and outbound ports, including any ephemeral ports used by a given application. The “blast radius”, or potential for impact when a NACL is incorrect or changed, is therefore significantly higher, without providing any tangible benefit over the use of a security group.
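The practical difference between stateful and stateless filtering shows up in the reply to a connection. The toy model below is not how AWS evaluates rules (it omits NACL rule numbers, explicit denies and protocols); it only illustrates why an inbound-only rule is enough for a security group but not for a NACL. The 1024 to 65535 ephemeral range and client port 50000 are assumptions for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

EPHEMERAL = (1024, 65535)  # client-side port range assumed for this example


@dataclass(frozen=True)
class Rule:
    direction: str  # "in" or "out"
    ports: Tuple[int, int]  # inclusive destination-port range the rule allows


def allowed(rules: List[Rule], direction: str, port: int) -> bool:
    return any(r.direction == direction and r.ports[0] <= port <= r.ports[1] for r in rules)


def https_reply_allowed(rules: List[Rule], stateful: bool, client_port: int = 50000) -> bool:
    """Can a server answer an inbound HTTPS request under these allow rules?"""
    if not allowed(rules, "in", 443):
        return False  # the request itself is blocked
    if stateful:
        return True  # security groups track the connection and allow the reply automatically
    # A NACL evaluates the reply on its own: it goes out to the client's ephemeral port.
    return allowed(rules, "out", client_port)


inbound_only = [Rule("in", (443, 443))]
print(https_reply_allowed(inbound_only, stateful=True))   # True
print(https_reply_allowed(inbound_only, stateful=False))  # False: the reply is dropped
print(https_reply_allowed(inbound_only + [Rule("out", EPHEMERAL)], stateful=False))  # True
```

Every NACL therefore needs a matching outbound ephemeral-port rule for each inbound service, and the reverse for outbound connections, which is where the extra complexity and blast radius come from.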

Rackspace and AWS recommend avoiding NACLs due to potential conflicts with security groups and performance degradation. If there are compliance requirements (e.g. PCI) that specifically call for NACLs, they will be used sparingly and with coarse controls to mitigate potential issues.

Customer-applied NACLs may interfere with Rackspace’s ability to manage and operate a customer’s environment, so additional caution is needed. Rackspace Compass has a check that identifies such scenarios.

Rackspace Compass

We end this blog post with the Rackspace Compass service, an automated system that allows customers to harness the expertise of thousands of Rackspace employees when managing their AWS environments.

The service performs an automated analysis of both the control plane (e.g. data visible through the AWS APIs) and the data plane (e.g. operating system configuration) to provide a thorough look at opportunities to improve the reliability, redundancy and security of the environment.

Rackspace Compass complements the AWS Trusted Advisor service: it uses the results generated by Trusted Advisor as a baseline set of recommendations, and adds relevant recommendations from the best practices Rackspace has developed both with AWS and through more than 15 years of supporting Linux and Windows servers.

Visit Rackspace for more information about Fanatical Support for AWS and how it can help your business. And download our free white paper Best Practices for Fanatical Support for AWS, which offers further detail regarding implementation options for AWS, including identity management, auditing and billing.
