Guidelines And Best Practices For Using Apache Cassandra On AWS

Editor’s note: Datapipe was acquired by Rackspace in 2017.


Apache Cassandra is a massively scalable open source NoSQL database that’s widely deployed in the AWS cloud. It’s ideal for managing large amounts of structured, semi-structured, and unstructured data across multiple distributed locations. Cassandra is a masterless, peer-to-peer distributed system where data is distributed among all nodes in the cluster. Each node has knowledge of the topology of the cluster and exchanges information across the cluster every second. Running your own Cassandra deployment on Amazon Elastic Compute Cloud (Amazon EC2) is a great solution for any business that runs applications with high throughput requirements.

AWS released a whitepaper with some best practices for using Apache Cassandra, and whether you’ve never touched it or consider yourself a pro, these are great tips to remember. Filled with information and recommendations from DataStax, the whitepaper is recommended to ensure your Cassandra deployment on AWS is as smooth as possible. Here are the highlights:

Planning Regions and Availability Zones

The AWS cloud infrastructure is built on Regions and Availability Zones (AZs). A Region is a physical location somewhere in the world, while each AZ consists of one or more discrete data centers housed in separate facilities. Rather than limiting you to one data center, AZs let you operate applications and databases that are more highly available, scalable, and fault tolerant. A business can use AWS’ global infrastructure to manage network latency and to address compliance needs.

Be careful, though. Data in one region is not automatically replicated outside that region. If your business needs higher availability, you’ll have to replicate that data across regions. Since Cassandra nodes all serve identical purposes, there’s no single point of failure. Because of this, it’s a good idea to spread Cassandra nodes across multiple AZs to enable and maintain high availability and uptime.
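As a rough illustration of the two ideas above, the sketch below spreads a set of nodes evenly across AZs and builds a keyspace definition that keeps replicas in more than one data center. It assumes Cassandra’s NetworkTopologyStrategy replication class; the node names, AZ names, keyspace name, and data center names (dc_east, dc_west) are hypothetical placeholders, and real data center names must match whatever your cluster’s snitch reports.

```python
from collections import Counter
from itertools import cycle


def spread_nodes(node_names, zones):
    """Assign nodes to AZs round-robin, so zone counts differ by at most one."""
    return dict(zip(node_names, cycle(zones)))


def keyspace_cql(keyspace, replicas_per_dc):
    """Build a CREATE KEYSPACE statement that uses NetworkTopologyStrategy."""
    options = ", ".join(f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items())
    return (
        f"CREATE KEYSPACE IF NOT EXISTS {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', {options}}};"
    )


# Hypothetical six-node cluster spread over three AZs in one region.
placement = spread_nodes(
    [f"cassandra-{i}" for i in range(6)],
    ["us-east-1a", "us-east-1b", "us-east-1c"],
)
nodes_per_zone = Counter(placement.values())

# Hypothetical data center names; three replicas are kept in each.
statement = keyspace_cql("app_data", {"dc_east": 3, "dc_west": 3})
print(nodes_per_zone)
print(statement)
```

With three replicas per data center and nodes balanced across three AZs, losing a single AZ should still leave replicas available in the other two, which is the availability benefit described above.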

Planning an Amazon Virtual Private Cloud

Amazon Virtual Private Cloud (Amazon VPC) gives you complete control over your virtual networking environment. This includes IP address range, creation of subnets, and configuration of route tables and network gateways. AWS strongly recommends launching a Cassandra cluster within a VPC—the enhanced networking feature being a big reason why.

Enabling enhanced networking on an instance results in higher performance, lower latency, and lower jitter. Currently, there are six different instance families that are supported for enhanced networking within a VPC: C3, C4, D2, I2, M4, and R3. If your business uses any of those instance families, it’s a wise idea to utilize a VPC to optimize your overall performance.
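A quick way to apply that list when reviewing a fleet is a small lookup like the one below. This is only a sketch: the set of families is copied from the paragraph above and reflects the time of writing, so check current AWS documentation before relying on it. The function name is made up for illustration.

```python
# Instance families listed above as supporting enhanced networking in a VPC.
# This is a snapshot taken from the article, not a current AWS list.
ENHANCED_NETWORKING_FAMILIES = {"c3", "c4", "d2", "i2", "m4", "r3"}


def supports_enhanced_networking(instance_type):
    """Return True if the instance type's family is in the list above."""
    family = instance_type.split(".", 1)[0].lower()
    return family in ENHANCED_NETWORKING_FAMILIES


for instance_type in ("i2.xlarge", "m3.large"):
    print(instance_type, supports_enhanced_networking(instance_type))
```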

Deploying Cassandra on AWS

Not only does Cassandra provide native replication capabilities, it can also scale horizontally. The illustration below demonstrates:

It’s possible to scale vertically using high-performance instances, but vertically scaled instances don’t provide the fault tolerance benefits that come with a replicated topology. Since AWS has a nearly unlimited pool of resources, it’s usually better to scale horizontally. Let’s face it: mistakes happen. Components of the setup can be automated to limit them. For example, using auto-scaling to automate the steps required for replacing a dead node reduces manual intervention. The more you can do to limit mistakes and manual intervention, the better.

The AWS cloud provides a unique platform for running NoSQL applications, including Cassandra. Its capacity can meet a variety of needs, its cost is based on use, and it integrates easily with other AWS products. Cassandra and AWS provide a robust platform on which a business can develop scalable, high-performance applications. If you have an existing Cassandra cluster that you’d like to migrate onto AWS, the whitepaper gives you the best practices for doing so.

Datapipe’s DataStax Enterprise-as-a-Managed-Service is now available worldwide via our Database Management service: we can plan, build, and run a Cassandra deployment on AWS. As a partner with both of these leading providers, DataStax and AWS, Datapipe engineers are well versed in best practices for these types of deployments. Along with AWS Direct Connect, there are a handful of options for businesses of all sizes to improve their overall performance and address any regulatory or compliance requirements. To learn if a solution like Cassandra is right for you, please visit our Database Services page.

David Lucky is a Product Marketing leader at Rackspace for the Managed Public Cloud services group, a global business unit focused on delivering end-to-end digital transformation services on AWS, Azure, GCP and Alibaba. David came to Rackspace from Datapipe where as Director of Product Management for six years he led product development in building services to help enterprise clients leverage managed IT services to solve complex business challenges. David has unique insight into the latest product developments for private, public and hybrid cloud platforms and a keen understanding of industry trends and their impact on business development. He holds an engineering degree from Lehigh University and is based out of Jersey City, NJ. You can follow David on LinkedIn at linkedin.com/in/davidlucky and Twitter @Luckys_Blog.