Cassandra 1.0, the cloud, and the future of big data

Jonathan Ellis is the CTO and co-founder of DataStax. DataStax is the developer of DataStax Enterprise, a distributed, scalable, and highly available database platform.

Apache Cassandra has come a long way since I first started writing about it at Rackspace. Since then, I started DataStax to commercialize Cassandra, we’ve had six major releases and run not one but two summits with hundreds of attendees, and now we’re about to release Cassandra 1.0 on October 8.

What’s new in Cassandra 1.0?

Cassandra was born as a hybrid of the best of Amazon’s Dynamo and Google’s Bigtable, but has now moved beyond its parents in several ways:

  • Column indexes: Cassandra allows creating and querying indexes on columns, called “secondary indexes” to differentiate them from the primary key index. Index building is done transparently by the server with no locking out of application requests. Obviously, this dramatically simplifies application development compared to when you had to build your own indexes using Cassandra’s wide row support.
  • CQL: The Cassandra Query Language provides a familiar subset of SQL to interface with Cassandra, reducing the learning curve involved in getting up to speed. (In my view, SQL itself was never the real problem being addressed by NoSQL systems. SQL was the original domain-specific language and does a good job providing a declarative API to databases.)
  • Performance: Cassandra performance has more than doubled since its early days. Third-party benchmarks consistently show Cassandra as the performance leader at scale. With Cassandra 1.0, that continues to improve; expect performance graphs of 1.0 compared to 0.8 on the DataStax developer blog soon.
  • Compression: New in 1.0, compression is virtually a free lunch: trading plentiful CPU cycles for more-expensive I/O, compression both improves performance and increases the amount of data Cassandra can handle per machine.
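To make the column index bullet concrete, here is a rough sketch of the older do-it-yourself pattern it refers to, where the application keeps a wide "index row" per value alongside the data rows. It models both with plain Python dictionaries rather than a real Cassandra client, and the names `users`, `users_by_state`, `insert_user`, and `find_by_state` are made up for illustration.

```python
from collections import defaultdict

# Toy stand-ins for two column families; this is not a Cassandra client.
users = {}                            # row key -> {column name: value}
users_by_state = defaultdict(dict)    # index row key (state) -> {user key: ""}


def insert_user(key, name, state):
    """Write the user row and, by hand, the matching index entry."""
    old = users.get(key)
    if old is not None and old["state"] != state:
        # The application has to remove the stale index entry itself.
        users_by_state[old["state"]].pop(key, None)
    users[key] = {"name": name, "state": state}
    users_by_state[state][key] = ""


def find_by_state(state):
    """Read one wide index row, then fetch each user row it points at."""
    return sorted(users[k]["name"] for k in users_by_state.get(state, {}))


insert_user("u1", "Ada", "TX")
insert_user("u2", "Bob", "CA")
insert_user("u3", "Cy", "TX")
insert_user("u1", "Ada", "CA")   # Ada moves, so the index needs fixing up too

print(find_by_state("TX"))
print(find_by_state("CA"))
```

Every write path has to remember the extra bookkeeping in `insert_user`. With a built-in secondary index, the server takes over the equivalent of `users_by_state`, and the application only issues the query.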

The cloud and Cassandra

The cloud is about providing infrastructure as a commodity: scaling up and down at will, paying for what you actually need instead of having to build out capacity for your largest spikes, and offloading datacenter operations to specialists.

However, the cloud has had trouble supporting a full traditional application stack: it’s easy to spin up a thousand web servers, for instance, since each can work independently. But most applications require maintaining some kind of durable state, and the relational databases that (until recently) have been our go-to choice for that don’t work that way.

One solution is to use a hybrid cloud: companies like Rackspace that offer both cloud and traditional hosting give you the flexibility of cloud for stateless computation, and specialized, more powerful hardware for database servers and similar core tasks. This is the approach GitHub took with its move to Rackspace.

The other solution is to start using a database that scales across the kind of commodity machines that you find in the cloud. This is the route Netflix took when it moved off its own datacenters and Oracle to EC2 and Cassandra.

As a side benefit, when you take this approach you can leverage cloud APIs to reduce your ops complexity even further. For example, Cassandra provides pluggable seed provider and snitch APIs that can be pointed at cloud services to answer “who are my peers in the cluster” and “where are they located,” respectively, instead of you maintaining that information by hand in configuration files.
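As a loose illustration of those two questions, the sketch below answers them from a cloud inventory instead of a hand-edited file. Everything in it is an assumption for illustration: Cassandra's actual seed provider and snitch plug-ins are Java interfaces, and the `INSTANCES` list, the `role` tag, and the way a zone name is split into datacenter and rack are all invented here.

```python
# Hypothetical inventory, standing in for what a cloud API might return.
INSTANCES = [
    {"ip": "10.0.0.1", "zone": "us-east-1a", "tags": {"role": "cassandra-seed"}},
    {"ip": "10.0.0.2", "zone": "us-east-1b", "tags": {"role": "cassandra"}},
    {"ip": "10.0.1.1", "zone": "us-west-2a", "tags": {"role": "cassandra-seed"}},
]


def get_seeds(instances):
    """'Who are my peers?': the addresses a new node should contact first."""
    return [i["ip"] for i in instances
            if i["tags"].get("role") == "cassandra-seed"]


def locate(instances, ip):
    """'Where are they located?': map a node to (datacenter, rack).

    Assumption for this sketch: the zone name minus its last letter is the
    datacenter and the last letter is the rack.
    """
    for i in instances:
        if i["ip"] == ip:
            zone = i["zone"]
            return zone[:-1], zone[-1]
    raise KeyError(ip)


print(get_seeds(INSTANCES))
print(locate(INSTANCES, "10.0.0.2"))
```

Because both answers are derived from the inventory, adding or replacing a machine changes what the cluster sees without anyone editing a configuration file on each node.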

The future of Big Data

In the early days of relational databases, query volumes and data sets were small enough that you could handle your realtime application needs and your analytics with the same database. But these two workloads are different enough that optimizing a single system for both is impossible, so separate systems evolved: OLTP for the former and OLAP for the latter. (Terminology around analytics is less well-defined and includes data warehousing, business intelligence, data mining, and others.) From this we got systems like MySQL that focused on realtime workloads, others like Teradata that focused on data warehousing, and complex ETL processes to move data between the two.

Today, you still see this split, with scalable NoSQL databases like Cassandra for realtime workloads, Hadoop for big data analytics, and ETL between the two. Maintaining and integrating two different systems adds a lot of operational complexity.

To address this complexity, DataStax is launching DataStax Enterprise, which marries the scalability and reliability of Cassandra with analytics and requires no ETL. It uses Cassandra’s advanced replication to keep the two workloads separate but seamlessly connected. With this approach, analytical work doesn’t slow realtime processing, but both the analytical side and the realtime side can see changes made by the other as they happen.

This is a guest post; the opinions of the author may not reflect those of Rackspace.



  1. I thought there is a way to test out Cassandra running on Rackspace – but can’t seem to find it on the Website. Is there a way? I am conducting some proof of concept work evaluating non-relational data stores and it would help if there is a quick way to be up and running easily — similar to what Microsoft is providing on Azure for Hadoop.

