Cassandra By Example

EDITOR’S NOTE: This article describes an obsolete version of Apache Cassandra. For tutorials covering modern Cassandra, please visit

Cassandra has received a lot of attention of late, and more people are now evaluating it for their organization. As these folks work to get up to speed, the shortcomings in our documentation become all the more apparent. Easily, the worst of these is explaining the data model to those with an existing background in relational databases.

The problem is that Cassandra’s data model is different enough from that of a traditional database to readily cause confusion, and just as numerous as the misconceptions are the different ways that well intentioned people use to correct them.

Some folks will describe the model as a map of maps, or in the case of super columns, a map of maps of maps. Often, these explanations are accompanied by visual aids that use a JSON-like notation to demonstrate. Others will liken column families to sparse tables, and others still as containers that hold collections of column objects. Columns are even sometimes referred to as 3-tuples. All of these fall short in my opinion.

The problem is that it’s difficult to explain something new without using analogies, but confusing when the comparisons don’t hold up. I’m still hoping that someone will devise an elegant means of explaining this, but in the meantime I find concrete examples to be worth their weight in gold.


Despite being an actual use-case for Cassandra, Twitter is also an excellent vehicle for discussion since it is well known and easily conceptualized. We know for example that, like most sites, user information (screen name, password, email address, etc), is kept for everyone and that those entries are linked to one another to map friends and followers. And, it wouldn’t be Twitter if it weren’t storing tweets, which in addition to the 140 characters of text are also associated with meta-data like timestamp and the unique id that we see in the URLs.

Were we modelling this in a relational database the approach would be pretty straight-forward, we’d need a table to store our users.

    username VARCHAR(64),
    password VARCHAR(64)

We’d need tables we could use to perform the one-to-many joins to return followers and followees.

CREATE TABLE followers (
    user INTEGER REFERENCES user(id),
    follower INTEGER REFERENCES user(id)

CREATE TABLE following (
    user INTEGER REFERENCES user(id),
    followed INTEGER REFERENCES user(id)

And of course we’d need a table to store the tweets themselves.

    id INTEGER,
    user INTEGER REFERENCES user(id),
    body VARCHAR(140),
    timestamp TIMESTAMP

I’ve greatly oversimplified things here for the purpose of demonstration, but even with a trivial model like this, there is much to be taken for granted. For example, to accomplish data normalization like this in a practical way we need foreign-key constraints, and since we need to perform joins to combine data from multiple tables, we’ll need to be able to arbitrarily create indices on the appropriate attributes to make that efficient.

But getting distributed systems right is a real challenge, and it never comes without trade-offs. This is true of Cassandra and is why the data model above won’t work for us. For starters, there is no referential integrity, and the lack of support for secondary indexing makes it difficult to efficiently perform joins, so you must denormalize. Put another way, you’re forced to think in terms of the queries you’ll perform and the results you expect since this is likely what the model will look like.


So how would the model above be translated to Cassandra? Fortunately we need only look as far as Twissandra, a functional albeit minimalist Twitter clone written by Eric Florenzano, specifically to serve as a sample. So lets explore data modelling in Cassandra using Twitter and Twissandra as an example.


Cassandra is considered a schema-less data-store, but it is necessary to perform some configuration specific to your application. Twissandra comes with a sample configuration for Cassandra that should Just Work, but it’s worth taking some time to look at the specific aspects related to the data model.


Keyspaces are the upper-most namespace in Cassandra and typically you’ll see exactly one for each application. In future versions of Cassandra, keyspaces will be created dynamically similar to how you create databases on an RDBMS server, but for 0.6 and before, these are defined in the main configuration file like so:


Column families

For each keyspace there are one or more column families. A column family is the namespace used to associate records of a similar kind. Cassandra gives you record-level atomicity within a column family when making writes, and queries against them are efficient. These qualities are important to keep in mind when designing your data model, as you’ll see in the discussion that follows.

Like keyspaces, the column families themselves are defined in the main config, though in future versions you’ll create them on the fly similar to the way you create tables in an RDBMS.

One thing worth pointing out from the config snippet above is that in addition to a name, column family definitions also specify a comparator. This highlights another important distinction from traditional databases in that the order records are sorted is a design decision, and not something that can easily be changed later.

What are these column families?

It’s probably not immediately intuitive what all seven Twissandra column families are for, so let’s take a closer look at each.

  • User

This is where users are stored, it is analogous to the user table in the SQL schema above. Each record stored in this column family will be keyed on a UUID and contain columns for username and password.

  • Username

Looking up a user in the User column family above requires knowing that user’s key, but how do we find this UUID-based key when all we know is the username? With a relational database and the SQL schema above, we’d perform a SELECT on the users table with a predicate to match the username (WHERE username = ‘jericevans’). This won’t work with Cassandra for a couple of reasons.

First off, a relational database will scan your table sequentially when performing a SELECT like this, and since records are distributed throughout a Cassandra cluster based on key, the equivalent could mean contacting more than one node (possibly many). However, even with all of the data on a single machine, there comes a point when such an operation will become inefficient with a relational database, making it necessary to index the username attribute. As mentioned earlier, Cassandra doesn’t currently support secondary indices like this.

The answer is to create our own inverted index that maps readable usernames to the UUID-based key, and that is the purpose of this column family.

  • Friends
  • Followers

The Friends and Followers column families will answer the questions, who is user X following?, and who is following user X?, respectively. Each is keyed on the unique user ID, with columns to track the corresponding relationships and the time they were created.

  • Tweet

This is where the tweets themselves are stored. This column family stores records with unique keys (UUIDs), and columns for the user id, the body, and the time the tweet was added.

  • Userline

This is where the timeline as it pertains to each user is stored. Records here consist of user ID keys, and columns to map a numeric timestamp to the unique tweet id in the Tweet column family.

  • Timeline

Finally, this column family is similar to Userline, except that it stores the materialized view of friend tweets for each user.

So, given the above column families, let’s step through some common operations and see how they would be applied.

Adding a new user

First off, new users will need a way to sign up for an account, and when they do we’ll need to add them to our Cassandra database. For Twissandra, that would look something like:

username = 'jericevans'
password = '**********'
useruuid = str(uuid())

columns = {'id': useruuid, 'username': username, 'password': password}

USER.insert(useruuid, columns)
USERNAME.insert(username, {'id': useruuid})

Twissandra is written in Python and uses Pycassa for client access, so the uppercase USER and USERNAME above are pycassa.ColumnFamily instances that would have been created elsewhere during initialization for “User” and “Username” respectively.

Also, this is a good time to mention that this and the code samples that follow aren’t verbatim snippets from Twissandra, I’ve changed them to be more concise and self-contained. For example, in the code above, it wouldn’t make sense to assign variables for username and password, in a web application these would be taken from the form elements on a sign-up page.

Getting back to the sample, there are two different Cassandra write (insert()) operations taking place here, one to create a new record in the User column family and one to update the inverted index that maps human readable usernames to UUID keys. In both cases, the arguments to insert() are the key that we’ll later use to look up the records, and a map containing the column names and values.

Following a friend

frienduuid = 'a4a70900-24e1-11df-8924-001ff3591711'

FRIENDS.insert(useruuid, {frienduuid: time.time()})
FOLLOWERS.insert(frienduuid, {useruuid: time.time()})

Again we perform two different insert() operations, this time to add someone to our list of friends, and to track the inverse of that relationship, the addition of a new follower on the target user.


tweetuuid = str(uuid())
body = '@ericflo thanks for Twissandra, it helps!'
timestamp = long(time.time() * 1e6)

columns = {'id': tweetuuid, 'user_id': useruuid, 'body': body, '_ts': timestamp}
TWEET.insert(tweetuuid, columns)

columns = {struct.pack('>d'), timestamp: tweetuuid}
USERLINE.insert(useruuid, columns)

TIMELINE.insert(useruuid, columns)
for otheruuid in FOLLOWERS.get(useruuid, 5000):
    TIMELINE.insert(otheruuid, columns)

To store a new tweet, we create a new record in the Tweet column family using a newly created UUID as the key, with columns for the author’s user ID, the time it was created, and of course the text of the tweet itself.

Additionally, the user’s Userline is updated to map the time of the tweet to its unique ID. If this is the user’s first tweet the insert() will result in a new record, subsequent inserts will create new columns in this record.

Finally, Timeline is updated with columns that map time to tweet ID for this user and each of their followers.

One thing worth paying particular attention to here is that the timestamp used is a long (64 bit), and when it is given as a column name, it’s packed as a binary value in network byte-order. This is because the Userline and Timeline column families use a LongType comparator, allowing us to query for ranges of columns using numeric predicates, with results that are sorted numerically.

Getting a user’s tweets

timeline = USERLINE.get(useruuid, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

Here we’re retrieving the tweets for a user, first by obtaining a list of the IDs from Userline, and then fetching them from the Tweet column family with a multiget(). These results will be sorted by the numeric date/time, and in descending order since Userline uses a LongType comparator and reversed was set to True.

Retrieving the timeline for a user

start = request.GET.get('start')
limit = NUM_PER_PAGE

timeline = TIMELINE.get(useruuid, column_start=start,
column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

Just like the previous example, here we’re retrieving a list of tweet IDs, this time from Timeline. However, this time we’re also using start and limit to control the range of columns returned. This is handy for paging through the results.

So what next?

Hopefully this is enough to give you the general idea. Again, I took some liberties with the code samples and omitted some operations in an effort to be concise, so now might be a good time to check out the Twissandra source and take a deeper dive. There are a number of obvious features like retweet and lists, that were intentionally left unimplemented to serve as exercises for the initiated. If you’re comfortable with Python and Django, you might try your hand at one of those.

The wiki contains a growing base of information, including an up-to-date list of articles and presentations that people have given.

If IRC is your thing you can join #cassandra on where you can chat with people who have Been There and Done That and are always willing to help with questions. Or, if email is more your style there are also plenty of helpful folks on the cassandra-user list.

Rack Blogger is our catchall blog byline, subbed in when a Racker author moves on, or used when we publish a guest post. You can email Rack Blogger at


  1. Very nice. Thank you. It would be great to see some example schemas for a few other things too to get a sense. For example, a social network, a blog _with comments_ etc. It’s easy to understand Cassandra’s data model, but it’s hard to know how to design a good schema right now.

    Also, lots of folks seem to be using lazyboy. What’s the advantage of Pycassa?

  2. Also, I notice you’re putting the tweet ID into the timeline. I would have guess it would be faster to put the actual tweet text onto the timeline. Why just put the ID?

    • The Timeline column family is used to determine which tweets to display on a user’s home screen (in Twitter terms, what you’d see at when logged in). That page makes use of all Tweet columns, the author, the time it was created, and the ID (used for the perma-link).

  3. Is the Cloud the right solution? What type of cloud – private, public or hybrid? Which cloud platform should you choose?

    Get the answers to these questions and get started on the Cloud for free at:

    When you join the Microsoft Front Runner program, you can access one-on-one technical support by phone or e-mail from our developer experts, who can help get your applications in the cloud. Once your application is compatible, you’ll get a range of marketing benefits to help you let your customers know that you’re a Front Runner.

  4. This is the most easily accessible tutorial for Cassandra I’ve come across! I actually understand what ColumnFamilies are for now. Thanks for the article.

  5. Reproducing a test case on my machine, I face a syntax error in your tweeting code:

    columns = {struct.pack(‘>d’, timestamp: tweetuuid}

    Thanks for the write-up, that definitely helps understanding the data model.

  6. […] Cassandra By Example | Rackspace Cloud Computing & Hosting […]

    • I’m not sure what you mean here by “entities”. As a rule of thumb, you want to push as much as possible into as few a column families as you can, (you want to denormalize). I think there is still some room to take this further than Twissandra does, but not without making it more difficult to follow (and as a result, less useful as a demonstration tool).

  7. I’m confused about the User ColumnFamily and the need for the Username one. Are you assuming duplicate Usernames? If not, why not get rid of the UUID and key against the actual username? Thanks – nice article.

    • That’s a good question, and yeah, you could save the secondary index by keying directly off of the username. You’d lose some flexibility in the process though. For example, it would no longer be practical to change a username (Twitter allows you to do this).

      • Ahh, changing username would then be problematic. There’s always the debate between synthetic and “real” primary keys. I wonder if the flexibility of allowing username changes outweighs the performance gain of getting rid of one look-up. I suppose with smart caching though the username -> uuid look-up happens very infrequently.

    • Disk is cheap – as are bulk writes. So why not? It’s better then doing a join over a table with 50 billion rows for sure.

  8. Thanks! Very nice article especially when there’s no book about Cassandra available yet.

    P/S: It seems that the excerpts in Keyspaces & Column Families parts now are missing (I saw them before).

  9. As I understand this is still relational database model,the difference is
    1.Have to use the revert map key-value to stand as second indexes.
    USERNAME.insert(username, {‘id’: useruuid})
    columns = {struct.pack(‘>d’), timestamp: tweetuuid}
    USERLINE.insert(useruuid, columns)

    2.Not use SQL to do the CRUD operation.


  10. Hi, can we get a pictorial representation of the Cassandra data model for this Twissandra application? If it reflects the difference between the RDBMS and Cassandra design, it would be of great help.

  11. Hi Eric, great article! Very elegant introduction to cassandra; it’s very much appreciated.

    While I understand the concept, thanks to your article, I have a somewhat unrelated question. I made a php-script (I know, this can be the reason why) which adds a tweet, and then retrieves the timeline. My exec times seem extremely slow, so I’m wondering if I’m doing something completely wrong here.
    Insertions are very fast (0.000x sec), but retrieval, even with 10-50 tweets in total, takes 1.5-2 seconds.

    I realize this article doesn’t deal with caching or even optimization (as it is not the point, I’m aware), but would you mind saying a line or two about this if possible?
    Scaling my amateur script to 1 mio. users tweeting 2-3 times a day doesn’t sound at all plausible with my current cassandra skills 😉

    – Chuck

  12. Great post!

    A somewhat tangential correction:

    You state that the presented relational schema is normalized, but I think that the followers and following tables duplicate the same piece of information. That is, when userA decides to follow userB you’d need to insert a row into each of these tables. However, you only need one table with a composite primary key to store directed edges in a graph.

    • I’ve noticed this too – in a RDBMS you clearly don’t need two tables in this case. So no, the schema is not 100% pure. In the Cassandra data model this is clearly required to allow efficient querying (get all followers for user X, get users the user Y is following) because of lack of secondary indexes. So I guess this somehow ‘infected’ the relational model …

  13. […] Here is a good example of how to create a basic Cassandra database and begins to explain how data is laid out in a No SQL database versus a normal SQL database.   It uses an example based on a simple twitter example with 3-4 tables so that you can understand what is going on and no get all twisted up in the data model. Cassandra By Example. […]

  14. I’m a bit of a newb with cassandra and, as one reader already pointed out, there is a problem with the struct.pack function:
    “columns = {struct.pack(‘>d’), timestamp: tweetuuid}”

    This threw me off but I think I got what it’s supposed to do.
    The “columns” variable stores the key value pair timestamp: tweetuuid. However, timestamp is run through struct.pack to convert it to binary data string. So should the correct syntax read as follows:
    columns = {struct.pack(‘>d’, timestamp): tweetuuid}

    Is this correct? I can’t get my interpreter to process that line any other way.

  15. The first article that I found explaining the distributed data store concept to database-experienced persons. Amazingly detailed and nice-to-read.

  16. can anyone help me how to use Cassandra with Spring framework in Netbeans?
    please help if you have any documentation or any past experience about this..

  17. Excellent article. My understanding is maps of maps of maps of maps of….
    from the highest to the lowest level depending on the data model, so it becomes a ‘data map’. What comes out is the level of redundancy , and the speed it gives.
    The twitter/twissandra example is simple and one wonders why it was called complex in the first place.

  18. Great article. I have one additional question on this. How do I maintain atomicity between various inserts done in above example.

    tweetuuid = str(uuid())
    body = ‘@ericflo thanks for Twissandra, it helps!’
    timestamp = long(time.time() * 1e6)

    columns = {‘id’: tweetuuid, ‘user_id’: useruuid, ‘body’: body, ‘_ts’: timestamp}
    TWEET.insert(tweetuuid, columns)

    columns = {struct.pack(‘>d’), timestamp: tweetuuid}
    USERLINE.insert(useruuid, columns)

    TIMELINE.insert(useruuid, columns)

    How can we ensure that all the inserts (TWEET, USERLINE & TIMELINE) are successful?

  19. Wow. great job! It’s amazing, thank you so much for gathering all this information in one post with examples, it’s very clever and will be extremely helpful for all people who use Cassandra.


Please enter your comment!
Please enter your name here