Coding in the Cloud – Rule 2 – Don't write to the database in real time

Filed in by Angela Bartels | July 9, 2009 2:13 pm

Coding in the Cloud
By Adrian Otto

This continues my series on Rules for Coding in the Cloud[1], rules I’ve developed after watching applications encounter problems at scale when deployed on Cloud Sites[2].

People think about the cloud as an unlimited resource, but there are certain limits and you will reach those limits when you try to do something like writing multiple rows into the database for every single hit to your web site.  For example, if your site gets a million hits in an hour and you write four rows for each hit, you’d write four million rows of new data into your database in an hour.  That use pattern will cause lots of blockage and lots of wasted money. And in some cases, you can produce a write use pattern that can quickly exceed the capacity of the database server to write since writes take on average 10 times longer than any equally sized read.

So if you can avoid it, don’t write to the database.  When you must write to the database, do it very infrequently.  Don’t write based on the access pattern of your web site, or your entire application will fail when it’s under high load. Instead, find a way to individually queue that data in a scalable fashion, summarize it and then add it to the database at an infrequent interval.

Don’t use the database as a web log. If you do, and you have an application that’s running on hundreds of thousands of nodes in parallel, you’ll be unpleasantly surprised by the outcome. It will fail.

So what kinds of applications tend to break this rule?

Ad networks for one. Ad networks are designed to track where ads come from in real time so they can get up-to-date intelligence about the performance of ads. The critical error in logic that some ad network developers have made is that to get real time data you need real time logging, which you don’t.  All you need is real time summary data.  You don’t need the detail level in real time.  You need the detail level for archival, but you don’t need the detail level to get real time intelligence.  So what you really want in the case of serving an ad network is summary counters of the performance of all the various objects that you’re serving, and you want those counters updated in memory resources, not in the database. Every few minutes, read that information out of the memory counters and write it into the database for permanent storage.  This gives you a real time view of the ad network without writing multiple nodes into a database for every single access to every single ad.  We’ve seen multiple ad networks make this mistake, run into scalability constraints, and have to redesign the way their systems work. The principle here applies to all sites, though, not just ad networks.

Because of the way Cloud Sites works, storing something in memory as a summary value may seem rather tricky. The best way to do this is to use a memcached instance running on Cloud Servers[3]. From a PHP application you can use the Memecached class for this. It supports the increment method that will allow you to safely increment the value of a given key from numerous servers simultaneously.

Bottom line: don’t try to write to the database in real time.  Writing to the database in real time is a recipe that will fail at scale. If you’re going to write to the database, do it asynchronously. Take the data in a batched format and save it to the database at regular intervals.

  1. Rules for Coding in the Cloud:
  2. Cloud Sites:
  3. Cloud Servers:

Source URL: