By Adrian Otto, System Architect
The web is huge, and it’s getting bigger every single day. If you’re writing a web-scale application that will reach millions of end users, you may need to think carefully about how you write that application so that it will work properly under the demanding workloads the web can produce. Our computing hardware keeps getting faster and cheaper, and this gradual evolution has changed so much that traditional software development assumptions are no longer appropriate for today’s web-scale application environments. Memory is plentiful, CPUs are 64-bit, and bandwidth is cheap. Storage is cheaper than ever, but I/O capacity and high-speed network interconnects are not. Considering all of this, it’s no wonder that we are generating more data than ever before, and thinking of new and exciting ways to build amazing interactive software solutions and data analysis systems. To think web scale today, you need to change gears from what used to make sense years ago.
In this article, I want to highlight 5 concepts to think about when coding, and some tips to help you fulfill these concepts.
Keep Resource Requirements Low
In today’s world of fast CPUs, plentiful memory, and fast networks, this gets frequently overlooked. Sometimes the simplest way to reduce scalability problems is to make the software procedures more efficient. The faster you get on and off the CPU, the less data you transmit, and the less memory you occupy, the faster your application is likely to run. The lower the processing time per unit of work, the better. Four key resources to consider:
• I/O – Most applications run out of this first, especially when they deal with large volumes of data. Disk I/O is the slowest, but bus I/O can also be exhausted.
• Network – Although bandwidth is more affordable than ever, it’s still easy to run out of bandwidth if you rely heavily on the network for remote data access.
• Memory – It’s cheaper and faster than ever. CPUs now have lots of cache on them, and shared L2 caches between multiple cores!
• CPU – You can afford lots of this; more cores and more servers help a ton.
Mitigate Bottlenecks by Running in Parallel
Divide up your workloads into small pieces, and arrange for them to be processed in parallel on separate CPU cores, storage devices, networks, or even separate servers. Keep coordinated synchronization of the processing to a minimum. Locks kill concurrency.
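The divide-and-process pattern above can be sketched with Python’s standard library. The chunk size, worker count, and the work done per chunk are all illustrative; the point is that each worker owns its chunk outright, so no locks are needed.

```python
# Sketch: split a workload into independent chunks and fan them out
# across CPU cores. Each worker shares no state with the others.
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real per-chunk work; no locks, no shared state.
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(data, workers=4, chunk_size=1000):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with Pool(workers) as pool:
        partials = pool.map(process_chunk, chunks)  # runs chunks in parallel
    return sum(partials)  # one cheap merge step at the end

if __name__ == "__main__":
    total = parallel_sum_of_squares(list(range(10_000)))
```

The only synchronization here is the final merge of partial results, which keeps coordination to the minimum the paragraph recommends.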
Traditionally, software used a simple central data storage model. Scaling this requires vertical scalability of the database, which becomes increasingly expensive and difficult. Central databases quickly become I/O bottlenecks. Frequently updating status changes in a central data design can be particularly problematic. If you distribute your data over a number of different servers, and put part of your data into each server, you can spread the write load over numerous systems. Using a distributed data store system like Cassandra can help tremendously for this, taking care of all the partitioning of your data for you and making it very easy to add servers in order to increase your capacity.
If your application can tolerate data that is slightly stale, it can use an eventually consistent data storage system. These systems use asynchronous processes to update remote replicas of the stored data, so some users may see slightly stale versions of the data immediately following an update. Decide carefully what data needs ACID properties (Atomicity, Consistency, Isolation, Durability). If BASE (Basically Available, Soft-State, Eventually Consistent) is sufficient, then you can enjoy superior scalability and availability from a distributed data storage system that utilizes asynchronous replication of the data.
Horizontal Scalability Is Preferred over Vertical Scalability
Adding resources to a given computer system (more memory, more or faster CPU’s, faster network interfaces) is known as vertical scaling. Although vertical scaling is relatively easy and affordable, you can quickly reach its limits. What happens when you are running the biggest fastest server you can afford? Well, then you need to scale horizontally. This means adding more servers, and dividing the work among them. With a system that scales well horizontally you may be perfectly happy with a multitude of slower systems rather than a single fast server. Writing your software without a plan for horizontal scalability can get you trapped such that your only option is to throw more hardware at it to vertically scale up. If you have a problem that lends itself to horizontal division, you can scale out much further with significant cost savings.
• Concurrency != Locks. Working horizontally allows you to get lots of work done simultaneously. Try to minimize how much parallel work is synchronized. Too much serialization for data synchronization leads to low concurrency and inefficient use of resources.
• Thread per Connection = Bad. You usually don’t want many more threads than you have CPU cores. If you’re CPU bound and have more threads than cores, you will get much less work done. If your threads are I/O bound it may be acceptable to have a lot more threads than cores, but don’t spread your I/O system too thin trying to do too much at once either.
• Thread Pool with Fixed Number of Workers = Good. It’s much smarter to have a thread pool. You can then have an optimal number of workers pulling work off a queue, and substantially increase throughput.
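The fixed worker pool pattern is built into Python’s standard library; here is a minimal sketch, where the task function and pool size are illustrative stand-ins for real per-request work.

```python
# Sketch: a fixed pool of workers pulls tasks off an internal queue.
# Submitting 100 tasks never creates 100 threads -- only 8 run at once.
from concurrent.futures import ThreadPoolExecutor

def handle_request(request_id):
    # Placeholder for real per-request work (I/O, parsing, etc.).
    return request_id * 2

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, range(100)))
```

Because the thread count is fixed, per-thread overhead stays constant no matter how much work is queued, which is exactly why this beats a thread per connection.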
Helpful Tips for Writing Code that Scales
• Write a “stress” test plan first. Lay out your worst-case scenario. Write it down and post it on your wall to remind you. For example: “Support 10,000 concurrent connections with < 1 second response time.” Be sure to quantify exactly what an acceptable result would be under your worst-case usage scenario. Keep that in mind at every stage of your software design and implementation. Remind everyone regularly. It’s very easy to get sidetracked and distracted by your feature list and forget your architectural performance and scalability goals. Put it right there in front of you in black and white. Write a test plan in advance, before you write a single line of code. It works!
• Cache, baby, cache. Figure out what data you access frequently, and cache it in memory for repeated high-speed access. Distributed memory caches like memcached clusters are superior to host-based caches for most use cases.
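A memcached cluster needs running servers, so as a minimal in-process sketch of the same idea, here is the pattern of memoizing a hot lookup so repeat requests are served from memory instead of the slow backing store. The function name and data are illustrative.

```python
# Sketch: cache a frequently accessed lookup in memory. The counter
# shows that the slow backend is only hit once per distinct key.
from functools import lru_cache

backend_calls = 0  # how often we actually reach the "slow" data store

@lru_cache(maxsize=1024)
def get_user_profile(user_id):
    global backend_calls
    backend_calls += 1
    # Stand-in for a slow database query.
    return {"id": user_id, "name": f"user-{user_id}"}

get_user_profile(42)  # cache miss: hits the backend
get_user_profile(42)  # cache hit: served from memory
```

A distributed cache applies the same logic across many hosts, with the added benefit that every application server shares one pool of cached data.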
• Compress the data you send over the network. Compression and decompression of data between clients and servers is frequently overlooked as a way to make interactive applications more responsive. It decreases data transfer times, and increases your connection handling capacity per unit of time. The CPU time cost of the compression and decompression is usually trivial compared to the speed benefit you get from it. The overall efficiency of a system using compressed network transmissions is almost always higher than sending data uncompressed.
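As a sketch of the trade described above, the standard zlib module can stand in for whatever compression your transport uses; the payload here is synthetic, and real savings depend on how compressible your data is.

```python
# Sketch: compress a payload before it goes over the wire, and
# decompress it on the other side. Repetitive text (like most JSON
# traffic) compresses very well.
import zlib

payload = b'{"status": "ok", "items": []}' * 200  # synthetic, repetitive
wire_bytes = zlib.compress(payload, level=6)      # what actually gets sent
restored = zlib.decompress(wire_bytes)            # receiver's view
```

Fewer bytes on the wire means shorter transfer times and more connections served per unit of time, at a CPU cost that is usually trivial by comparison.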
• Compress the data you store on disk. Try it. Really. Yes, storage is cheap, but I/O is not. By compressing your stored data, you can easily and effectively increase your I/O throughput.
• Consider a sensor-driven admission control system when using work queues. A mistake I see over and over again is a system that puts no limits on its concurrent usage over the network. Let’s say, for example, you have a system with a maximum capacity of doing 100 things at a time, and it can produce 100 units of output within an acceptable response time T. If you give it 101 things to do, your output within time T might decrease to 50. If you give it 102 things to do, your output within time T might decrease to 40, and so on. Pushing a system beyond its limits can cause it to grind on itself without getting practically any work done. I recommend queuing work and refusing work when your queue gets too long to process. People are reluctant to have visible limits, for rather obvious reasons, but in reality it’s far better for performance to control the rate at which you accept new work when you are operating near your practical limits. If you reject work, perhaps your client will receive a visible error message or busy signal. Think about this carefully. Is it better to get a busy signal, or to have your phone call mysteriously hang up mid-sentence or sound horrible? That’s right, a busy signal is better. Somehow software developers seem to make no effort to put busy-signal capability into their networked systems. Using a sensible admission control system with an appropriate rejection procedure when queues get too long solves this. If you have a system that scales well horizontally, you could use an API call to provision more Cloud Servers to help you service the queue when the queue length gets too long. This way you can potentially scale your resources to track your demand and avoid the need to reject work at all. This elastic resource provisioning approach is no substitute for admission control, because demand might still exceed available capacity at some point, especially during error conditions.
Consider what happens if your system has an infinite loop where the server mistakenly acts as a client to itself. Admission control could break that loop before it takes your whole system down.
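The bounded-queue-with-rejection idea can be sketched in a few lines; the queue size and job type are illustrative, and in a real service the `False` return would map to an HTTP 503 or similar busy signal.

```python
# Sketch of admission control: a bounded queue accepts work only while
# there is room, and returns an explicit "busy signal" otherwise.
import queue

work_queue = queue.Queue(maxsize=100)  # the system's practical limit

def admit(job):
    """Return True if the job was accepted, False as a busy signal."""
    try:
        work_queue.put_nowait(job)
        return True
    except queue.Full:
        # Reject now, cheaply, rather than degrade every in-flight job.
        return False
```

The rejection path costs almost nothing, which is the whole point: the system spends its capacity finishing admitted work instead of thrashing on an unbounded backlog.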
• Design around seek. Most I/O bottlenecks are caused by seeking. When reading or writing data on a disk storage system, avoid operations that will cause the disk(s) to seek. Replace random I/O patterns with sequential ones where possible.
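One common way to turn random writes into sequential ones is an append-only log: instead of seeking to each record’s position, every update is appended to the end of one growing file. The path and record format here are illustrative.

```python
# Sketch: append-only writes are sequential I/O. The disk sees one
# growing stream instead of scattered writes across the platter.
import os
import tempfile

log_path = os.path.join(tempfile.mkdtemp(), "updates.log")

def append_update(record: bytes):
    # "ab" opens at end-of-file, so no seek to a record offset is needed.
    with open(log_path, "ab") as f:
        f.write(record + b"\n")

for i in range(3):
    append_update(f"event={i}".encode())
```

The trade-off is that current state must be rebuilt from the log (or the log compacted) later, off the hot path, which is exactly how log-structured storage systems work.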
• Keep per connection overhead low. If you need a ton of memory per connection, you won’t get much work done concurrently. If you have say 8 GB of memory and a 100 MB memory requirement per connection, you can run about 80 connections at a time. If you switch to a 10-thread worker pool of 100MB each, and reduce your memory per connection down to 1MB each, you can probably allow 7000+ connections at a time, and get work done at the same rate or faster compared to when you could only support 80 concurrent connections.
• Chatty applications should avoid parsed text protocols like XML. If components of your application communicate with each other over the network, or you do lots of communication between the client and server, try to keep the use of text-parsed protocols to a minimum. It’s very tempting to use XML or JSON types of text data formats in your network communications because they interoperate across different architectures. It turns out that the server-side work needed to parse text-formatted data is frequently very CPU intensive, and slows down connection processing times substantially. Keep it lightweight if possible. Consider using a simple binary protocol so that text parsing is not required.
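A fixed-layout binary message can be sketched with the standard struct module; the field layout below (a version byte, a 4-byte user id, an 8-byte timestamp) is illustrative, not any standard protocol.

```python
# Sketch: a tiny fixed-layout binary message. Decoding is a fixed-offset
# copy of 13 bytes -- no text parsing at all.
import struct

HEADER = struct.Struct("!BIQ")  # network byte order: u8, u32, u64

def encode(version, user_id, timestamp):
    return HEADER.pack(version, user_id, timestamp)

def decode(message):
    # Returns (version, user_id, timestamp) without scanning any text.
    return HEADER.unpack(message)

msg = encode(1, 42, 1700000000)  # 13 bytes on the wire
```

The equivalent JSON message would be several times larger and cost a full parse per request; the binary form costs one memory copy on each side.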
I like to keep system designs as simple as possible. Servicing millions of users makes this hard. From my experience, designing horizontally scalable systems comes at the cost of complexity. Consider carefully what the cost of scalability and efficiency really is before you commit to it. Sometimes just running something on a bigger faster server does the trick, and that’s easy. Adding compression, encryption, admission control, thread pools, eventual consistency, decentralized data, sequential I/O, low per-connection memory overhead, binary network protocols, and distributed caching is not for amateurs. Doing all of that is probably overkill for what most applications need. Simply keep these concepts in mind when you develop your applications so you can implement what makes sense in your application. Depending on your requirements you may be able to make an amazingly scalable system with only a few of these things.
We are working every day to build software and services that make deploying scalable applications easier and easier.
This concludes my article on “Writing Code that Scales.” It’s time to think differently about how you design and implement your software so that you end up with an efficient scalable system when it’s all done.