Bill wrote a great article on why we’re using Amazon for data backups instead of Rackspace, where our primary servers are hosted. Here it is, if you’re interested.
This is a post written and contributed by
Right, the SLA for the backup system is not much of a concern. It is acceptable to us if Amazon S3/EC2/SQS has moments of downtime, since we need it only to push and retrieve backup data. If we were using Amazon’s web services for our production mail system rather than our backup system, then their SLA would definitely be a concern.
Suppose we experienced a catastrophic failure. Ignoring the time it would take to diagnose what happened and rebuild the broken server(s), and focusing just on the data restore time, it would take about 30 hours to restore an entire server’s worth of data over one S3 connection. However, we have the ability to quickly redistribute users from a failed server to many other servers, and we can make many simultaneous connections to S3. So, say we distributed the failed server’s users to 30 other servers. Each of those 30 servers could make its own connection to S3, so 30 hours divided by 30 connections = one hour to restore an entire server’s worth of data. We have about 200 such servers, so as you can see, this work can be distributed and parallelized nicely.
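To make the arithmetic concrete, here is a minimal Python sketch of that estimate. The function name `parallel_restore_hours` is made up for illustration, and the model is an idealized assumption: the failed server’s data splits evenly across connections, and neither S3 nor the network becomes a bottleneck as connections are added. Real restore times would likely be somewhat higher.

```python
def parallel_restore_hours(single_connection_hours: float, connections: int) -> float:
    """Idealized restore time when data is split evenly across connections.

    Assumes perfectly even distribution and no shared bandwidth limit,
    which is a simplification of a real restore.
    """
    if connections < 1:
        raise ValueError("connections must be at least 1")
    return single_connection_hours / connections


if __name__ == "__main__":
    # 30 hours is the single-connection figure quoted above.
    for n in (1, 10, 30):
        print(f"{n:>2} connections: {parallel_restore_hours(30, n):.1f} hours")
```

With 30 connections the estimate comes out to one hour, matching the figure above. Spreading the users over more servers would shrink it further, up to whatever limit the network imposes.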
Also, since S3 is external to our data center, we could easily redistribute the users to an alternate data center and restore their mail to servers there. We could even do restores from S3 to multiple data centers in parallel. Right now we host our system at just one Rackspace data center, but shh… don’t tell anyone… another data center is coming.