This is a guest post written and contributed by Sumeet Sheokand, CTO at GenieDB, a Rackspace Cloud Tools partner. GenieDB provides a solution that goes beyond the concept of failover and provides uninterrupted 100 percent availability for MySQL applications, even in the event of a data center or cloud outage.
The requirements of interactive online applications are increasing dramatically, with users demanding ever-higher levels of availability (uptime). At the same time, data centers and cloud infrastructures are becoming increasingly complex and failure-prone. Network connectivity issues and natural disasters add to the uncertainty surrounding cloud availability. The only solution able to provide 100 percent availability is to distribute the database across geographically separated data centers (geo-distribution).
Limitations of Current Failover Technologies
The prevalent availability solutions built around traditional databases such as MySQL have architectures that never contemplated application environments distributed across multiple data centers, regions or clouds. By fundamentally changing the way databases are replicated, synchronized and maintained in a distributed cloud environment, the roadblocks to continuous availability and faster response times can be removed for users anywhere in the world. This design also builds the foundation for next generation applications and unlocks the full potential of distributed cloud environments like Rackspace Cloud.
Challenges with Traditional Database Replication
Standard MySQL replication is set up either in a master-slave configuration or in a master-master (circular) replication configuration. Under either configuration, transaction logs are communicated from the master to one or more slaves in a process known as log shipping. Data on the slaves therefore trails the master, a condition referred to as slave lag. The extent of the slave lag depends on the workload, network bandwidth and network latency. When a master server fails (typically a question of when, not if), MySQL offers the possibility of promoting a slave to become the new master, but this is very painful to do in practice. The cluster has to stop taking writes while it waits for all existing slaves to catch up and apply the failed master’s logs, effectively quiescing the cluster. Only then can a new master be assigned, and all clients must reconnect to it for writes.
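The lag and promotion steps described above can be illustrated with a small in-memory simulation. This is a sketch only, not MySQL code: the `Slave` class, its fields and the `promote` helper are hypothetical names invented for illustration, and the "log" is just a Python list of events.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    name: str
    received: list = field(default_factory=list)  # events shipped from the master
    applied: int = 0                               # how many received events are applied

    def lag(self, master_log):
        """Events the master committed that this slave has not applied yet."""
        return len(master_log) - self.applied

    def catch_up(self):
        """Apply everything already received (the wait while the cluster is quiesced)."""
        self.applied = len(self.received)


def promote(slaves):
    """Hypothetical failover: let every slave catch up, then pick the most complete one."""
    for s in slaves:
        s.catch_up()
    return max(slaves, key=lambda s: s.applied)


master_log = ["t1", "t2", "t3", "t4"]
s1 = Slave("s1", received=master_log[:4], applied=2)  # has all logs, applied half
s2 = Slave("s2", received=master_log[:3], applied=3)  # never received t4
print(s1.lag(master_log), s2.lag(master_log))  # 2 1
print(promote([s1, s2]).name)                  # s1
```

Note that `s2` can never apply `t4` on its own because the event was not shipped before the master failed. No writes can be accepted while `promote` runs, which is the availability gap the text describes.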
The situation is further complicated if a master server fails, leaving different slaves at different binary log coordinates. Circular replication allows writes at more than one node and thereby introduces a new potential problem: conflicting updates to the same record. The challenges of this conflict resolution have significant implications at both the database and application layers, and the scope of this problem is well documented. A particularly interesting discussion is found in the post on the Facebook Engineering Blog. Facebook ultimately redesigned its application and load balancer, creating a custom version of MySQL to overcome these inherent challenges. Additionally, the loss of any single server while using circular replication is highly problematic, as MySQL does not offer a solution for recovery from a single failed node other than manual reconfiguration and restart.
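To make the problem of differing coordinates concrete, here is a minimal sketch of how an operator script might compare slaves after a master failure. It assumes each slave reports a binary log file name with a numeric suffix and a byte position (the kind of values shown in the `Relay_Master_Log_File` and `Exec_Master_Log_Pos` fields of `SHOW SLAVE STATUS`). The slave names, sample values and helper functions are hypothetical.

```python
def parse_coords(log_file, log_pos):
    """Turn ('mysql-bin.000012', 4075) into the sortable tuple (12, 4075)."""
    index = int(log_file.rsplit(".", 1)[1])
    return (index, log_pos)


def most_advanced(slaves):
    """Return the name of the slave that has executed furthest into the master's log."""
    return max(slaves, key=lambda name: parse_coords(*slaves[name]))


# Hypothetical coordinates collected from three slaves after the master died.
slaves = {
    "slave-a": ("mysql-bin.000012", 4075),
    "slave-b": ("mysql-bin.000012", 1930),
    "slave-c": ("mysql-bin.000011", 98211),
}

print(most_advanced(slaves))  # slave-a
```

Picking the most advanced slave is only the first step. The remaining slaves still have to be brought up to the same point before they can replicate from the new master, which is part of what makes manual failover painful.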
Beyond Failover: Seamless Multi-Data Center, Multi-Cloud Database Distribution
GenieDB has written a whitepaper specifically focused on multi-region geo-distribution of MySQL to ensure high availability in the cloud, so that an outage in one geographical area or cloud (for example, one caused by Hurricane Sandy) does not affect application availability. GenieDB solves the conflict resolution challenge using a modified Lamport time-stamping approach that also enables automated healing after a failure, or after a period in which nodes are disconnected, allowed to receive updates, and then reconnected. You can read more of the whitepaper at http://www.geniedb.com/beyond-failover.
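For readers unfamiliar with Lamport timestamps, the sketch below shows the textbook idea applied to conflict resolution. It is not GenieDB's implementation, whose modifications are not described here. It assumes a simple last-writer-wins rule in which each record carries a `(lamport_time, node_id)` stamp and the larger stamp wins; the `Node` class and `heal` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    clock: int = 0                               # Lamport logical clock
    store: dict = field(default_factory=dict)    # key -> (value, (time, node_id))

    def write(self, key, value):
        self.clock += 1                          # tick on every local write
        self.store[key] = (value, (self.clock, self.node_id))

    def merge_from(self, other):
        """Pull the other node's records, keeping the version with the larger stamp."""
        for key, (value, stamp) in other.store.items():
            self.clock = max(self.clock, stamp[0])   # never fall behind a seen stamp
            current = self.store.get(key)
            if current is None or stamp > current[1]:
                self.store[key] = (value, stamp)


def heal(a, b):
    """Reconcile two nodes after they reconnect."""
    a.merge_from(b)
    b.merge_from(a)


a, b = Node(1), Node(2)
a.write("user:1", "alice@old")
heal(a, b)                       # both nodes agree before the partition

# Partition: both sides keep accepting writes to the same record.
a.write("user:1", "alice@a")     # stamp (2, 1)
b.write("user:1", "alice@b")     # stamp (2, 2), wins the tie on node_id
b.write("user:2", "bob")

heal(a, b)
print(a.store == b.store)        # True
print(a.store["user:1"][0])      # alice@b
```

Because every node applies the same deterministic comparison, both sides converge on the same data after reconnecting without manual intervention. The node ID serves only as a tie-breaker when two logical times are equal.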
GenieDB is implemented as a plug-in storage engine, working within existing MySQL installations. Our patent-pending technology enables seamless master-master replication across data centers, providing better response times to users anywhere in the world and 100 percent availability during server, network or cloud failures. GenieDB’s automated healing algorithms and plug-in architecture dramatically reduce the total cost of ownership for multi-region deployments and lay the critical foundation for next-generation cloud-enabled applications.