Message from Rackspace CEO Lanham Napier
July 9, 2009
Some of our customers have been directly affected by recent outages in a portion of our Dallas-Fort Worth Data Center. Others of you may have heard about it and are following it closely. An interruption like this is not up to our Fanatical Support standards and we are working hard to prevent such incidents from occurring in the future.
On behalf of Rackspace, I sincerely apologize for these disruptions. We know these failures negatively impacted the lives and businesses of our customers. After the disruptions occurred we did our best to recover quickly and explain what happened in a transparent fashion. Since we have seen erroneous and incomplete information on the web and in the media, we want to share with you the most up-to-date and accurate information.
First, some context. Our DFW Data Center has three phases, or sections, and the outages were caused by a malfunction in our power infrastructure in Phase 1. We have redundancy in place, and this redundancy generally works as intended, but these outages show that we clearly have room for improvement.
While we take any outage seriously, it is important to know that this is not a pervasive issue across Rackspace. We operate nine facilities worldwide and the problems in DFW are not affecting our other data centers. Unfortunately, these localized incidents in DFW have had a disproportionate impact on some customers.
Here’s a quick recap of the outages and near-term resolution activities:
- We had a power interruption on June 29, 2009 in Phase 1 of our DFW Data Center, and we moved some of our customers to generator power. The generators then experienced a failure, which caused those customers to lose power to their servers for approximately 40 minutes. We have since performed maintenance and upgrades to those generators, with the help of experts from companies like Cummins, GE and Eaton, and the generators are now stable.
- We experienced another power interruption on July 7, 2009. Again, we moved customers to generator power. During this outage we also suffered a loss of network connectivity due to the power disruption. The part of the power infrastructure that failed (a “bus duct”) prevented proper operation of our UPS for that section, so some customers lost power to their servers for about 20 minutes before we could get them onto generator power. We have since replaced the failed bus duct, and that section of the data center is back to normal and running on utility power.
If you would like more detailed information on the June 29 interruption, please refer to the June 29 Incident Review. The Incident Report for July 7th is forthcoming. I also wanted to speak to all of you in some way other than anonymous copy on a screen, so this morning, I recorded a video which follows this letter.
WHAT WE’RE DOING ABOUT IT
The above resolution steps address the near-term issues. Now we are digging into the actions we need to take to prevent these types of outages in the future. Let me be clear: data centers will experience power interruptions, parts will break, and servers will go down. No data center is completely risk-free. But we can manage and mitigate the risk to acceptable levels, better than we have today, and we can make sure our recovery is as quick as physically possible. I have no doubt that we will get better and stronger from this situation.
Our main actions include the following steps:
- Put our best people on it, and bring in the experts. I am personally going to locate myself in our DFW data center until I am satisfied that our repairs and maintenance are complete. We have assembled our best talent from the US and the UK to focus on the issues there. And we have brought in top talent from our vendors, as well as knowledgeable outside consultants, to assist us.
- Assess the status of the infrastructure. We are combing through the power systems in DFW and assessing every link in the chain. Based on the advice of our experts, we will update every piece that needs updating to ensure the performance we require.
a. Phase I has four zones within it. At this point we have completed work on the major power systems for each zone by remediating known deficiencies at the generator and UPS levels.
b. We will continue our work through the smaller components of each zone including switches, breakers and ducting. At this point we have completed all of the work on the smaller components for one of our zones and preventative maintenance on the other three zones is underway.
c. We will complete all of this work as soon as possible with minimal disruption to customers.
- Improve standard operating procedures. We are going to increase the frequency of our testing, monitoring and measurement programs within DFW. Our maintenance schedules will change. And the level of detail we review internally and share externally will increase.
- Invest. We will continue to invest in our infrastructure. We have invested more than $50 million in DFW over the last two years. We invested some of this money in expansion, some to improve our networking and cooling infrastructure, and now we will spend more to improve the capability of the power systems. We will also invest in additional information systems as appropriate to support our new measuring and management procedures.
THE RACKSPACE FANATICAL SUPPORT APPROACH
I would also like to share our Fanatical Support philosophy regarding any downtime or outage situation. Here’s what you should know about how we act:
- Our first priority is getting customers back up. This priority takes precedence over everything else. Customer uptime is a core principle of Fanatical Support, and if you have much experience with us, you know that we take Fanatical Support very seriously.
- We pledge to be transparent. We will do our best to communicate what we know when we know it, and to keep customers and the broader Rackspace community informed. We understand our role in running the Internet, and we know that any missteps ripple out beyond our customers.
- We will fix the problems in a way that minimizes customer disruption. When we experience a disruption or outage, our root cause analysis identifies fixes that improve redundancy and stability. We then undertake these fixes during maintenance windows, or we utilize other ways to prevent customer impact (such as running on generator power during a utility fix). Sometimes, as in the July 7th outage, we experience an additional outage before we have had a chance to completely diagnose and repair all parts of the infrastructure. Note that in the case of DFW, we are confident we have stabilized the power infrastructure, although we will continue to be hyper-vigilant in monitoring and responding to any irregularity.
- We will honor our Service Level Agreements. We think we have the best SLAs in the industry, and we will not hesitate to make it right with our customers when there is a disruption. We will stand by and honor our Fanatical Support Promise to our customers.
As always, your feedback is welcome. Please be honest with us about your expectations and how we can do a better job for you. Fanatical Support is in our blood, and times like these are character tests for us. We will do our best to restore your trust in us. I want to thank you, our customers, for standing by us, as well as our Rackers for their tireless efforts to deliver Fanatical Support.
CEO, Rackspace Hosting
Bus duct installation complete: Dallas-Fort Worth data center status * July 9, 2009, 1:45 am CDT: We have replaced the bus duct and successfully returned to utility power on UPS cluster A. The transition started at approximately 11:30 p.m. CDT and was completed at 1:45 a.m. CDT. The DFW data center has returned to normal operating condition.
Status of bus duct installation * July 8, 2009, 6:00 pm CDT: The new bus duct is en route to our Dallas-Fort Worth data center with an expected arrival time between 7:00 pm and 8:00 pm this evening. As soon as we receive the bus duct, we will begin installation and testing – a process which will take approximately 5 hours to complete. We expect to transition from generator to utility power with UPS support on or about 1:00 AM CDT July 9. We will provide additional updates should the schedule change dramatically and upon successful transition to normal operations.
Status * July 8, 2009, 11:55 am CDT: Early this morning, we completed the installation and testing of a temporary bridge that carries power from UPS cluster A to the power distribution units and the cabinets and servers. This temporary bridge is part of the two-phase bus duct replacement process as noted in our July 7th 8:00 pm update.
The second phase of the replacement is the installation of a new bus duct. The new bus duct is being manufactured for us and will be flown in for installation. We expect to receive the bus duct tonight and will immediately begin installation and testing.
Servers supported by UPS cluster A continue to run on generators, which are running reliably and predictably. If necessary, we can switch to UPS and utility power using the temporary bridge. We have experts from our vendor onsite and available to assist with generators as needed.
We will notify customers once we successfully complete installation and testing and before we return these servers to utility power.
Overview and status * July 7, 2009, 8:00 pm CDT: Today, in our Dallas-Fort Worth data center, a part failed, causing a power interruption and network issues in a portion of the data center. As of 8:00 p.m. CDT, a portion of the data center is running on generator power, and after we have replaced the failed part, we will move that portion of the data center back over to utility power.
Specifically, the part that failed is called a bus duct, which is composed of straps or tubes of metal used to conduct large amounts of electricity. Because a data center consumes substantial amounts of electricity, bus ducts are commonly used in the power infrastructure. In our Dallas-Fort Worth data center, the bus duct failure caused downtime for customer servers that are supported by UPS cluster A. There were also intermittent network performance issues for customers in sections supported by UPS clusters B and E. We are still in the process of determining why the bus duct failed and why customers experienced downtime as a result of this issue. Customers supported by UPS cluster A are currently being powered by generators, which are running reliably and predictably.
The bus duct replacement is underway and, when complete, will allow us to switch back to utility power. This replacement comprises two stages, a temporary fix and a permanent fix. The temporary fix will allow us to switch back to utility power if we have any issues with the generators, although we plan to continue to operate on generators for the time being. The permanent fix will use a production part and allow for a permanent switch back to utility power. Customers supported by UPS cluster A should not experience any disruption during this repair work, and we will notify them in advance of the switch-overs.
We realize that although we were able to restore power within minutes, some of our customers were adversely affected and for this we sincerely apologize.
We appreciate your patience and will continue to provide updates as we have new information available.
Status * July 7, 2009, 1:30 pm CDT: Today at approximately 11:00 AM, an electrical connection failed, causing a brief power interruption to customers on UPS cluster A. This failure also may have caused intermittent network performance issues for customers supported by UPS clusters B and E for a short time.
For cluster A customers, we bypassed the UPS and restored power to the servers via generator within a few minutes. Currently systems supported by UPS cluster A are still running on generator power. Repairs are underway and we plan to return to utility power with UPS support as soon as possible. We will follow up with additional updates as new information becomes available.
Update * July 7, 2009, 12:04 pm CDT: We’ve received some questions about whether or not this was a network or power interruption. To clarify, the network issue was related to the power interruption.
Our Dallas data center experienced a network interruption which may have caused a brief loss of network connectivity to some servers. We apologize for any inconvenience this may have caused you and your business. We appreciate your patience while we work through this issue.
Notice * July 7, 2009, 11:44 am CDT: Today a portion of our Dallas data center experienced a brief power interruption. Rackspace is aware of this issue and is currently investigating it. We will be sending out periodic updates as more information becomes available.
Status * July 3, 2009, 6:59 am CDT: We have successfully completed the July 3, 2009 scheduled maintenance on both the A bank generators and the utility breaker. During this maintenance window, we performed the production load test of generator bank A and confirmed that we have eliminated the excitation failures that caused recent customer disruptions. We have returned the DFW data center to normal operating conditions. We will follow up with additional information as necessary. Thank you for your continued patience throughout this process.
Status / Scheduled Maintenance * July 2, 2009, 2:35 pm CDT: We are continuing to research and troubleshoot the root cause of the power interruption in our DFW facility.
As part of our work to improve the reliability and performance of these areas of the data center, we have scheduled maintenance on generator bank A for Friday, July 3, from 12:01 a.m. to 6:00 a.m. (CDT). Customers who are supported by this generator bank have been notified of this maintenance.
Also, as a cautionary measure, we have asked Oncor, our power supplier, to perform preventative maintenance on their utility breaker during the same maintenance window. Oncor believes this breaker maintenance is low risk and can be accomplished in less than 30 minutes. However, it requires that we place all customers in Phases 1 and 2 of the data center onto generator power. This means that in addition to placing customers on generator bank A onto generator power as planned, we will also place customers supported by generator bank B onto generator power for a brief period during the July 3rd, 12:01 AM to 6:00 AM CDT maintenance window, after the generator bank A maintenance occurs.
We believe no customers will be impacted but want to provide this update to our customers. If you are a customer and have questions, please contact your support team by visiting http://my.rackspace.com or by calling 888-480-7640 or 0800 587 2306, +44(0)20 8734 2700 (UK).
Status * July 1, 2009, 2:45 pm CDT: We are continuing the diagnosis activities to determine the root cause of the interruption. We conducted tests last night on the generators in question and believe we are making progress in understanding what caused the interruption. We have our suppliers and external consultants onsite working with us on this process. We will continue to provide status updates as we learn more.
Message from Rackspace CEO Lanham Napier
June 30, 2009
Yesterday afternoon at 3:15 CDT our data center in Dallas experienced an interruption in power to portions of the facility. The interruption caused customer servers to lose power and go down. We sincerely apologize for this disruption and know that it impacted our customers’ businesses as well as the experience of many who use the web. Although we have had some issues with this data center before, please know that we will do what it takes to improve its reliability and performance. We owe you an action plan to prevent this type of thing in the future, and we’ll get that to you as soon as it is ready.
Specific to this situation, here’s what we are doing right now:
- The data center is currently running on utility power.
- We are continuing our root cause analysis of yesterday’s generator failures. We have flown in our senior-level engineers from our global operations, and they are working with our external suppliers to determine the cause and how we can prevent this from happening again. We have the best outside experts from companies like Cummins, GE and Eaton.
- We have re-serviced and re-checked our UPS units.
- Tonight at 9:00 CDT we will continue our testing of the generator bank in question as we narrow down the variables to determine and remediate the root cause.
- Our Support teams will continue to work with all affected customers to ensure they’re up and running.
- We will continue to provide status updates on our customer portal (https://my.rackspace.com/) and on http://www.rackspace.com/blog/. A copy of the incident report that we sent to affected customers can be found at the following link. Though we typically treat our incident reports as proprietary information between us and our customers, we are publicly posting the report for this incident due to the high level of public interest that this incident has received.
I want to assure you that we are doing everything we can to bring this to resolution as quickly as possible. We appreciate your support and understanding. Our promise is Fanatical Support, we believe in it, and we will work with each of our customers to honor that promise.
CEO, Rackspace Hosting
Overview and status * June 29, 2009, 11:26 pm CDT
This afternoon our Dallas data center experienced power interruptions that caused downtime for a portion of our customers. These power interruptions were the result of a range of power infrastructure issues. Right now, the Dallas data center is stable and running on utility power. Our UPS units have been re-serviced and re-checked as of this evening, and we are in the process of doing the same with our generators.
We don’t have a lot of details on exactly what happened yet. When we have an outage, our first focus is on fixing it and getting customers online as soon as possible. Now that we have the near-term situation stabilized in Dallas, we have some work to do to improve our reliability. We will follow up with more information as we work through our root-cause analysis.
Although this outage only affected a portion of our customers in one of our nine global data centers, we consider any outage to be unacceptable. We sincerely apologize to our customers and those who were affected by this downtime. We didn’t serve you as well as we should have today. We are dedicated to Fanatical Support and providing world-class hosting to our customers. Rest assured that the entire Racker family is dedicated to determining exactly what our failures were, and how we can correct them. Thank you for your support on the phone, blogs, Twitter and other forums.
Status * June 29, 2009, 8:58 pm CDT: Section A of the Dallas data center is now back on utility power, and maintenance work on the UPS for that section is complete. No customers were impacted in this transition. We’ll provide further updates as information is available.
Update * June 29, 2009, 7:28 pm CDT: A prior update indicated that utility power was serving the entire data center. However, that update was incorrect in describing the status of one section of the data center (Section A), which is currently still running on generator power while we finish some work with the UPS for that section. After that work is complete, we will transition Section A back to utility power. Customers in Section A are stable while running on generator power, and we are taking every precaution in transitioning this section of the data center back to utility power. We will provide further updates as they become available.
Status * June 29, 2009, 5:55 pm CDT: The Dallas data center is now fully back on utility power. We’ll continue to provide updates as information is available.
Status * June 29, 2009, 5:30 pm CDT: Power has been restored to affected devices. However, some of the devices need to be manually brought back online, and this process is underway. The data center is currently running on a combination of generator and utility power. We apologize for the inconvenience this may have caused you or your customers, and more information will be presented as soon as it is available.