San Francisco Power Failure Killed Many Web Sites
Wow! It turns out that Tuesday was a great day to not be working as a SysAdmin in San Francisco:
(I like what Yelp have done with their down page.)
The short story is that an underground transformer exploded downtown and the 365 Main data center failed to start its generators automatically. They had to be started manually, which cut power for nearly an hour for some customers, many of them smaller, trendier web sites like Craigslist, LiveJournal, Yelp and others. (I have interviewed with half of the companies mentioned in Scott’s post.)
You do not want to lose power across a production-class network. This can cause equipment failure, disk errors, servers that delay boot because they need to run disk consistency checks, servers that stall at boot complaining about a missing keyboard, or whatever. Some services may wedge up because when they started they couldn’t talk to the database . . . in some cases you may have machines that have been running for a few years, and that were last rebooted three SysAdmins ago. The running state may be subtly different from the boot state, with no documentation . . .
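One common way to keep a service from wedging when its database isn't up yet is to have it wait and retry at startup instead of failing on the first attempt. Here is a minimal sketch of that idea, not a description of what any of the affected sites run. The function name `wait_for_dependency`, its parameters, and the `check` callable are all hypothetical, made up for illustration.

```python
import time


def wait_for_dependency(check, attempts=5, initial_delay=1.0, max_delay=30.0,
                        sleep=time.sleep):
    """Call check() until it returns True or we run out of attempts.

    check is any zero-argument callable that returns True when the
    dependency (say, the database) is reachable. Everything here is a
    hypothetical helper for illustration, not a real library API.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            if check():
                return True
        except OSError:
            # Connection refused, timeout, and similar errors: treat
            # them as "not ready yet" and keep trying.
            pass
        if attempt < attempts:
            sleep(delay)
            # Double the wait each time, up to a ceiling, so a rack full
            # of rebooting servers doesn't hammer the database.
            delay = min(delay * 2, max_delay)
    return False
```

A service's startup script could call this with a `check` that tries to open a database connection, and only continue once it returns True. The retry counts and delays are placeholders; sensible values depend on how long your database takes to come back after a cold boot.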
A few years ago I had a chance to rebuild a production network from the ground up, with a decent budget to do everything the right way: redundant network switches, serial consoles, remote power management . . . I remember talking to my manager about whether we might want a UPS in each rack. We figured that the data center is supposed to keep the power running, or else. Also, if the data center loses power then we lose our network access anyway . . . perhaps the whole point of this post is that data centers do lose power, so a UPS can be worthwhile. If nothing else, it may leave your systems up and ready to go as soon as the network is restored.
Data centers have UPSes too. Huge ones that you may get to walk through on a tour. The purpose of the UPS is to provide battery power between the time utility power fails and the time on-site generators begin to provide energy. I don’t know enough to comment on this particular case, but I do recall touring a data center in Emeryville, where the guy explained that batteries become less effective over time, and that a lot of data centers fail to test their batteries regularly. When batteries are wired in series, one bad battery brings down the entire UPS, and so even though you have a generator on-site, the UPS can fail before you manage to transfer to generator power. While this stuff is beyond my expertise, I’m inclined to believe that this is what happened at 365 Main yesterday: a data center should not only test its failover-to-generator procedure on a regular basis, it also needs to ensure sufficient battery capacity to keep systems running during the time it would reasonably take to switch to generator power.
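To put rough numbers on the series-wiring point, here is a back-of-the-envelope sketch. It assumes each battery fails independently of the others, which is a simplification, and the figures in the usage note are made up for illustration rather than taken from 365 Main or any real UPS. Both function names are hypothetical.

```python
def series_string_reliability(per_battery_reliability, batteries_in_series):
    """Chance that every battery in a series string is healthy.

    Assumes independent failures: the string works only if all of its
    batteries work, so the individual probabilities multiply.
    """
    if not 0.0 <= per_battery_reliability <= 1.0:
        raise ValueError("per_battery_reliability must be between 0 and 1")
    if batteries_in_series < 1:
        raise ValueError("need at least one battery")
    return per_battery_reliability ** batteries_in_series


def ups_can_bridge(battery_runtime_s, generator_start_s, safety_margin=1.5):
    """True if battery runtime covers generator start-up with some margin.

    safety_margin is an arbitrary illustrative factor, not an industry
    standard.
    """
    return battery_runtime_s >= generator_start_s * safety_margin
```

With a hypothetical string of 40 batteries that are each 99% likely to be healthy, `series_string_reliability(0.99, 40)` comes out to roughly 0.67, so the string as a whole has about a one-in-three chance of containing a bad battery. That is why regular battery testing matters even when each individual battery looks quite reliable.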
Update–July 27: Earth2Tech points out that 365 Main uses newer, ecologically conscientious flywheel technology to maintain current during the switch from utility to generator power, and speculates that the flywheel may have played a part in the power failure. Their writeup is very good, overall.