When the backup fails...

July 25, 2007

The social web (Craigslist, Technorati, LiveJournal, TypePad, AdBrite, Second Life, Yelp and so on) broke down yesterday. It all started when Pacific Gas & Electric lost an electrical grid in downtown San Francisco, leaving much of the city in the dark. The outage should have caused the generators at 365 Main Street, a popular colocation facility, to start before the facility’s flywheel UPS system ran out of juice. Tragically, one or more of the facility’s generators failed to start, crashing hundreds of servers. It took 45 minutes to get the generators online, and even more time for various customers to reboot their servers and bring the ‘social web’ back online. Nightmare!

Back in the day, when I was running colocation facilities, I went through a very similar situation. Our facility at 2323 Bryan ran on a huge generator. We tested the system under load each month (as 365 Main claims to do), but soon after opening the facility we had an actual, unplanned power disruption. The generator failed to start. Evidently the contractor who installed the generator had made a very simple but hard-to-find mistake that caused the system to fail under certain circumstances. We had thrown the switch to cut off building power several times to make sure everything was working properly, but we had never simulated the conditions an actual failure would cause. Fortunately, we had only a few customers in the facility by the time we learned about our generator’s issues.
