Cascading Failures

Well, I promised not to discuss my life in here, but since I'm about to use it as an example of generalized system failure modes, I figure it's okay.

The goal: from Waterloo on Thursday morning, travel to Montreal by Friday evening at 8 pm. By car, this is a 7 hour drive. No problem, right?

Well, we were going to rent a car and drive on Thursday afternoon, but it started snowing/raining/sleeting so we decided not to drive after all; instead, let's take the train, which is safer in bad weather. Since the night train leaves you kind of tired, we decided to take the Friday morning (9:30 am) train from Toronto.

Friday morning, the weather still sucked, but that's okay. We called a taxi at 6:20 am to take us to the bus station in Waterloo. At 7 am, it finally arrived - delayed by bad weather, of course. So we missed the 7 am bus to Toronto. No problem, there's an 8 am bus that should still make it to Toronto in time. Unfortunately, the 8 am bus showed up at 8:30 (bad weather), departed shortly afterwards, and got to Toronto at 10:00 (extra late - bad weather). No problem, though; we rescheduled our train tickets from the 9:30 to the 11:30 train. Luckily, we changed the reservation by cell phone from the bus, because by the time we arrived all the trains for the day were fully booked. Turns out all the airports were closed (bad weather) and the people taking flights had all switched to the train.

As we were picking up the tickets, they made an announcement that the 11:30 train would be leaving at 12:30 instead - bad weather. No problem: the 11:30 is supposed to get to Montreal at 4:45, so an hour later is 5:45, and even with additional weather delays I should certainly be in Montreal by 8 pm. So, we have some time, let's go for lunch.

At 12:20, we came back and found out that the train had left at 12:07, having been re-re-scheduled while we were gone. In fact, they had made the new announcement before we left the station, but because of a ridiculously loud random music performance (something about the Juno awards) in the middle of the station at the time, all the public announcements were inaudible.

Feeling guilty, they asked us to wait while they figured out what they'd do to get us to Montreal. The result: at 1 pm or so, we found out that they could squeeze us onto the 3:30 train (arrives around 9:30; useless) or a special 2:30 shuttle bus (could arrive at 8:30 in good weather; useless). So Via Rail wasn't going to be able to help.

Last chance: rent a car after all (there's a rental place at the train station) and drive it to Montreal. That takes at least 6 hours in good weather. By 1:30 we had almost finished filling out the rental forms, meaning that we could be in Montreal by 7:30 on a good day. Sadly, it wasn't a good day. (Interestingly, if we had known at 11:30 that we would miss the train, the rental would have saved us.)

I mentioned above that the airports were closed too (bad weather).

The Moral of the Story

Despite a metric tonne of backup plans (an extra day; an extra bus; an extra train; a backup train that should still have arrived early; a rental car if the train was cancelled) and room for slippage, we still didn't get to Montreal on time.

In management, we call this "slippage." In clustering, we call this "cascading failures."

The lesson to learn here is that if you're going to add redundancy (like the extra buses, trains, time, etc.) you'd best make sure that the same root cause can't screw up all of your backup plans at the same time. That means don't put a five-station Oracle database cluster on the same power circuit, don't write software that shuts down and expects the cluster to take over if it gets confused (because what if all the nodes get confused by the same thing?), and don't plug all your backup servers into the same Internet connection. For that matter, don't store them all in the same nuclear bunker in the Swiss Alps. If exactly the wrong thing happens, you'll be in trouble.
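If you want to check for this sort of thing mechanically, here's a rough sketch in Python. It assumes you can write down, for each backup, what it depends on; the function name, the dependency kinds, and the inventories are all made up for illustration. It just reports every dependency that all of your backups have in common, which is exactly the thing that can take them all out at once.

```python
from collections import defaultdict


def shared_failure_points(replicas):
    """Return the (kind, name) dependencies that every replica relies on.

    `replicas` maps a replica name to a dict of dependency kind -> name,
    for example {"power": "circuit-A", "uplink": "isp-1"}.
    Both the structure and the names are hypothetical.
    """
    if not replicas:
        return set()

    # Record which replicas use each dependency.
    users = defaultdict(set)
    for replica, deps in replicas.items():
        for kind, name in deps.items():
            users[(kind, name)].add(replica)

    # A dependency used by every replica is a common root cause.
    everyone = set(replicas)
    return {dep for dep, who in users.items() if who == everyone}


# Hypothetical inventory: three database nodes that look redundant.
cluster = {
    "db1": {"power": "circuit-A", "uplink": "isp-1", "site": "bunker"},
    "db2": {"power": "circuit-A", "uplink": "isp-2", "site": "bunker"},
    "db3": {"power": "circuit-A", "uplink": "isp-1", "site": "bunker"},
}

# The trip from the story, modelled the same way (also hypothetical).
trip = {
    "bus": {"sky": "friday-storm", "operator": "bus-line"},
    "train": {"sky": "friday-storm", "operator": "via-rail"},
    "rental car": {"sky": "friday-storm", "operator": "rental-desk"},
}

if __name__ == "__main__":
    print(sorted(shared_failure_points(cluster)))
    print(sorted(shared_failure_points(trip)))
```

An empty result doesn't prove the backups are independent; it only means nothing you thought to list is shared by all of them. The bad weather was obvious in hindsight too.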

2003-04-06