I'm chasing a kernel bug that occurs every 1700 device-hours or so of runtime. That amount of time is a combination of "annoyingly infrequent" and "way the heck too frequent if you have thousands of devices." Unfortunately, we found out about this bug only after upgrading those thousands of devices to the new, buggier software version, because it shows up so infrequently that we didn't notice it in testing.

Now, this is more or less good news, because it means we caught all our high-frequency problems and are on to low-frequency problems. We're able to clearly see this problem when we roll out to 1000 customers, and maybe there's no better way to detect such problems (although of course we're constantly improving our stress tests, etc.).

What happens when we move up to 100,000 customers? Obviously the answer is staged rollouts, where we roll out to only a sample of customers first, look for problems, then move on. At each stage, we can detect progressively less common new problems.
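To put rough numbers on what a given stage can detect, here is a minimal sketch. It assumes failures arrive independently at a constant rate (a Poisson process), which may not hold for a real kernel bug, and the function and parameter names are made up for illustration. The 1700 device-hours figure is the one from above.

```python
import math

def detection_probability(devices, hours, mtbf_device_hours):
    """Chance of seeing at least one failure in a stage.

    Assumes failures are a Poisson process with an average of one
    failure per `mtbf_device_hours` of accumulated runtime.
    """
    expected_failures = devices * hours / mtbf_device_hours
    return 1.0 - math.exp(-expected_failures)

def hours_to_detect(devices, mtbf_device_hours, confidence=0.95):
    """Hours a stage must run to see at least one failure with the
    given probability, under the same Poisson assumption."""
    return -math.log(1.0 - confidence) * mtbf_device_hours / devices

if __name__ == "__main__":
    # Compare a tiny stage with a large one over a single day.
    for devices in (10, 100, 1000):
        p = detection_probability(devices, hours=24, mtbf_device_hours=1700)
        print(f"{devices:5d} devices for 24h: P(at least one failure) = {p:.3f}")
```

Under these assumptions, a small stage needs to soak for a long time before a bug this rare is likely to show up at all, while a large stage will probably hit it within a day. That is the trade-off each stage size is buying.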

But how many stages, and how big should each one be? WWGD (What Would Google Do)? Intuitively, I feel like "one rollout phase per order of magnitude" might be a good option to start with, but who knows.
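As a sketch of what "one rollout phase per order of magnitude" could look like in practice, here is a tiny stage-size generator. The starting size of 10 and the factor of 10 are assumptions picked to match that intuition, not a tested recommendation, and `rollout_stages` is a hypothetical helper name.

```python
def rollout_stages(total_devices, first_stage=10, factor=10):
    """Return cumulative stage sizes, growing by `factor` each stage
    and always ending with the full population."""
    stages = []
    size = first_stage
    while size < total_devices:
        stages.append(size)
        size *= factor
    stages.append(total_devices)  # final stage covers everyone
    return stages

if __name__ == "__main__":
    print(rollout_stages(100_000))
```

For 100,000 customers this gives five stages, each one able to surface problems roughly ten times rarer than the stage before it, assuming each stage runs for about the same length of time.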

2013-04-18 »