Feynman on actual Quality "Assurance"

I just read Richard Feynman's 1986 analysis of the space shuttle Challenger explosion. I was surprised, first because it was completely readable and understandable, and second because he has such an enlightening way of looking at what went wrong - namely, NASA's quality control processes at the time.

The article is well worth reading for yourself, but two insights in particular stood out for me. The first is his analysis of the turbine design and testing process, which we in software would nowadays call "big bang integration" testing: design, build, and assemble all the parts, then test the completed unit. When it fails, try to guess what went wrong and work around it. At that late phase, you're testing the entire engine. The testing alone is expensive, and changes that would have been easy earlier (like redesigning the engine cavity for a different propeller shape) are now so expensive and slow as to be impossible. So the solutions implemented that late are necessarily band-aids that may or may not fix the root cause - sound familiar?

Then, as the schedule slips and slips, quality requirements are slowly relaxed out of desperation. His proposed solution is to thoroughly test each component repeatedly, early on, in order to reduce the cost of finding and fixing the bugs. That reduces the number of bugs found in the later integration testing phase... sound familiar?

He contrasts this with the way the software group did their testing. Essentially they were doing unit testing, pair programming, and rapid iterations - in 1986! at NASA! - but that's not the interesting part. What's interesting to me is the way they arranged their QA team.

Books like Peopleware talk about organizing your QA team as a group of adversaries to your development team. Their job is to cause trouble by breaking the developers' software. So far, so normal, but here's the problem: I've seen people try that, and I've tried it myself. It doesn't really work. It gives the developers an excuse for bugs that don't get caught. So right from the start, the developers get lazier and lazier, test their work less, and just leave the bugs for QA to find. As it gets more adversarial, you just get more arguments about what is and isn't a "bug." If you add cash incentives (e.g., a bounty for each new bug found, and another for each bug fixed), you just get people cooperating to cheat the system by deliberately leaving easy bugs to be found and then fixed - for more bounty on both sides. As a bonus, management's metrics all improve: look how many more bugs we're finding now!

So that's how it normally gets tried and fails. But hopefully you knew that already. What's interesting is how it didn't fail at NASA.

The idea was actually very simple: the expected number of bugs found by QA should be zero. That's because developers should bloody well be capable of testing the code for themselves.

The QA department was just supposed to assure quality (hence the name!) - quality already present before it got anywhere near their department. A bug found at that phase is very serious, because it means the development processes aren't working right.

Their QA process is a test of the development process, not of the software itself. This is a really revolutionary idea - at least to me. Maybe it's supposed to be obvious. If so, the obviousness of it was apparently lost on software developers sometime between 1986 and now.

In reality, NASA at that time had apparently documented six times - six, in the history of the whole project - that QA had found a bug. Those incidents were investigated and documented in detail, because each one was regarded as a critical failure of the development process. A critical process failure warrants detailed involvement from very high-level people. Policies get changed; people get promoted, demoted, or fired. Each of those bugs resulted in changes to the development process so that nothing like it could ever happen again.

Is that what your QA team is like?

2007-06-26