Great moments in probability

Years ago (around 1999), when dcoombs and I were debugging the first versions of our "weaver" Linux-based server appliances from our apartment in Waterloo, we used to test on the cheapest hardware we could obtain.

One of these boxes absolutely refused to boot weaver, but the symptoms were strange. We had three ways of booting: boot from a CD, install an image on the hard drive and boot that, or load Etherboot from a floppy and use that to network-boot the kernel over TFTP.

The symptoms were as follows:

Booting from CD worked fine.
Installing from CD to the hard drive and booting that worked fine.
Booting a weaver image from the hard drive (with a kernel downloaded via FTP) always gave a kernel decompression error.
The Etherboot TFTP process would always abort with a timeout after a few packets. (Etherboot of the era would do that occasionally even on a good day, but here it happened every time.)

The obvious conclusion here was that our weaver kernel image was broken, because you could boot the Debian kernel from either CD or hard disk without a problem. Right?

Well... as it turned out, no. The actual problem was a horribly broken network card that would randomly corrupt bits. About 9 out of 10 packets would be corrupted. You'd think that would be obvious, right?

Well, no. In fact, TCP/IP is specifically designed to deal with the occasional corrupted packet. TCP and UDP have a 16-bit checksum on every packet, and if it doesn't match, the receiver simply throws the packet away. With TCP, the sender is then supposed to resend (and it does!).
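To make the checksum idea concrete, here is a minimal Python sketch of a 16-bit one's-complement checksum in the style of RFC 1071. It is an illustration only: real TCP and UDP checksums also cover a pseudo-header, which is not modeled here, and the payload and variable names are made up.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Hypothetical payload, purely for illustration.
original = b"weaver kernel image!"

# Corruption 1: flip a single bit in the first byte.
bit_flip = bytes([original[0] ^ 0x01]) + original[1:]

# Corruption 2: swap the first two 16-bit words.
swapped = original[2:4] + original[0:2] + original[4:]

detected = internet_checksum(bit_flip) != internet_checksum(original)
missed = internet_checksum(swapped) == internet_checksum(original)
print("bit flip detected:", detected)
print("word swap missed:", missed)
```

The single bit flip changes the sum, so the receiver would drop that packet. The word swap leaves the sum unchanged because addition does not care about order, so that packet would be accepted even though its contents are wrong. With only 16 bits of checksum, some corruptions always get through.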

I had noticed the FTP transfers were surprisingly slow, but not that slow, and back in those days, you could never quite remember if your network card was 10 MBit or 100 MBit. This happened to be a 100 MBit card, but 9/10 packets were getting thrown away, so we got around 10 MBit performance from ftp.

But here's what killed us: a 16-bit checksum only catches about 65535 out of every 65536 corrupted packets. A 9/10 error rate means you're sending 10x as much data as you think you are, so a 12MB kernel+rootdisk package is actually about 120MB of packets; that is, about 80000 packets at 1500 bytes each. Roughly 72000 of those arrive corrupted, and on average about one of them slips past the checksum. Thus, a typical transfer was destined to have a tiny number of incorrect bytes! Ha!
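As a rough check on the arithmetic above, here is a minimal Python sketch. It assumes each corrupted packet independently has a 1-in-65536 chance of matching its checksum, which is an idealization of how real corruption behaves; the function and parameter names are made up for illustration.

```python
def undetected_corruption_estimate(payload_bytes, packet_bytes=1500,
                                   resend_factor=10, checksum_bits=16):
    """Estimate how many corrupted packets slip past the checksum.

    Assumes (idealized) that a corrupted packet passes the checksum
    with probability 1 / 2**checksum_bits, independently of the others.
    """
    # With 9 out of 10 packets corrupted, about 10x the payload is sent.
    total_packets = payload_bytes * resend_factor // packet_bytes
    good_packets = payload_bytes // packet_bytes
    corrupted_packets = total_packets - good_packets

    p_slip = 1 / 2 ** checksum_bits
    expected_undetected = corrupted_packets * p_slip
    p_at_least_one = 1 - (1 - p_slip) ** corrupted_packets
    return total_packets, corrupted_packets, expected_undetected, p_at_least_one


# 12MB kernel+rootdisk package, as in the story.
total, corrupted, expected, p_any = undetected_corruption_estimate(12_000_000)
print(total, corrupted, round(expected, 2), round(p_any, 2))
```

Under those assumptions, the expected number of undetected bad packets comes out to about 1.1 per transfer, and the chance of at least one is roughly two in three. That is plenty to break a compressed kernel image over a handful of download attempts.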

Of course TFTP is extra dumb and doesn't deal well at all with packet loss, so it would just time out. But I remain very impressed at how well TCP managed to paper over a 90% broken network. That's the power of the Internet for you, right there.

(Thanks to jwz for having a hopefully-unrelated problem that reminded me of this.)

2008-05-09