200809 - apenwarr

Everything here is my opinion. I do not speak for your employer.

← August 2008

October 2008 →

2008-09-07 »

How to maximally insult Avery in the shortest amount of time

Make it not work properly when I print a PostScript file to a PostScript printer.

Oh yes, CUPS, that means you. You and foomatic and gnome-cups-manager and foomatic-gui and gutenprint and hpijs and... libsensors?! All of you. Yes, you. All of you in the ridiculously overcomplex newfangled Linux printing stack who can't do what "lpr" used to do in its default configuration with no filters installed. I hate you all.

Also, the "Print Test Page" button doesn't actually do anything, even if your printer is configured correctly. That was a really funny joke also. Ha ha. See me laugh.

2008-09-08 »

SATA vs. SCSI reliability

Here's a guy who discusses SATA vs. SCSI disk reliability. Short conclusion: actual disk failures (MTBF) are almost exactly as likely with cheap SATA disks as expensive SCSI disks. But the bit error rate of SATA is much higher. In other words, the likelihood of not being able to read a sector because it got corrupted on SATA is vastly higher than SCSI. By his calculation, on a 1TB disk, you have about a 56% chance of not being able to read every single sector, which means rebuilding your RAID correctly in the case of a failed disk is usually impossible.

It's true, and I learned this the hard way myself. Back at NITI, we ran into exactly this problem. Back in the days when we introduced our software RAID support, typical disk sizes were around 20 GB, about 50x smaller than they are now. (Wow!) The bit error rates now are about the same as they were then, which means that back then, assuming the failure percentage declines linearly with disk size⁽¹⁾, we had about a 1.1% failure rate in recovering a RAID.

In general, that 1.1% failure rate isn't so bad. Remember, it's 1.1% on top of the rather low chance that your RAID failed in the first place, and even then it doesn't result in total data loss - just some inconvenience and the loss of a sector here or there. Anyway, the failure rate was small enough that nobody knew about it, including us. So when we had about a 1.1% rate of weird tech support cases involving RAID problems, we looked into it, but blamed it on bad luck with hard drive vendors.

By the time disks were 200GB and failure rates were more like 10%, we were having some long chats with those hard drive vendors. Um, guys? Your disks. They're dropping dead at a pretty ridiculous pace, here.

You see, we were still proceeding under the assumption that IDE disks are either entirely good, or they're bad. That is, if you get a bad sector on an IDE disk, it's supposed to be the beginning of the end. That's because modern disks have a spare sector remapping feature that's supposed to automatically (and silently) stop using sectors when the disk finds that they're bad. The problem, though, is it has to discover this at write time, not at read time. If you're writing a sector, you can just read it back, make sure it was written correctly, and if not, write it to your spare sector area. But if you read it back and it fails the checksum - what then?

This is the "bit error rate" problem. It's not nice to think about, but modern disks just plain lose data over time. You can write it one day, and read-verify it right afterwards without a problem, and then the data can be missing again tomorrow. Ugh.

And the frequency - per bit - with which this happens is the same as ever. With SCSI it's less than with SATA, but as we have more bits per disk, the frequency per disk is getting ridiculous. A 56% chance that you can't read all the data from a particular disk now.

There are two reasons you probably haven't heard about this problem. First, you probably don't run a RAID. Let's face it, if your home disk has a terabyte of stuff on it, you just probably aren't accessing all that data. Most of the files on your disk, you will probably never access again. Face it! It's true. If you filled up a 1TB disk, you probably filled it with a bunch of movies, and most of those movies suck and you will never watch them again. Statistically speaking, the part of your disk that loses data is probably in the movies that suck, not the movies that are good, simply because the vast majority of movies suck.

But if you're using a RAID, you occasionally need to read the entire disk, so the system will find those bad sectors, even in files you don't care about. Maybe the system will be smart enough not to report those bad sectors to you, but it'll find them.

Secondly, even when you lose a sector here and there, you usually don't even care. Movies, again: MPEG streams are designed to recover from occasional data loss, because they're designed to be delivered over much less reliable streams than hard disks. What happens if you get a corrupt blob of data in your movie? A little sprinkle of digital junk on your screen. And within a second or so, the MPEG decoder hits another keyframe and the junk is gone. Whatever, just another decoder glitch, right? Maybe. Maybe not. But you don't really care either way.

The Solution

At NITI, we eventually settled on a clever solution to this that won't lose data on RAIDs. Of course we can't protect data on a non-RAID disk in any direct sense, but we strongly recommended that our customers do frequent incremental backups instead.

But on a RAID, the problem is actually easier: simply catch the problem before a disk fails. In the background, we would be constantly, but slowly, reading through the contents of all your disks, averaging about one pass per week. If your RAID is still intact but we find a bad sector, no data has been lost yet: the other disks in the RAID can still be used to reconstruct it. So that's exactly what we would do! Reconstruct the bad sector, and write it back to the failing disk which could then automatically use its sector remapping code to make the bad sector disappear forever.

The read-reconstruction part was never open sourced, so if you want that, you'd have to write it yourself. Luckily, it was easy, and now that we have ionice you don't have to be nearly as careful to do it slowly in the background.
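For the curious, here is roughly what "write it yourself" could look like. This is a toy sketch in Python, not the NITI code (which I can't show you): the disks are in-memory lists, the array is assumed to use single XOR parity, and the names ToyDisk, xor_blocks, and scrub are made up for illustration. A real version would read block devices slowly in the background and rely on the disk's own sector remapping when it writes the reconstructed data back.

```python
from functools import reduce


class ToyDisk:
    """In-memory stand-in for a disk; None marks an unreadable sector."""

    def __init__(self, sectors):
        self.sectors = list(sectors)

    def read(self, index):
        data = self.sectors[index]
        if data is None:
            raise IOError(f"unreadable sector {index}")
        return data

    def write(self, index, data):
        # A real disk would quietly remap the sector to a spare here.
        self.sectors[index] = data


def xor_blocks(blocks):
    """XOR equal-length byte strings together (single-parity reconstruction)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub(disks):
    """Read every sector of every disk and rebuild unreadable ones.

    Returns the (disk_number, sector_index) pairs that were repaired.
    If two disks are bad at the same index, the IOError propagates:
    with single parity that data really is gone.
    """
    repaired = []
    for index in range(len(disks[0].sectors)):
        for disk_number, disk in enumerate(disks):
            try:
                disk.read(index)
            except IOError:
                others = [d.read(index) for d in disks if d is not disk]
                disk.write(index, xor_blocks(others))
                repaired.append((disk_number, index))
    return repaired


# Demo: two data disks plus one parity disk, with one sector gone bad.
data_a = [b"abcd", b"efgh"]
data_b = [b"1234", b"5678"]
parity = [xor_blocks([a, b]) for a, b in zip(data_a, data_b)]
array = [ToyDisk(data_a), ToyDisk(data_b), ToyDisk(parity)]
array[1].sectors[0] = None   # simulate a sector that silently rotted
print(scrub(array))          # the (disk, sector) pairs that got repaired
```

The important property is that the repair happens while the RAID is still intact, so every bad sector found still has good copies elsewhere to rebuild it from.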

The other part was to make sure Linux's software RAID could recover in case it ran into individual bad sectors. You see, they made the same bad assumption that we did: if you get a bad sector, the disk is bad, so drop it out of the RAID right away and use the remaining good disks. The problem is that nowadays, every disk in the RAID is likely to have sector errors, so it will be impossible to rebuild the RAID under that assumption. Not only that, but throwing the disk out of the RAID is the worst thing you can do, because it prevents you from recovering the bad sectors on the other disks!

A co-worker of mine at the time, Peter Zion⁽²⁾, modified the Linux 2.4 RAID code to do something much smarter: it would keep a list of bad sectors it found on each disk, and simply choose to read that sector from the set of other disks whenever you tried to read it. Of course it would then report the problem back to userspace through a side channel, where we could report a warning about your disks being potentially bad, and accelerate the background auto-recovery process.
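To make the idea concrete, here is a userspace toy in Python. It is emphatically not Peter's kernel patch: it assumes a simple mirror rather than a parity RAID, and the names MirrorMember, TolerantMirror, and on_bad_sector are invented for this sketch. The on_bad_sector callback stands in for the side channel to userspace.

```python
class MirrorMember:
    """In-memory stand-in for one disk of a mirror; None marks a bad sector."""

    def __init__(self, sectors):
        self.sectors = list(sectors)

    def read(self, index):
        data = self.sectors[index]
        if data is None:
            raise IOError(f"unreadable sector {index}")
        return data


class TolerantMirror:
    """Mirror that remembers bad sectors per disk instead of ejecting the disk."""

    def __init__(self, members, on_bad_sector=None):
        self.members = members
        self.bad = [set() for _ in members]   # per-disk bad-sector list
        self.on_bad_sector = on_bad_sector    # hypothetical side channel

    def read(self, index):
        for number, member in enumerate(self.members):
            if index in self.bad[number]:
                continue  # known bad on this disk; use another copy
            try:
                return member.read(index)
            except IOError:
                # Remember the bad sector, report it, and keep the disk.
                self.bad[number].add(index)
                if self.on_bad_sector:
                    self.on_bad_sector(number, index)
        raise IOError(f"sector {index} is unreadable on every disk")


# Demo: both disks have a bad sector, yet every sector is still readable.
reports = []
mirror = TolerantMirror(
    [MirrorMember([None, b"two"]), MirrorMember([b"one", None])],
    on_bad_sector=lambda disk, sector: reports.append((disk, sector)),
)
print(mirror.read(0), mirror.read(1), reports)
```

Under the old "one bad sector means a bad disk" assumption, both of these disks would have been thrown out of the array and the data would be gone. Keeping them in and routing around the bad spots loses nothing.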

Sadly, while the code for this must be GPL as it modified the Linux kernel, the old svn repository at svn.nit.ca seems to have disappeared. I imagine it's an accident, albeit a legally ambiguous one. But I can't point you to it right now.

I also don't know if the latest Linux RAID code is any smarter out of the box than it used to be. As we learned, it used to be pretty darn dumb. But I don't have to know; I don't work there anymore. Still, please feel free to let me know if you've got some interesting information about it.

Footnote

⁽¹⁾ Of course the failure rate is not exactly a linear function of the disk size, for simple reasons of probability. The probability of a 1TB disk having no errors (0.44, apparently) is actually the same as the probability that all of a set of 50x 20GB disks has no errors. The probability of no failures on any one disk is thus the 50th root of that, or 98.4%. In other words, the probability of failure was more like 1.6% back in the day, not 1.1%.

⁽²⁾ Peter is now a member of The Navarra Group, a software contracting group which may be able to solve your Linux kernel problems too. (Full disclosure: I'm an advisor and board member at Navarra.)
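If you want to check the arithmetic in footnote ⁽¹⁾ yourself, here is a short Python sketch. The 0.44 and the 50x size ratio are the figures quoted above; the variable names are just for illustration, and the whole thing inherits whatever is wrong with the 0.44 in the first place.

```python
p_clean_1tb = 0.44   # chance of reading every sector of a 1TB disk, as quoted
size_ratio = 50      # a 1TB disk is about 50x the size of a 20GB disk

# Naive version: divide the 1TB failure chance evenly across 50 small disks.
linear_estimate = (1 - p_clean_1tb) / size_ratio

# Footnote version: fifty independent 20GB disks must all be clean,
# so p_clean_20gb ** 50 == p_clean_1tb.
p_clean_20gb = p_clean_1tb ** (1 / size_ratio)
root_estimate = 1 - p_clean_20gb

print(f"linear estimate:   {linear_estimate:.1%}")
print(f"50th-root version: {root_estimate:.1%}")
```

The two lines should come out at about 1.1% and 1.6%, matching the numbers in the text.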

Update 2008/10/08: Andras Korn wrote in to tell me that the calculations are a bit wrong... by about an order of magnitude, if you assume a 1TB RAID. Though many modern RAIDs will be bigger than that, since the individual disks are around 1TB. Do with that information what you will :)

Update 2008/10/30: I should also note that my calculations are wrong because I misread what the Permabit guy calculated in the first place, and didn't bother to redo the math myself. His calculations are not off by an order of magnitude, although there is some disagreement about whether the correct number is 0.44 or 0.56.

2008-09-13 »

Phrase of the day

...in the vortex of criticality.

General Alexander Haig

My challenge to you: on Monday, try to use the term "vortex of criticality" in a serious discussion at work.

2008-09-14 »

2008-09-18 »

The future is... boring

mcote has an interesting point about the increasingly widespread hypothesis that nothing new has happened lately, specifically, "no technological advancements in the last 20-30 years have significantly changed human lives, at least in the NorthWest."

That's a strong statement. Read his article to see where it comes from.

Now I'm going to tie it to something totally unrelated that has also been annoying me lately: PC operating systems. Specifically, the Linux Hater, the Vista disaster, and articles like IE 8 consumes more RAM than Windows XP.

What the heck is going on there? I mean, the PC is from the last 20-30 years, so it's not exactly the same argument. But it's the same effect on a shorter scale: nothing really new has happened in operating systems since Windows 95. You can tell because most of your apps would still run on Windows 95, if you installed enough upgraded DLLs to make the swooshy graphics work. Where there are exceptions, there are usually just lazy programmers who didn't test⁽⁰⁾ for or fix the minor compatibility bugs, not fundamental new technologies that weren't available back then.

That's 13 years ago. Nothing new has happened in desktop or server operating systems for 13 years. No, virtualization is not new. Yeesh.

It's because we're out of ideas. We built houses; now most houses are bigger than we need. We bought clothes; now clothes are out of fashion long before they wear out⁽¹⁾. We made cheaper, better food; people started eating less rice and more fat. We reduced the work week to 40 hours; now people spend their "free time" torn between idleness and stupidity. We let people retire earlier; now they get bored and start new businesses.

What's the pattern here? Technology fixes a problem, and then it overfixes it. In Canada and the U.S., we've all been safe from starvation or freezing to death for decades⁽²⁾, yet food and housing development continues. Why? Because we don't know how to stop.

Computers again. We made text editors; text editors expand until they can read email. We made web browsers; now web authors spend half their time choosing an optimal shade of blue and tweaking animation timings. We made software installers with automatic downloading and dependency checking; now systems like Debian split each package into infinitesimal pieces just because they can. We made spreadsheets; they were done by 1995, so we added Clippy instead. We made fancy GUIs with detachable, customizable toolbars and subwindows; now we have non-detachable, non-customizable ribbons and tabs. We made email, then newsgroups, web forums, blogs, and now twitter; the same thing over and over.

And yet we keep trying. Technology fixes a problem, and then it overfixes it - in desktop computing just like in everything. Windows 95 was it. And if Windows 95 wasn't it, then Unix was, or MacOS. What are we still doing here?

We don't know how to stop. That's all. Linux will never succeed on the desktop because nobody needs anything new from their desktop. And technology will never change society because society doesn't know what to do except optimize itself for making more technology. Our society is already very good at that, but now China's is better. Who cares? We were too efficient already.

Let's face it: society is bored, and technology is boring.⁽³⁾

If you want something new, it's going to have to be really new.

Footnotes

⁽⁰⁾ And rightly so. Nobody runs Windows 95.

⁽¹⁾ Actually, clothes don't always outlast their fashion. Nowadays clothes wear out fast, because we've invented fascinatingly cheap, delightful, weak new materials. Nobody complains. Does it feel less wasteful this way?

⁽²⁾ There is a vanishingly small fraction of people in Canada who are not safe from these things. We know this is true or they would have frozen to death already. Hmm. Okay, maybe it needs a little more work.

⁽³⁾ As it happens, I am not personally bored by technology, because I'm one of the people boring you with it. Yet I still don't want a shorter work week, or to retire early, or the newest version of OpenOffice, Excel, Windows, Linux, Gnome, MacOS, or even ion.

P.S. Calling the Internet "an extension of the invention of the printing press" is like calling microelectronics "an extension of the invention of fire." It's true, but you need a new category system so you can draw useful conclusions.

2008-09-19 »

Apple isn't open?! Who knew?

This is an awesome rant about how Apple's iPhone store restrictions aren't so bad.

I know, the world would suck if nobody could sell software without permission. But they can, and it still sucks.

Mystery.

2008-09-20 »

iTunes Music Store. Diagnosis: Insane

Okay, I finally gave in. After reading about the fact that iTunes Plus now sells you DRM-free music for the same price as DRMed music, I figured, great, I don't have to boycott them anymore. As it happens, I had no interest whatsoever in downloading music from the iTunes music store in particular, but I have a few artists I figured deserve my money, and I've been too lazy to visit a real music store for months, so I figured this would be a good way.

Incidentally, there is absolutely no way to legally buy downloadable music online in Canada except iTunes. What the heck? Amazon, the other choice in the U.S., makes you download and install their idiotic auto-download tool before they give you any hint that they don't even offer their service in Canada. And nobody else exists at all. iTunes was last on my list of choices, but it was the only one.

Okay, whatever. Everybody else is asleep at the switch, and I'm no big iTunes fan, but they'll sell me DRM-free music - in Canada - for a reasonable price. I already have a Mac and iTunes and an iPod, and I've confirmed that Linux really can play DRM-free AAC music just in case, so I'm not getting any more locked in than I already was. Sign me up.

That's when I made my next mistake: I bought and downloaded DJ Tiesto's In Search of Sunrise 7 without reading the iTunes online reviews first.

Here's what the reviews told me (and they were right): there are only 24 of the 28 tracks from the CD, and they're all separate, not mixed.

Hold on a second, my rage is building.

Not. Mixed?

THEY TOOK IN SEARCH OF SUNRISE 7 AND UNMIXED THE TRACKS AND THEN SOLD ME ONLY SOME OF THEM WHILE CLAIMING IT WAS THE WHOLE ALBUM.

They sold me A TIESTO ALBUM WITH THE TRACKS UNMIXED.

What the crap?!?

For those of you who haven't heard of Tiesto, here's what he does. He picks a bunch of tracks that he likes, then mixes them together into a really awesome continuous mix where you can barely tell that one track blends into another. In other words, his value added is that he's a DJ. He mixes tracks.

They sold me his CD, but with the tracks unmixed.

Speechless.

That was my first, and now last, iTunes Music Store purchase. Good grief. If I stole the music using P2P, I would have gotten better service.

Horrified.

Update 2008/09/22: Of course I emailed iTunes Store support about this. Their first response was to refund my money and give me a free credit and say they were going to continue looking into it, all of which is commendable. It doesn't make the problem itself any less insane - the process by which some producer somewhere authorized Tiesto's removal from a Tiesto CD is mind boggling - but I give them credit for good customer service. Meanwhile, I guess I still won't be buying more music from iTunes, but not because I'm angry; it's because I listen to mostly mixed electronic music, and I have no way of knowing they won't randomly sabotage it.

Updated diagnosis: still insane, but nice.

2008-09-22 »

I want this

Yup, Joel is Right. Presumably ThinkGeek will be coming out with one shortly.

2008-09-27 »

Newsflash! Telecommuting actually works... sort of

I've long been an opponent of having workers outside the office, at least as far as programmers are concerned. This was for the most practical reason of all: I tried it. It didn't work.

Let's be a little more specific. A few years ago, I was managing the development team at NITI, a company that I founded while in university. (It was later acquired by IBM.) One of the developers wanted to go live in Europe for a while, but asked to keep working for us while he did. Okay, I thought, that sounds reasonable. He's a hard worker, he gets stuff done, and I'm a lousy supervisor anyway. Working from Europe won't be much different from working at a desk across the room, if most of our communication is through email or the bug tracking system anyhow.

It didn't work. Not because he did anything bad or didn't get any work done; he got work done just fine. It didn't work because it didn't work the way people -- I, he, and everyone else at the company -- naively expected it to work.

Working from home is inherently demotivating. Not just because you don't have a boss breathing down your neck. Admittedly, some people do need that, and to those people I say: Stop reading now. You're not cut out for this. Get an office job, or suffer the consequences.

No, motivation comes from more than just an overzealous manager. It comes from feeling the need to be at a particular place at a particular time; the need to get dressed and shower before you start work; the need to show your co-workers that you're not a slacker; the need to brag about your latest brilliant accomplishment over a game of foosball. In short, it comes from peer pressure (which I like to call mutual motivation). Even if your peers don't really care, which ours mostly didn't, the feeling of peer pressure is all the motivation most people need, and it's great.

The view from the other side is just as bad. Even peers who normally wouldn't care what you're doing can't help but wonder. "Oh, he's off gallivanting in Europe - how much work can he possibly be doing?" Or maybe they see some code you checked in, and it looks superficially like crap, and you're not there to explain it, so they're free to assume that it is crap. Suffice it to say, things go downhill fast.

Now, in this particular case, we were able to keep things in order. You could say I was sensitive to being sensitive to these things, so I made sure to keep perspective. Eventually it didn't work out because he didn't like Europe as much as he thought he would, so he came back to the office.

NITI R&D as a whole had a grander version of the same problem: all the developers were working remotely, in the sense that the R&D office was run independently in a different city than the head office. The result? People at head office always felt like we were off gallivanting in Montreal; how much work could we possibly be doing? And whenever a bug, no matter how difficult, wasn't solved right away, it was because we weren't accessible, or because our flex hours were too flexible, or because some of the developers were just lazy, or whatever. This was all basically imaginary, but you can see why they'd think that; we thought the same about them.

Meanwhile, the truth was that we didn't have the daily exposure to customers that we would have if we were in the head office, and that did have a huge negative impact on our ability to respond. We responded really fast to a lot of things, but there was no way to focus on what was really important to people at head office, because we weren't right there to observe it for ourselves.

But that's another story. What I want you to see is how it just didn't work. Nowadays, there's no more Montreal R&D office; head office shut it down, and I supported the decision to do it.

That's the bad news. Here's the good news: I was dumb enough to try it again, but differently this time. This time, it does work. I think I know why.

At Versabanq I wanted to hire some of my former NITI co-workers to do development work for me. After all, I've worked with them, I know them, and they know what I want. The only problem is, being based in Ontario, we had no particular reason to open an office in Montreal. And I didn't want to make the same mistake as last time, with the same predictable outcome.

What I did instead is help them work together to form The Navarra Group, an independent software contracting company. In other words, I'm not only a Navarra co-founder, but I'm also a client :)

How to make telecommuting work

The first problem with telecommuting is the name. You're not commuting. Commuting implies going to an office, and telecommuting implies pretending to go to an office. Telecommuting doesn't work, because you can't be there without being there. It's a contradiction in terms. Just forget it.

Instead, you have to clarify what you really want. What I really wanted was: a bunch of programming work done according to my specifications with at least the same level of quality and reliability I would get from hiring full-time employees, without having to manage a bunch of full-time employees. What they really wanted was: to do really good work for a fair price in a fun social environment.

Here are some things we had to watch out for:

Don't charge an hourly rate. I know, everybody always charges an hourly rate for outsourcing. Let it go. Hourly rates are the source of most jealousy around remote workers. After all, how do you really know that their 40 hours of work is as good as my 40 hours of work? What if they're only working five hours a week and slacking off the rest of the time? The only answer that works is: who cares? We're not paying them by the hour. We're paying for output, and we agreed that the output they're producing, by the agreed-upon deadline, is worth the amount we're paying them. End of story. Most salaried employees will be glad you're not holding them to such standards.
Don't pretend you're part of the office culture. You're not. That part of your life is over. All those cautionary tales about remote workers missing out on promotions, not having their work recognized, missing out on the chat around the water cooler, not being invited to parties? All true. If you want those things, go work at the office. If you think that you'll get promotions and raises just because your boss will recognize the quality of your work in absentia, you're doomed. As a remote worker, you are an outsider. That's what "remote" means. And yet...
Don't pretend you don't need a work-related social environment. Programmers are introverts, right? They don't need people, right? Well, there are a few anecdotal programmers out there who can survive 24 hours a day in their basement. Trust me, you're not one of them. If you try it yourself, you will go stark, raving bananas. It might take a few months, but you will. You will need to feel like your work is a part of something, or it will feel like a black hole. But if you're not part of the social environment back at the office, then what are you a part of? Well...
Create a social environment out of other remote workers like you. The reason you don't fit in at a normal employer is that all the employees aren't like you. You're different. But as a contractor, you're not different from other people doing the same thing as you. Like you, they're not interested in sucking up to someone to get a promotion, but they have the same needs you do: a social working environment, a reason to get dressed in the morning, someone to brag to, someone to play foosball with.

It's early yet. Right now, the social environment at Navarra consists of weekly meetings, daily irc, working together at different members' homes, and random social gatherings in Montreal. Eventually, perhaps Navarra will get a real office, so people have a place that isn't home where they can go to work. Or maybe they'll rent some space at a coworking facility with some like-minded individuals.

It's not all perfect yet, but it feels to me like it's on the right track. People aren't 100% motivated to work all the time, but as a client, I don't have to pay for that; I'm just paying the flat rate. Which means that, far from being something to feel guilty about, slacking off for a day without answering to anyone is an awesome job perk.

And the best part? No matter how stupid your client gets, they can't ruin your working environment. It's not theirs anymore.

2008-09-29 »

Have journalists declare their political preferences?

So, we all know that reporting is biased, because there's no such thing as an objective viewpoint. In Canada we try to keep ourselves reasonably assured that the CBC, for example, reports political discussion fairly. In my opinion, it seems to work for us. But how can I know for sure? How can even they know they're being fair?

Rick Bookstaber has a simple idea that makes you think: include political preferences of the reporter/editor in each story's byline.

The bad news is that this would fundamentally change the idea of journalism; on the other hand, maybe that's not such a bad idea at this point.

Update 2008/09/30: A convenient example of what I mean, via Peter Norvig. He admits his bias up front, then proceeds to produce as unbiased a report as he can manage. (And he does pretty well at it!) I wish such a thorough resource existed for the Canadian election.


I'm CEO at Tailscale, where we make network problems disappear.

Why would you follow me on twitter? Use RSS.

apenwarr on gmail.com