This has not been a happy week for my desktop workstation. It started off good: two brand new SATA 750 GB disks to replace my old 80 GB one. What a difference! Plus, this was finally the inspiration I needed to get organized and install Linux as my base system and Windows XP in VMware, instead of the other way around.
So the first thing I did was take out my 80 GB disk and put it aside for later, in order to avoid screwing up, which is what people tend to do during such activities. I booted a Debian Etch CD and proceeded to set up my two disks in a RAID-1 mirror, just like I had planned.
This is where I stop to complain about Debian Etch's installer. What the heck were you people thinking? Steal Ubuntu's, already! And the default mode is to install with LVM but *not* with a RAID device backing it? And in order to just set up a RAID on two identical disks, you have to go through like 25 steps in your awful installer UI? This is really, really disgusting.
pphaneuf's rule for flaming is that you should shut up and not complain unless you at least know a better way to do it. And better still, you should have done it better. Well, I have. Nitix's disk configuration stuff is fully automatic and pretty much foolproof. Sometimes it can be a little tricky to wipe out a disk that already has stuff on it. But that was on purpose. Other than that, it's just plain easy, and your disk gets configured with RAID and LVM layers included even if there's only one disk, because that makes things a heck of a lot more consistent, let alone making it really easy to add a RAID later.
Anyway, I got through the painful disk installer UI after a few tries (you have to reboot if you get it wrong! Genius!), and installed my system, and rebooted, and all seemed okay, for a while.
That's when I installed Firefox 3 and my system suddenly became prone to huge disk grinding delays. Every time a program started writing to the disk, Firefox would freeze, sometimes for many seconds at once. And Firefox 3, you see, also writes to disk, so this was no rare occurrence.
You see, Debian had installed my new ext3 filesystem using the kernel's default data=ordered option. data=ordered is one of those things that I'm shocked was allowed into the kernel at all, let alone made the default. Basically, it means that all relevant data will always be flushed to disk before its journal (metadata) entries can be flushed, so your files will never contain leftover garbage if your computer crashes between metadata and data updates. Sounds great, right? Well yes, until you think about how you actually have to implement that. The journal is always flushed sequentially, so if someone calls fsync() on a file, we try to flush its data first, then all the metadata changes that were made on disk up to and including this file's metadata. But the metadata leading up to this change, of course, can't be flushed until we also flush all the data related to all that metadata. This is a long story, but in short, it turns an fsync() of a 4k file into essentially a sync() of your entire disk. Result? Firefox 3, which fsync()s like 300 times a minute for no good reason, grinds to a halt. And your system performance is craptastic because basically your disk's write cache is gone.
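If you want to see the stall for yourself, here is a minimal sketch (not anything Firefox or the kernel ships) that writes 4k to a file and times only the fsync() call. The function name and the probe path are made up for illustration. Run it while something else is writing heavily to the same filesystem and compare the numbers.

```python
import os
import time

def timed_fsync_write(path, size=4096):
    """Write `size` zero bytes to `path` and return how long fsync() took, in seconds."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"\0" * size)
        start = time.monotonic()
        os.fsync(fd)  # the call that, under data=ordered, may wait on unrelated writes
        return time.monotonic() - start
    finally:
        os.close(fd)
```

For example, print(timed_fsync_write("/tmp/fsync-probe.bin")) gives one sample. A single number proves little, so take a few with and without background write load.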
After reading the mostly retarded discussion in the Bugzilla case, I gave up on the Firefox guys actually fixing the problem anytime soon. fsync() slightly less often?? That's a fix?? What part of "every time I fsync() it syncs ALL outstanding transactions immediately to disk" do you not understand?
So, I made a shared library called libnofsync.so that just makes fsync() do nothing, and used LD_PRELOAD to load that into firefox. My bookmarks are not a bloody ACID database! Nobody cares if they get corrupted! I only ever visit like three sites, and two of them are GMail! Get over it. With this hack, I suppose my Linux Firefox 3 is probably the fastest one around, because data=ordered or not, everybody else is doing multiple synchronous disk writes every time they try to load a page. Good grief.
The next task was to get rid of the completely obnoxious data=ordered setting, which involved messing in my grub configuration file(s). Grub, if you haven't heard me rant about it before, is a total complete pukefest. It does absolutely nothing of value that lilo doesn't do, but it does it in a way that's about 1000x more complicated. Then Debian layers another pile of crud on top. The long and the short of this is that I COULD NOT FIND A WAY to add "rootflags=data=writeback" to my kernel command line in any sort of permanent fashion. Now, you can edit the kernel command line during boot, and there's an option right there that says "savedefault" that sounds promising but does absolutely nothing, and (at least) one of the 1000 layers of garbage shoveled on top of grub overwrites and/or ignores my config file changes no matter how I try to make them.
So I uninstalled grub and installed lilo, after which things were trivially easy because lilo was not written by morons.
Around this time I tried out a 2.6.25 kernel from backports.org, which happily crashed my system repeatedly. What the heck was I thinking? The kernel hasn't actually been missing any significant features for something like five years. I don't know why I upgraded it, but I sure won't make that mistake again. Anyway, the crashes served mostly to teach me that if you crash your system while the RAID is rebuilding, it might not come back all by itself. Instead, it silently kicks the second disk from the RAID and leaves it idle, without actually telling anyone about this INCREDIBLY SERIOUS RELIABILITY LOSS. But that's par for the course by now. I added it back into my array and off I went.
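Since nothing told me the array had been degraded, here is a rough sketch of the check I'd want running somewhere. It scans /proc/mdstat-style text for a status like [U_], where the underscore marks a missing member. The sample text is hand-written for illustration, not captured from my machine, and the function name is made up.

```python
import re

# Hand-written sample in the usual /proc/mdstat layout (illustrative only).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0]
      732571904 blocks [2/1] [U_]

md1 : active raid1 sda2[0] sdb2[1]
      1951808 blocks [2/2] [UU]

unused devices: <none>
"""

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status line shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status looks like "[2/1] [U_]": expected/active counts, then one flag per disk.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded
```

On a real system you would pass it open("/proc/mdstat").read() from a periodic job and complain loudly whenever the list is non-empty.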
Fast forward to a few days later; it took some suffering before I bothered to fix the firefox nonsense and got sick enough of the random kernel crashes to downgrade my kernel back to Debian's stable version.(1)
That was about when I noticed that my new, fancy-pants RAID was rebuilding at 80 MB/sec, which is fabulously quick, and yet had still not finished rebuilding, after being left uninterrupted for more than a day. What? Well, let's "watch cat /proc/mdstat". Hey, check it out! It's almost done! 98%... 99%... 100%... 100%... 100%... 0%? Hey! What the heck is this! Okay, look at dmesg. Aha, it had a bad sector right at the very end of the disk!
And so it decided to start rebuilding the RAID from scratch!
About once an hour since I installed it at the beginning of the week!(2)
HA HA HA HA!
Now, that's not actually something that would solve the problem even if I had bad sectors. But of course, I don't really have any bad sectors. What I do have is some partitions that apparently Debian's installer screwed up while creating, so they happily run right past the end of the disk.
Now okay, Debian's partitioning thingy has an excuse for being buggy; probably nobody ever uses it to make a RAID, because it's sure the heck not easy to do. But here's the thing: mdadm let me create a RAID on a partition with inaccessible sectors. Then mke2fs let me create a filesystem on that broken RAID. Did it not occur to anyone to sample a few of the sectors before you decided to actually use them? Didn't any of you ever wonder why Windows does that weird thing about "testing sector accessibility" whenever you make a partition? No, apparently not. Gargle.
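To be concrete about what "sample a few of the sectors" might mean, here is a minimal sketch that tries to read the last few sectors a device claims to have. This is my guess at a reasonable sanity check, not what mdadm, mke2fs, or Windows actually do. The function name and the 512-byte sector size are assumptions.

```python
import os

SECTOR = 512  # assumed sector size

def tail_readable(path, claimed_bytes, samples=4):
    """Return True if the last `samples` sectors below `claimed_bytes` read back in full."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for i in range(1, samples + 1):
            offset = claimed_bytes - i * SECTOR
            if offset < 0:
                break
            try:
                data = os.pread(fd, SECTOR, offset)
            except OSError:
                return False  # the device reported an I/O error
            if len(data) != SECTOR:
                return False  # short read: we ran off the end
        return True
    finally:
        os.close(fd)
```

Pointed at a partition whose table entry runs past the end of the physical disk, a check like this should fail right away instead of a week later during a RAID rebuild.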
Okay, so obviously I need to make my partition a little smaller. Apparently it's possible to resize ext3 partitions and RAID devices now. That's good news, right? Well sure! Let's try that!
So I switched down to single user mode, remounted my rootfs read only, and ran "ext2resize /dev/md0". It told me a magic number, which is the number of sectors it's currently using. I reduced that number by an overly large factor (hey, I've got whole gigabytes of data to waste here!) and ran "ext2resize -v /dev/md0 NUMBER". It ground away for a while, giving me impressive yet scary messages about how it was moving inodes around, and so on. I figured it couldn't really do anything too harmful, since obviously the space at the end of the disk was nowhere near any of my actual data.
Boy, was I ever wrong.
I foolishly then ran "ext2resize /dev/md0" again to see if it would print out the new size. Except, it seems, that's not what it does. What it does is try to resize the partition again, this time to the maximum size. The maximum size is, as you recall, a size that involves some nonexistent sectors at the end of my partition.
So it moved a few more inodes around and then errored out. Ironically, ext2resize does apparently access that area at the end of the disk, even if mke2fs doesn't. Sadly however, it moves a bunch of crap around before erroring out and aborting midstream.
You might have imagined, as I might have, once, that ext2resize would maybe do a "hypothetical resize" operation, going all the way through the disk and confirming that everything would work - like the last sector it was about to resize into, for example - before it actually starts moving crap around. Or you might have imagined that it would undo those changes before it aborts due to an unexpected error. But if you thought that, then you, like me, would have been completely wrong. ext2resize does no such thing. Instead, it moves a bit of data around, and then aborts when it gets confused, halfway through the process.
As I learned, this makes your filesystem completely unusable. It turns out that, for no reason I can possibly imagine, moving around the completely empty unused section at the very end of my disk also involves rewriting inode 2 (which turns out to be the root directory), as well as an impressive number of other I-woulda-thought unrelated files and inodes. Of course, when you do this wrong, your filesystem stops mounting.
Time to boot the rescue disk one more time, pray a little, run e2fsck, and pray a little more.
e2fsck was not impressed. It correctly noted that inode 2 was, if I recall, "conflicted." Also lots of other horrible things. I asked it to repair everything. It crashed. Well, of course, it didn't crash exactly. It printed messages about "programmer error??" and then restarted the game all over again. I actually went through the game a few times before I finally caught on to the fact that it was the same every single time.
Luckily, when I bravely answered "no" to all the questions about whether it should fix things, it finally fixed things(3), complained that Oh God Your Filesystem is Still Broken Though, and exited. But whatever, my filesystem finally mounted again. I had to recover all my root-level folders from their new homes in /lost+found, but oh well, at least I still had my data.
And so I rebooted, and my RAID promptly started rebuilding itself again, and my filesystem was still corrupt, and it couldn't actually be fsck'd because e2fsck would still go into an infinite loop if I tried. Back to square one.
And now, this is the part where I flame myself, because I forgot something I already learned a long time ago and then sold to thousands of people:
Why the heck are you using a RAID in the first place, when you only have two disks, idiot? If you only have two disks, just back up your files to the second disk occasionally using rsync or something. That way, when you ext2resize or just delete a file by accident, the other disk doesn't reflect your idiotic mistakes until a while later. Remember?
So there you have it. All those things were dumb, but I was the dumbest of them all.
I dropped the second disk out of my RAID, repartitioned it correctly, ran mke2fs, copied all the files to the second disk, and booted from that. Done.
Actual Nice Things to Say
Lest it appear that I only hate things:
(1) Yes, Debian's stable 2.6.18 kernel actually works, and thus there's actually a reason they don't carry the highest-numbered one. Good job, guys.
(2) The guys who implemented the CFQ disk scheduler are my heroes. Unlike in the 2.4 kernel, where rebuilding your RAID hugely degraded your system performance, in 2.6 this operation happens at "idle" disk priority so it can go at pretty much full speed and yet have zero impact on your disk performance. That's why I didn't even notice for days that the rebuild was going on. Related tool: ionice -c3. Use it for your background compiles and stuff. It's awesome.
(3) You know what? I'm a big fan of e2fsck, even though it's supremely un-user-friendly and obviously had some bugs here. But despite those bugs, it didn't abort when it thought it had a "programmer error," and it saved my data from what I was sure by then was certain death. No program is perfect, but at least this one was written by sane people.
August 2, 2008 02:44
Weird. The LinuxHater guy wrote an article saying that wide-open bug trackers don't work, and said a bunch of things that I had been planning to say. I also made some related comments earlier. Here are a few more:
Lately, there's been some discussion about distributions like Ubuntu just closing out massive numbers of bugs with comments like "Please try it in the new version and reopen if the bug still exists." Naturally, users are offended by this: I played your game! I filed a bug like you asked! I included reproducible test steps! I even included a patch in some of them! Couldn't you at least do me the basic service of running through my reproduction steps before closing the bug?
The answer is: no, actually, they can't. There are way more users than developers and testers, and those developers are assembling and tracking bugs for all the software that will run on your entire computer. It is just too much work to do properly.
Not that this justifies the crappiness; it just explains it. My point is you can't just say, "They shouldn't have closed my bug!" as if it were a solution, because it's not. Developers also can't say, "Please just reopen this bug if it still exists!" as if it's a solution, because it's not either.
So far, nobody has offered a solution that would actually work.
August 5, 2008 20:18
Wow, my Linux disk-related rant seems to have been featured on Linux Hater. I am strangely pleased, although for the record, I hate all computers and all operating systems, not just Linux.
Now, I've been advised that another hot topic right now is some kind of ridiculous conspiracy theory about a laptop vendor who "deliberately" made Linux's ACPI implementation not work on their system, while Windows works fine. Without going to the unnecessary effort of looking up the article or checking any facts, I can tell you that I already thoroughly debunked all Linux-ACPI conspiracy theories a couple of years ago. You may find it either horrifying or funny.
August 5, 2008 21:12
Havoc Pennington has an excellent article on Return on Equity (ROE).
August 6, 2008 00:18
- Being an entrepreneur doesn't mean you're smart. A smart guy will figure out how to make a bunch of money while taking on much less risk. Corollary to this, there are a lot of dumb entrepreneurs out there.
It's like he can read my mind, then swears a lot.
August 17, 2008 20:40
At work we have a lot of servers, and sometimes they have problems. Even though *I* mostly just handle development servers, not production ones, it's still kind of a pain to keep track of them all and it's always embarrassing to find out from a co-worker that one of them has stopped working.
So I use a bunch of cron jobs to do a bunch of daily tasks and make sure things are flowing smoothly, but this isn't really all that optimal. The problem is that you don't want cron emails when tasks *work* (some of them are running every ten minutes!). But you *do* want emails when they don't work. On the other hand, you don't need an email every *ten minutes* when something breaks. And moreover, you really want to be notified when the server the task was supposed to run on has crashed and the job doesn't run at all.
Sadly, cron itself doesn't do a very good job of this, particularly that last part, where it's completely useless.
So I spent a few hours today whipping up cron2rss. It reads, saves, and eventually expires the stdout/stderr output from multiple cron jobs on multiple computers, then turns the result into two RSS feeds: one with everything, and one with only the failures. *And* it auto-inserts entries into the feed whenever it's been too long since one of the tasks has produced a log message. The RSS service can also be run on more than one computer at a time, so that if one of your RSS feeds dies, the others can still tell you about a failure.
I leave you with this food for thought:
0,10,20,30,40,50 * * * * ~/cron2rss/add test-website wget -O/dev/null http://versabanq.com
How do you set up something like that if you don't have my tool? What would you do if your test was more complicated? What if you had 50 servers instead of 1?
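For the curious, here is a rough sketch of the core idea: run a job, keep its output and exit status, and separately notice when a job has gone quiet for too long. This is not cron2rss's actual code. It keeps everything in an in-memory dict instead of on disk, skips the RSS part entirely, and all the names are invented for illustration.

```python
import subprocess
import time

def run_and_record(name, argv, log, now=None):
    """Run argv, capture stdout and stderr together, and append a record to log[name]."""
    now = time.time() if now is None else now
    proc = subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    entry = {"time": now, "ok": proc.returncode == 0, "output": proc.stdout}
    log.setdefault(name, []).append(entry)
    return entry

def failures(log):
    """Return (name, entry) pairs for every recorded run that exited non-zero."""
    return [(name, e) for name, entries in log.items()
            for e in entries if not e["ok"]]

def stale_tasks(log, max_age, now=None):
    """Return names of tasks whose newest record is older than max_age seconds."""
    now = time.time() if now is None else now
    return sorted(name for name, entries in log.items()
                  if not entries or now - entries[-1]["time"] > max_age)
```

The staleness check is the part plain cron can't give you: it has to run somewhere other than the machine being watched, or a crashed server will never report its own silence.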
August 18, 2008 23:09
Well, it seems to be my week for small, handy tools. As a followup to cron2rss from a few days ago, I now present to you: gitbuilder. It's an autobuilder tool for your favourite git-based project, with built-in bisection support.
I've been working on this for a while, but I finally decided to get it production-ready a few days ago after I spent several hours tracking down a problem... that turned out to already be caught by our unit tests. Except nobody had run the unit tests recently. Oops. Well, now I'll run them for you, thank you very much.
Check out a sample of the results at the Versaplex autobuilder page.
GitHub.com is way more addictive than I thought. It's what SourceForge and Google Code should have been: a really easy way for people to publish, fork, and merge source code, with a few extra hyperlinks thrown in here and there for good measure. The key thing is the non-global namespace, so you can just dump stuff there and give it to people whenever you want; it's like an extension of your local ~/src directory. Thanks to wlach for convincing me to try it.
August 23, 2008 03:28
Here's a nicely disgust-filled article about various problems with CSS. He seems bewildered by how something so bad could have been invented, but it's really very simple. The people who invented and then standardized CSS had never used it. After all, they invented it before any web browser supported it.
The best languages are invented by the people who are using them to do real work, while they use them to do real work.
I said something similar, with examples, back in 2004.
August 19, 2008 17:57
This guy continues to blow my mind. Some tips on formatting punctuation in a visually appealing way. Check out the section on how to center things horizontally. And naturally there's a bonus section on the proper formatting of smileys as punctuation.
Reading this stuff is a good way to remind myself why I'm a crappy visual designer.
And also via Art Lebedev: Leonardo da Vinci invented stinkbombs.
He also has a brilliantly simple way of expressing the success of a design in terms of its requirements. "Design is not something to discuss ... the only thing that can be discussed is whether the task has or has not been achieved."
Plus there's an article on designing well by designing badly on purpose. This method is ingenious, since it lets you imagine your users as super-intelligent enemies and then work to defeat them. Considering how hard it is to really believe how helpless users can be, overestimating them could be much easier.
August 19, 2008 18:23
Art Lebedev writes about his concept of a Unit of Sense: something new which you create and which requires intelligence to do.
He claims that the average "good designer" produces 1-2 units of sense per month. An excellent designer, maybe four. And an artistic director, at least 6. He also claims that a fair way to pay designers is to multiply their produced units of sense by a fixed dollar value.
"Things like good ideas, interesting concepts and fresh techniques are units of sense. A one-hundred-page brandbook may also be considered a unit of sense - not on a per-page basis, but as a whole."
The concept is really interesting, as it offers a way to quantify designer output, something that's notoriously hard to measure. Unfortunately, his description is itself a little hard to quantify.
Obviously this is hard to apply to anything other than designers, as the majority of work that most people do is perfectly valid work, but simply isn't creative. I guess they should still get paid, though.
One way to think of it is like this: as a programmer, perhaps the interest level you have in your job is a function of your natural unit of sense creativity level, as compared to your creativity level at this job.
Or maybe your job satisfaction is indicated by your creativity level, not the other way around.
August 19, 2008 22:38