ext2resize sucks, but apparently so do I, and also a lot of other things too
This has not been a happy week for my desktop workstation. It started off
good: two brand new SATA 750 GB disks to replace my old 80 GB one. What a
difference! Plus, this was finally the inspiration I needed to get
organized and install Linux as my base system and Windows XP in VMware,
instead of the other way around.
So the first thing I did was take out my 80 GB disk and put it aside for
later, in order to avoid screwing up, which is what people tend to do during
such activities. I booted a Debian Etch CD and proceeded to set up my two
disks in a RAID-1 mirror, just like I had planned.
This is where I stop to complain about Debian Etch's installer. What the
heck were you people thinking? Steal Ubuntu's, already! And the default
mode is to install with LVM but *not* with a RAID device backing it? And in
order to just set up a RAID on two identical disks, you have to go through
like 25 steps in your awful installer UI? This is really, really
disgusting.
pphaneuf's rule for flaming is that you should shut up and not complain unless you
at least know a better way to do it. And better still, you should have
done it better. Well, I have. Nitix's disk configuration stuff is fully
automatic and pretty much foolproof. Sometimes it can be a little tricky to
wipe out a disk that already has stuff on it. But that was on purpose.
Other than that, it's just plain easy, and your disk gets configured with
RAID and LVM layers included even if there's only one disk, because that
makes things a heck of a lot more consistent, let alone making it really
easy to add a RAID later.
Anyway, I got through the painful disk installer UI after a few tries (you
have to reboot if you get it wrong! Genius!), and installed my system, and
rebooted, and all seemed okay, for a while.
That's when I installed Firefox 3 and my system suddenly became prone to
huge disk-grinding delays. Every time a program started writing to the disk,
Firefox would freeze, sometimes for many seconds at once. And Firefox 3,
you see, also writes to disk, so this was no rare occurrence.
You see, Debian had installed my new ext3 filesystem using the kernel's
default data=ordered option. data=ordered is one of those things that I'm
shocked was allowed into the kernel at all, let alone made the
default. Basically, it means that all relevant data will always be flushed
to disk before its journal (metadata) entries can be flushed, so your files
will never contain data that was just leftover if your computer crashes
between metadata and data updates. Sounds great, right? Well yes, until
you think about how you actually have to implement that. The journal is
always flushed sequentially, so if someone calls fsync() on a file, we try
to flush its data first, then all the metadata changes that were made on
disk up to and including this file's metadata. But that metadata
leading up to this change, of course, can't be flushed until we include
all the data related to all that metadata. This is a long story, but the
upshot is that an fsync() of a 4k file turns into more or less a sync() of
your entire disk. Result? Firefox 3, which fsync()s like 300 times a
minute for no good reason, grinds to a halt. And your system performance is
craptastic because basically your disk's write cache is gone.
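If you want to see the effect for yourself, here is a minimal sketch of a probe that writes one 4k block and times the fsync() that follows. The function name time_small_fsync and the idea of measuring it this way are mine, not anything from the Bugzilla discussion, so treat it as an illustration rather than a benchmark.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: write one 4 KiB block to `path` and return how many
 * seconds the following fsync() took, or -1.0 on any error. */
double time_small_fsync(const char *path)
{
    char block[4096];
    struct timespec start, end;
    double elapsed = -1.0;

    memset(block, 'x', sizeof block);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1.0;

    if (write(fd, block, sizeof block) == (ssize_t)sizeof block) {
        clock_gettime(CLOCK_MONOTONIC, &start);
        int rc = fsync(fd);   /* the call whose cost we care about */
        clock_gettime(CLOCK_MONOTONIC, &end);
        if (rc == 0)
            elapsed = (double)(end.tv_sec - start.tv_sec)
                    + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    }

    close(fd);
    return elapsed;
}
```

Call it once on an idle system and once while something else is writing a big file to the same filesystem. If the explanation above is right, the second number should be much larger on a data=ordered ext3 mount, but I'm not promising any particular figures.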
After reading the mostly wrongheaded discussion in the Bugzilla case, I gave up
on the Firefox guys actually
fixing the problem anytime soon. fsync() slightly less often?? That's a
fix?? What part of "every time I fsync() it syncs ALL outstanding
transactions immediately to disk" do you not understand?
So, I made a shared library called libnofsync.so that just makes fsync() do
nothing, and used LD_PRELOAD to load that into Firefox. My bookmarks are
not a bloody ACID database! Nobody cares if they get corrupted! I only
ever visit like three sites, and two of them are GMail! Get over it. With
this hack, I suppose my Linux Firefox 3 is probably the fastest one around,
because data=ordered or not, everybody else is doing multiple synchronous
disk writes every time they try to load a page. Good grief.
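For the curious, a library like that can be tiny. What follows is a reconstruction of the idea, not the original source: the file name nofsync.c and the build line are assumptions, and stubbing out fdatasync() as well is an extra I added on the guess that some callers use it instead of fsync().

```c
/* nofsync.c: a sketch of what a libnofsync.so could look like.
 * Assumed build:  gcc -shared -fPIC -o libnofsync.so nofsync.c
 * Assumed use:    LD_PRELOAD=/path/to/libnofsync.so firefox
 */
#include <unistd.h>

/* Pretend the flush succeeded without touching the disk. */
int fsync(int fd)
{
    (void)fd;
    return 0;
}

/* Not mentioned in the post: stubbed on the assumption that a program
 * might call fdatasync() rather than fsync(). */
int fdatasync(int fd)
{
    (void)fd;
    return 0;
}
```

Because the dynamic linker resolves symbols from preloaded libraries first, the program ends up calling these stubs instead of the libc versions. Obviously, only do this to programs whose data you don't mind losing in a crash.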
The next task was to get rid of the completely obnoxious data=ordered
setting, which involved messing with my grub configuration file(s). Grub, if
you haven't heard me rant about it before, is a total complete pukefest. It
does absolutely nothing of value that lilo doesn't do, but it does it in a
way that's about 1000x more complicated. Then Debian layers another pile of
crud on top. The long and the short of this is that I COULD NOT FIND A WAY
to add "rootflags=data=writeback" to my kernel command line in any sort of
permanent fashion. Now, you can edit the kernel command line during boot,
and there's an option right there that says "savedefault" that sounds
promising but does absolutely nothing. And (at least) one of the 1000 layers
of garbage shoveled on top of grub overwrites and/or ignores my config file
changes no matter how I try to make them.
So I uninstalled grub and installed lilo, after which things were trivially
easy because lilo was not written by morons.
Around this time I tried out a 2.6.25 kernel from backports.org, which
happily crashed my system repeatedly. What the heck was I thinking? The
kernel hasn't actually been missing any significant features for something
like five years. I don't know why I upgraded it, but I sure won't make
that mistake again. Anyway, the crashes served mostly to teach me
that if you crash your system while the RAID is rebuilding, it might not
come back all by itself. Instead, it silently kicks the second disk from
the RAID and leaves it idle, without actually telling anyone about this
INCREDIBLY SERIOUS RELIABILITY LOSS. But that's par for the course
by now. I added it back into my array and off I went.
Fast forward to a few days later; it took some suffering before I bothered
to fix the Firefox nonsense and got sick enough of the random kernel crashes
to downgrade my kernel back to Debian's stable version.(1)
That was about when I noticed that my new, fancy-pants RAID was rebuilding
at 80 MB/sec, which is fabulously quick, and yet had still not
finished rebuilding, after being left uninterrupted for more than a
day. What? Well, let's "watch cat /proc/mdstat". Hey, check it out! It's
almost done! 98%... 99%... 100%... 100%... 100%... 0%? Hey! What the heck
is this! Okay, look at dmesg. Aha, it had a bad sector right at the very
end of the disk!
And so it decided to start rebuilding the RAID from scratch!
About once an hour since I installed it at the beginning of the
week!(2)
HA HA HA HA!
Now, rebuilding from scratch wouldn't actually solve the problem even if I
did have bad sectors. But of course, I don't really have any bad sectors.
What I do have is some partitions that apparently Debian's installer screwed
up while creating, so they happily run right past the end of the disk.
Now okay, Debian's partitioning thingy has an excuse for being buggy;
probably nobody ever uses it to make a RAID, because it's sure the heck not
easy to do. But here's the thing: mdadm let me create a RAID on a partition
with inaccessible sectors. Then mke2fs let me create a filesystem on that
broken RAID. Did it not occur to anyone to sample a few of the sectors
before you decided to actually use them? Didn't any of you ever wonder why
Windows does that weird thing about "testing sector accessibility" whenever
you make a partition? No, apparently not. Gargle.
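The kind of sanity check I mean doesn't have to be fancy. Here is a minimal sketch, assuming 512-byte sectors and a caller who already knows how big the device claims to be; last_sector_readable is a name I made up, and neither mdadm nor mke2fs contains anything like it as far as I know.

```c
#define _POSIX_C_SOURCE 200809L
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 512   /* assumption: classic 512-byte sectors */

/* Hypothetical check: return 1 if the final sector implied by
 * claimed_bytes can be read in full from `path`, 0 otherwise. */
int last_sector_readable(const char *path, off_t claimed_bytes)
{
    char buf[SECTOR_SIZE];
    int ok;

    if (claimed_bytes < SECTOR_SIZE)
        return 0;

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    /* A short read or an error here means the tail of the device
     * is not really there. */
    ssize_t n = pread(fd, buf, SECTOR_SIZE, claimed_bytes - SECTOR_SIZE);
    ok = (n == SECTOR_SIZE);

    close(fd);
    return ok;
}
```

Pointed at a partition whose table entry runs past the end of the disk, a read like this should fail instead of letting you build a RAID and a filesystem on top of sectors that don't exist. A real tool would sample more than one sector, but even this much would have saved me a week.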
Okay, so obviously I need to make my partition a little smaller. Apparently
it's possible to resize ext3 partitions and RAID devices now. That's good
news, right? Well sure! Let's try that!
So I switched down to single user mode, remounted my rootfs read only, and
ran "ext2resize /dev/md0". It told me a magic number, which is the number
of sectors it's currently using. I reduced that number by an overly large
factor (hey, I've got whole gigabytes of data to waste here!) and ran
"ext2resize -v /dev/md0 NUMBER". It ground away for a while, giving me
impressive yet scary messages about how it was moving inodes around, and so
on. I figured it couldn't really do anything too harmful, since
obviously the space at the end of the disk was nowhere near any of my actual
data.
Boy, was I ever wrong.
I foolishly then ran "ext2resize /dev/md0" again to see if it would print
out the new size. Except, it seems, that's not what it does. What it does
is try to resize the partition again, this time to the maximum size. The
maximum size is, as you recall, a size that involves some nonexistent
sectors at the end of my partition.
So it moved a few more inodes around and then errored out. Ironically,
ext2resize does apparently access that area at the end of the disk,
even if mke2fs doesn't. Sadly however, it moves a bunch of crap around
before erroring out and aborting midstream.
You might have imagined, as I might have, once, that ext2resize would maybe
do a "hypothetical resize" operation, going all the way through the disk and
confirming that everything would work - like the last sector it was about to
resize into, for example - before it actually starts moving crap around. Or
you might have imagined that it would undo those changes before it aborts
due to an unexpected error. But if you thought that, then you, like me,
would have been completely wrong. ext2resize does no such thing.
Instead, it moves a bit of data around, and then aborts when it gets
confused, halfway through the process.
As I learned, this makes your filesystem completely unusable. It turns out
that, for no reason I can possibly imagine, moving around the completely
empty unused section at the very end of my disk also involves rewriting
inode 2 (which turns out to be the root directory), as well as an impressive
number of other I-woulda-thought unrelated files and inodes. Of course,
when you do this wrong, your filesystem stops mounting.
Time to boot the rescue disk one more time, pray a little, run e2fsck, and
pray a little more.
e2fsck was not impressed. It correctly noted that inode 2 was, if I recall,
"conflicted." Also lots of other horrible things. I asked it to repair
everything. It crashed. Well, of course, it didn't crash exactly.
It printed messages about "programmer error??" and then restarted the game
all over again. I actually went through the game a few times before I
finally caught on to the fact that it was the same every single time.
Luckily, when I bravely answered "no" to all the questions about whether it
should fix things, it finally fixed things(3), complained that Oh
God Your Filesystem is Still Broken Though, and exited. But whatever, my
filesystem finally mounted again. I had to recover all my root-level
folders from their new homes in /lost+found, but oh well, at least I still
had my data.
And so I rebooted, and my RAID promptly started rebuilding itself again, and
my filesystem was still corrupt, and it couldn't actually be fsck'd because
e2fsck would still go into an infinite loop if I tried. Back to square
one.
And now, this is the part where I flame myself, because I forgot something I
already learned a long time ago and then sold to thousands of people:
Why the heck are you using a RAID in the first place, when you only have two
disks, idiot? If you only have two disks, just back up your files to the
second disk occasionally using rsync or something. That way, when you
ext2resize or just delete a file by accident, the other disk doesn't reflect
your idiotic mistakes until a while later. Remember?
So there you have it. All those things were dumb, but I was the dumbest of
them all.
I dropped the second disk out of my RAID, repartitioned it correctly, ran
mke2fs, copied all the files to the second disk, and booted from that.
Done.
Actual Nice Things to Say
Lest it appear that I only hate things:
(1) Yes, Debian's stable 2.6.18 kernel actually works, and thus
there's actually a reason they don't carry the highest-numbered one. Good
job, guys.
(2) The guys who implemented the CFQ disk scheduler are my
heroes. Unlike in the 2.4 kernel, where rebuilding your RAID hugely
degraded your system performance, in 2.6 this operation happens at "idle"
disk priority so it can go at pretty much full speed and yet have
zero impact on your disk performance. That's why I didn't even
notice for days that the rebuild was going on. Related tool: ionice -c3. Use it for your
background compiles and stuff. It's awesome.
(3) You know what? I'm a big fan of e2fsck, even though it's
supremely un-user-friendly and obviously had some bugs here. But despite
those bugs, it didn't abort when it thought it had a "programmer error," and
it saved my data from what I was sure by then was certain death. No
program is perfect, but at least this one was written by sane people.
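A footnote to note (2): a program can also put itself in the idle I/O class, which is roughly what ionice -c3 does for you from the outside. This is a sketch under assumptions: the constants are copied from the kernel's ioprio interface as documented in ioprio_set(2), glibc has no wrapper so it goes through syscall(), and the helper names are mine.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Values from the kernel's ioprio interface; see ioprio_set(2). */
#define IOPRIO_CLASS_SHIFT 13
#define IOPRIO_CLASS_IDLE  3
#define IOPRIO_WHO_PROCESS 1

/* Pack an I/O class and a priority level into one ioprio value. */
int ioprio_value(int io_class, int level)
{
    return (io_class << IOPRIO_CLASS_SHIFT) | level;
}

/* Hypothetical helper: move the calling process to the idle I/O class.
 * Returns 0 on success, -1 on failure (errno is set by the syscall). */
int set_idle_io_priority(void)
{
    return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                        ioprio_value(IOPRIO_CLASS_IDLE, 0));
}
```

Call set_idle_io_priority() at the top of a batch job and its disk traffic should stay out of the way of everything else, at least on an I/O scheduler that honors these classes.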
August 2, 2008 02:44