A quick review of btrfs

A quick review of btrfs

Yesterday I tried out btrfs, an experimental new Linux filesystem that seems to be based on many of the ideas of the more famous "zfs," which I gather is either not available on Linux or not properly ported or not licensed correctly or whatever. I really didn't do my zfs research; all I know is it's not in the 2.6.36 Linux kernel, but btrfs is, so btrfs wins.

Anyway, the particular problem I wanted to solve was for our startup, EQL Data. Basically we end up with some pretty big mdb files - a gigabyte or two, some days - and we want to be able to "branch" and "merge" them, which we do using git. Perhaps someday I'll write an article about that. Meanwhile, the branching and merging works fine, but what doesn't work fine is creating multiple copies so that different people can modify the different branches in different ways. It's not that bad, but copying a gigabyte file around can take a few seconds - more on a busy virtual server - and so if we want startup times to be reasonable, we've had to resort to pre-copying the files in advance and keeping them in a depot so that people can grab one, modify it, and send it back for processing. This works, but wastes a lot of background I/O (which we're starting to run out of) as well as disk space (which directly costs money).

So anyway, the primary thing that I wanted was the ability to make O(1) copy-on-write copies of large files. btrfs is advertised to do this. I won't keep you in suspense: it works! There's a simple btrfs command called "bcp" (btrfs cp), and it works the way you'd think. Only modified blocks take any extra disk space.

Our second problem relates to the way we actually run the databases. We have a complex mix of X, VNC, flash, wine, Microsoft Access¹, and some custom scripts. Each session runs in its own little sandbox, because as you might have heard, Windows apps like to think they own the world and scribble all over everything, especially the registry, which we have to undo after each session before we can start the next one.

We've tried various different cleanup techniques; the problem is that each wine tree consists of thousands of files, so even checking if a particular tree has been modified can be a pretty disk-grindy operation. With experimentation, we actually found out that it can be faster to just remove the whole tree and copy a new one, rather than checking whether it's up to date. This is because the original tree is usually in cache, but the old one often isn't, and reading stuff from disk is much slower than writing (because writeback caching lets you minimize disk seeks, while you can't really do this when reading). Nevertheless, it still grinds the disk,² and it also wastes lots of disk space, though not as much as you might expect, since we use hardlinks extensively for individual files. Too bad we can't hard link directories like Apple does in Time Machine.

Luckily, btrfs can come to the rescue again, with O(1) writable snapshots of entire subvolumes. This is slightly less convenient than using bcp on a single file. Snapshots don't really act much like hardlinks; they work by quietly creating an actual separate filesystem object and showing it as if it's mounted on a subdir of your original filesystem. (This doesn't show up in /proc/mounts, but if you stat() a file in the copy, it shows a new device id.) Presumably this is because otherwise, you'd end up with files that have the same device+inode pair, which would confuse programs, and you can't correctly renumber all the inodes without going from O(1) to O(n) (obviously). So all in all, it's clever, but it takes more steps to make it work. It still runs instantaneously once you figure it out, and the extra steps aren't that bad. It means we can now have the ability to clone new virtual environments instantly, and without taking any extra disk space. Nice!

Oddly, in my 2.6.36 kernel (plain btrfs as included in Linus's pristine kernel), I could create snapshots as a normal user (once I had set things properly with chmod), but I couldn't delete them as a normal user; I had to be root. This can't really be considered correct behaviour, so I might have to file a bug or join the mailing list or something. Later.

Other observations:

creating a filesystem was instantaneous; no boring writing of inode tables like in ext2/ext3.
mounting a filesystem was also instantaneous; no journal flushing or scanning.
creating large files seems to cause a strange burp after writing 128 megs or so; it stops for a rather long time, then resumes writing. Perhaps this has to do with allocating a new extent or something. I'm not sure yet whether this is normal or repeatable behaviour, and it probably won't affect my use case all that much.
despite the aforementioned burp, creating large files seems somewhat faster than on ext3 (my imagination?) and deleting them happens instantly, instead of after a lot of disk grinding with ext3. This seems to be the magic of the extent-based disk allocation again, I suppose.
adding and removing devices in the volume (I was using /dev/loop devices for testing) works great; removing is obviously slower than adding, since it has to move data to the remaining devices, but it seemed flawless. The filesystem stayed fully operational (and not noticeably slower) the whole time, which implies that they have their I/O priorities right.
Slicehost doesn't compile btrfs into their standard kernels, apparently; I'll have to see if I can make it work. If not, it's time to switch hosting providers. I did my testing on my own machine at home.
btrfs has big loud warnings everywhere about how unstable it is and how the on-disk format is likely to change. Since all our usage will be with short-lived session files, this shouldn't be a problem for us... but be careful if you get excited about using it. I'm not sure I'd trust it as my root fs right now, at least not if I plan to ever upgrade my kernel.

Short version: btrfs is very nicely done. Will it be the next major Linux filesystem? Well, I bet on reiserfs last time, and that didn't really work out, what with the bugs and the insanity and the murder. Nevertheless, I'm optimistic. At the very least, maybe it'll give some good ideas to the ext4 guys.

Next I'm going to start looking into some stress tests and figuring out if we can safely take it into production (again, in our controlled environment where loss of a filesystem and/or incompatible filesystem format changes aren't the end of the world).

Footnotes

¹ Yes, we have all the required licenses. Why does everyone assume that just because we're running wine, we must be slackers trying to dodge license agreements? We're using it because hosting on Linux is vastly more memory-efficient than Windows. Also I just like Linux better.

² Nowadays it grinds the disk less than it used to, because in the newer kernels, Slicehost has switched to using ext3's data=writeback instead of data=ordered. The tech support guy I talked to a few months ago said he had never heard of this, and no, there was no way to do it. I wonder if they changed their setting globally because of me?

2010-11-11 »