Version control of really huge files

Version control of really huge files

So let's say you've got a database with a 100k rows of 1k bytes each. That comes to about 100 megs, which is a pretty small database by modern standards.

Now let's say you want to store the dumps of that database in a version control system of some sort. 100 megs is a pretty huge file by the standards of version control software. Even if you've only changed one row, some VCS programs will upload the entire new version to the server, then do the checksumming on the server side. (I'm not sure of the exact case with svn, but I'm sure it will re-upload the whole file if you check it into a newly-created branch or as a new file, even if some other branch already has a similar file.) Alternatively, git will be reasonably efficient on the wire, but only after it slurps up mind-boggling amounts of RAM trying to create a multi-level xdelta of various revisions of the file (and to do that, it needs to load multiple revisions into memory at once). It also needs you to have the complete history of all prior backups on the computer doing the upload, which is kind of silly.

Neither of those alternatives is really very good. What's a better system?

Well, rsync is a system that works pretty well for syncing small changes to giant files. It uses a rolling checksum to figure out which chunks of the giant file need to be transferred, then sends only those chunks. Like magic, this works even if the sender doesn't have the old version of the file.

Unfortunately, rsync isn't really perfect for our purposes either. First of all, it isn't really a version control system. If you want to store multiple revisions of the file, you have to make multiple copies, which is wasteful, or xdelta them, which is tedious (and potentially slow to reassemble, and makes it hard to prune intermediate versions), or check them into git, which will still melt down because your files are too big. Plus rsync really can't handle file renames properly - at all.

Okay, what about another idea: let's split the file into chunks, and check each of those blocks into git separately. Then git's delta compression won't have too much to chew on at a time, and we only have to send modified blocks...

Yes! Now we're getting somewhere. Just one catch: what happens if some bytes get inserted or removed in the middle of a file? Remember, this is a database dump: it's plaintext. If you're splitting the file into equal-sized chunks, every chunk boundary after the changed data will be different, so every chunk will have changed.

This sounds similar to the rsync+gzip problem. rsync really sucks by default on .tar.gz files, because if a single byte changes, every compressed byte after that will be different. To solve this problem, they introduced gzip --rsyncable, which uses a clever algorithm to "resync" the gzip bytestream every so often. And it works! tar.gz files compressed with --rsyncable change only a little if the uncompressed data changes only a little, so rsync goes fast. But how do they do it?

Here's how it works: gzip keeps a rolling checksum of the last, say, 32 bytes of the input file. (I haven't actually looked at what window size gzip uses.) If the last n bits of that checksum are all 0, which happens, on average, every 2^n bytes or so, then toss out the gzip dictionary and restart the compression as if that were the beginning of the file. Using this method, a chunk ends every time we see a conforming 32-byte sequence, no matter what bytes came before it.

So here's my trick: instead of doing this algorithm in gzip, I just do it myself in a standalone program. Then I write each chunk to a file, and create an index file that simply lists the filenames of the required chunks (in order). Naturally, I name each chunk after its SHA1 hash, so we get deduplication for free. (If we create the same chunk twice, it'll have the same name, so it doesn't cost us any space.)

...and to be honest, I got a little lazy when it came to creating the chunks, so I just piped them straight to git hash-object --stdin -w, which stores and compresses the objects and prints out the resulting hash codes.

An extremely preliminary experimental proof-of-concept implentation of this file splitting algorithm is on github. It works! My implementation is horrendously slow, but it will be easy to speed up; I just wrote it as naively as possible while I was waiting for the laundry to finish.

Future Work

For our purposes at EQL Data, it would be extremely cool to have the chunking algorithm split based only on primary key text, not the rest of the row. We'd also name each file based on the first primary key in the file. That way, typical chunks will tend to have the same set of rows in them, and git's normal xdelta stuff (now dealing with a bunch of small files instead of one huge one) would be super-efficient.

It would also be entertaining to add this sort of chunking directly into git, so that it could handle huge files without barfing. That would require some changes to the git object store and maybe the protocol, though, so it's not to be taken lightly.

And while we're dreaming, this technique would also be hugely beneficial to a distributed filesystem that only wants to download some revisions, rather than all of them. git's current delta compression works great if you always want the complete history, but that's not so fantastic if your full history is a terabyte and one commit is 100 GB. A distributed filesystem is going to have to be able to handle sparse histories, and this chunking could help.

Prior Art

I came up with this scheme myself, obviously heavily influenced by git and rsync. Naturally, once I knew the right keywords to search for, it turned out that the same chunking algorithm has already been done: A Low-Bandwidth Network Filesystem. (The filesystem itself looks like a bit of a dead end. But they chunk the files the same way I did and save themselves a lot of bandwidth by doing so.)

2009-10-04 »