Version control of really huge files
So let's say you've got a database with a 100k rows of 1k bytes each.
That comes to about 100 megs, which is a pretty small database by modern
standards.
Now let's say you want to store the dumps of that database in a version
control system of some sort. 100 megs is a pretty huge file by the
standards of version control software. Even if you've only changed one row,
some VCS programs will upload the entire new version to the server, then do
the checksumming on the server side. (I'm not sure of the exact case with
svn, but I'm sure it will re-upload the whole file if you check it
into a newly-created branch or as a new file, even if some other branch
already has a similar file.) Alternatively, git will be reasonably efficient
on the wire, but only after it slurps up mind-boggling amounts of RAM trying
to create a multi-level xdelta of various revisions of the file (and to do
that, it needs to load multiple revisions into memory at once). It also
needs you to have the complete history of all prior backups on the computer
doing the upload, which is kind of silly.
Neither of those alternatives is really very good. What's a better system?
Well, rsync is a
system that works pretty well for syncing small changes to giant files. It
uses a rolling
checksum to figure out which chunks of the giant file need to be
transferred, then sends only those chunks. Like magic, this works even if
the sender doesn't have the old version of the file.
Unfortunately, rsync isn't really perfect for our purposes either. First of
all, it isn't really a version control system. If you want to store
multiple revisions of the file, you have to make multiple copies, which is
wasteful, or xdelta them, which is tedious (and potentially slow to
reassemble, and makes it hard to prune intermediate versions), or check them
into git, which will still melt down because your files are too big. Plus
rsync really can't handle file renames properly - at all.
Okay, what about another idea: let's split the file into chunks, and check
each of those blocks into git separately. Then git's delta compression
won't have too much to chew on at a time, and we only have to send modified
blocks...
Yes! Now we're getting somewhere. Just one catch: what happens if some
bytes get inserted or removed in the middle of a file? Remember, this is a
database dump: it's plaintext. If you're splitting the file into
equal-sized chunks, every chunk boundary after the changed data will be
different, so every chunk will have changed.
This sounds similar to the rsync+gzip problem. rsync really sucks by
default on .tar.gz files, because if a single byte changes, every compressed
byte after that will be different. To solve this problem, they introduced
gzip
--rsyncable, which uses a clever algorithm to "resync" the gzip
bytestream every so often. And it works! tar.gz files compressed with
--rsyncable change only a little if the uncompressed data changes only a
little, so rsync goes fast. But how do they do it?
Here's how it works: gzip keeps a rolling checksum of the last, say, 32
bytes of the input file. (I haven't actually looked at what window size
gzip uses.) If the last n bits of that checksum are all 0,
which happens, on average, every 2^n bytes or so, then toss out the gzip
dictionary and restart the compression as if that were the beginning of the
file. Using this method, a chunk ends every time we see a conforming
32-byte sequence, no matter what bytes came before it.
So here's my trick: instead of doing this algorithm in gzip, I just do it
myself in a standalone program. Then I write each chunk to a file, and
create an index file that simply lists the filenames of the required chunks
(in order). Naturally, I name each chunk after its SHA1 hash, so we get
deduplication for free. (If we create the same chunk twice, it'll have the
same name, so it doesn't cost us any space.)
...and to be honest, I got a little lazy when it came to creating the
chunks, so I just piped them straight to git hash-object --stdin -w,
which stores and compresses the objects and prints out the resulting hash
codes.
An extremely preliminary
experimental proof-of-concept implentation of this file splitting
algorithm is on github. It works! My implementation is horrendously
slow, but it will be easy to speed up; I just wrote it as naively as
possible while I was waiting for the laundry to finish.
Future Work
For our purposes at EQL Data, it would be
extremely cool to have the chunking algorithm split based only on primary
key text, not the rest of the row. We'd also name each file based on the
first primary key in the file. That way, typical chunks will tend to have
the same set of rows in them, and git's normal xdelta stuff (now dealing
with a bunch of small files instead of one huge one) would be
super-efficient.
It would also be entertaining to add this sort of chunking directly into
git, so that it could handle huge files without barfing. That would
require some changes to the git object store and maybe the protocol, though,
so it's not to be taken lightly.
And while we're dreaming, this technique would also be hugely beneficial to
a distributed filesystem that only wants to download some revisions, rather
than all of them. git's current delta compression works great if you always
want the complete history, but that's not so fantastic if your full history
is a terabyte and one commit is 100 GB. A distributed filesystem is going
to have to be able to handle sparse histories, and this chunking could help.
Prior Art
I came up with this scheme myself, obviously heavily influenced by git and
rsync. Naturally, once I knew the right keywords to search for, it turned
out that the same chunking algorithm has already been done: A
Low-Bandwidth Network Filesystem. (The filesystem itself looks like a
bit of a dead end. But they chunk the files the same way I did and save
themselves a lot of bandwidth by doing so.)
January 3, 2010 11:37