|Oh well, at least it's
Everything here is
my personal opinion.
I do not speak
for my employer.
Version control of really huge files
So let's say you've got a database with a 100k rows of 1k bytes each. That comes to about 100 megs, which is a pretty small database by modern standards.
Now let's say you want to store the dumps of that database in a version control system of some sort. 100 megs is a pretty huge file by the standards of version control software. Even if you've only changed one row, some VCS programs will upload the entire new version to the server, then do the checksumming on the server side. (I'm not sure of the exact case with svn, but I'm sure it will re-upload the whole file if you check it into a newly-created branch or as a new file, even if some other branch already has a similar file.) Alternatively, git will be reasonably efficient on the wire, but only after it slurps up mind-boggling amounts of RAM trying to create a multi-level xdelta of various revisions of the file (and to do that, it needs to load multiple revisions into memory at once). It also needs you to have the complete history of all prior backups on the computer doing the upload, which is kind of silly.
Neither of those alternatives is really very good. What's a better system?
Well, rsync is a system that works pretty well for syncing small changes to giant files. It uses a rolling checksum to figure out which chunks of the giant file need to be transferred, then sends only those chunks. Like magic, this works even if the sender doesn't have the old version of the file.
Unfortunately, rsync isn't really perfect for our purposes either. First of all, it isn't really a version control system. If you want to store multiple revisions of the file, you have to make multiple copies, which is wasteful, or xdelta them, which is tedious (and potentially slow to reassemble, and makes it hard to prune intermediate versions), or check them into git, which will still melt down because your files are too big. Plus rsync really can't handle file renames properly - at all.
Okay, what about another idea: let's split the file into chunks, and check each of those blocks into git separately. Then git's delta compression won't have too much to chew on at a time, and we only have to send modified blocks...
Yes! Now we're getting somewhere. Just one catch: what happens if some bytes get inserted or removed in the middle of a file? Remember, this is a database dump: it's plaintext. If you're splitting the file into equal-sized chunks, every chunk boundary after the changed data will be different, so every chunk will have changed.
This sounds similar to the rsync+gzip problem. rsync really sucks by default on .tar.gz files, because if a single byte changes, every compressed byte after that will be different. To solve this problem, they introduced gzip --rsyncable, which uses a clever algorithm to "resync" the gzip bytestream every so often. And it works! tar.gz files compressed with --rsyncable change only a little if the uncompressed data changes only a little, so rsync goes fast. But how do they do it?
Here's how it works: gzip keeps a rolling checksum of the last, say, 32 bytes of the input file. (I haven't actually looked at what window size gzip uses.) If the last n bits of that checksum are all 0, which happens, on average, every 2^n bytes or so, then toss out the gzip dictionary and restart the compression as if that were the beginning of the file. Using this method, a chunk ends every time we see a conforming 32-byte sequence, no matter what bytes came before it.
So here's my trick: instead of doing this algorithm in gzip, I just do it myself in a standalone program. Then I write each chunk to a file, and create an index file that simply lists the filenames of the required chunks (in order). Naturally, I name each chunk after its SHA1 hash, so we get deduplication for free. (If we create the same chunk twice, it'll have the same name, so it doesn't cost us any space.)
...and to be honest, I got a little lazy when it came to creating the chunks, so I just piped them straight to git hash-object --stdin -w, which stores and compresses the objects and prints out the resulting hash codes.
An extremely preliminary experimental proof-of-concept implentation of this file splitting algorithm is on github. It works! My implementation is horrendously slow, but it will be easy to speed up; I just wrote it as naively as possible while I was waiting for the laundry to finish.
For our purposes at EQL Data, it would be extremely cool to have the chunking algorithm split based only on primary key text, not the rest of the row. We'd also name each file based on the first primary key in the file. That way, typical chunks will tend to have the same set of rows in them, and git's normal xdelta stuff (now dealing with a bunch of small files instead of one huge one) would be super-efficient.
It would also be entertaining to add this sort of chunking directly into git, so that it could handle huge files without barfing. That would require some changes to the git object store and maybe the protocol, though, so it's not to be taken lightly.
And while we're dreaming, this technique would also be hugely beneficial to a distributed filesystem that only wants to download some revisions, rather than all of them. git's current delta compression works great if you always want the complete history, but that's not so fantastic if your full history is a terabyte and one commit is 100 GB. A distributed filesystem is going to have to be able to handle sparse histories, and this chunking could help.
I came up with this scheme myself, obviously heavily influenced by git and
rsync. Naturally, once I knew the right keywords to search for, it turned
out that the same chunking algorithm has already been done: A
Low-Bandwidth Network Filesystem. (The filesystem itself looks like a
bit of a dead end. But they chunk the files the same way I did and save
themselves a lot of bandwidth by doing so.)
-- Wired, The Curse of Xanadu2
Computers are all about control. 1950's sci-fi was all about the fear of artificial intelligence; about what would happen if every decision was all about logic. But the strangeness of computing is much deeper than that. In a computer, you can change the rules of the universe, just to see what happens. And those changes will reflect onto the outside world.
One of those rule changes is version control. Humans have an overwhelming tendency to forget things; mostly we remember just one version of a person, the last we saw of them, or at best a series of snapshots. Most things that happened to us, we forget altogether. In the short term, we remember a lot; in the long term, we remember less; in the last 10 seconds, we can often replay it verbatim in our heads.
Computers are different. If we want a computer to remember something, we have to tell it to remember. If we want it to forget something, we have to tell it to forget. Because we find that tedious, we define standard rules for remembering and forgetting. With the exception of a few projects, like Xanadu or the Plan 9 WORM filesystem, the standard rules are: remember the current version. Forget everything from before. Some programs, like wikis and banking systems, don't follow the standard rules. For each of those programs, someone wrote explicit code for what to remember and what to forget.
But the standard rules are on the verge of changing.
Cheap disks are now so unbelievably gigantic that most people can only do one of two things with them: pirate more full-frame video content than they will ever have the time to watch, or simply stop deleting stuff. Many people do both.
But another option is starting to emerge: storing old revisions. The technology is advancing fast, and for some sorts of files, systems like git can store their complete history in less than the space of the latest uncompressed data. People never used to think that was even possible; now it's typical.
For some sorts of files, the compression isn't good enough. For large files, you have to use tricks that haven't been finalized yet. And git cheats a little: it doesn't really store every revision. It only stores the revisions you tell it to. For a programmer, that's easy, but for normal people, it's too hard. If you really saved every revision, you'd use even more space, and you'd never be able to find anything.
Back at NITI, we invented (and then patented) a backup system with a clever expiry algorithm based on the human mind: roughly speaking, it backs up constantly, but keeps more of the recent versions and throws away more of the older ones. So you have one revision every few minutes today and yesterday, but only one for the day before, and only one for last week, one for last month and the month before that, etc.1
As it happens, the backup system we invented wasn't as smart as git. It duplicated quite a lot of data, thus wasting lots of disk space, in order to make it easier to forget old versions. Git's "object pack" scheme is much more clever, but git has a problem: it only knows how to add new items to the history. It doesn't know how to forget.
But as with so many things about git, that's not entirely true.
Git forgets things frequently. In fact, even when git is forgetting things, it's cleverer than most programs. Git is the only program I've ever seen that uses on-disk garbage collection. Whenever it generates a temporary object, it just writes it to its object store. Then it creates trees of those objects, and writes the tree indexes to the object store. And then it links those trees into a sequence of commits, and stores them in the object store. And if you created a temporary object that doesn't end up in a commit? Then the object sticks around until the next git gc - garbage collection.
When I wrote my earlier article about version control for huge files, some people commented that this is great, but it's not really useful as a backup system, because you can't afford to keep every single revision. This is true. The ideal backup system features not just remembering, but forgetting.
Those tools are still too complicated for normal humans to operate. But someday, someone will write a git skiplist that indexes your commits in a way that lets you drop some out from the middle without breaking future merges. It's not that hard.
When git can handle large files, and git learns to forget, then it'll be time to revisit those standard rules of memory. What will we do then?
1 Actually it was rather more complicated than that, but that's the general idea. Apple's Time Machine, which came much later, seems to use almost exactly the same algorithm, so it might be a patent violation. But that's not my problem anymore, it's IBM's, and Apple and IBM surely have a patent cross-license deal by now.
2 By the way, I first read that Xanadu article a few years ago,
and it's completely captivating. You should read it. Just watch out: it's
Paul Buchheit on "Hacking"
Buchheit has produced a really good article that, at last, clearly describes the nature of "hacking."
I especially like how he handles the debate about hacking being a good thing ("clever hack") or a bad thing (eg. illegal breakins). Some people propose that we use two different words ("hacker" and "cracker"), but those never quite feel right. The essay explains why.
-- Paul Buchheit, Applied Philosophy, aka Hacking
October 14, 2009 17:37
Linux in a Nutshell, 6th Edition...
...has a new chapter about Git, courtesy of me.
Sorry for the late notice. I keep thinking of awesome stuff to write about, but not quite getting around to it because I'm too busy. Somehow it's the opposite of writer's block, but has the same net effect.
As computer books go, Linux in a Nutshell is surprisingly awesome. I've been using Linux pretty heavily since 1994, but I can still flip to a random page in this book and learn something new.
Unless I flip to the Git chapter, of course.
Bittorrent, negative latency, and feedback control theory
The basic idea was simple enough: the Linux kernel supports traffic shaping, which allows you to limit and control the amount of data you send/receive. Limiting the data you receive isn't all that useful, as it turns out, but limiting the send rate can be very useful.
If you transmit data at the maximum rate (say, 50k/sec), you'll end up filling your DSL modem's buffer, and then everything you transmit ends up having a multi-second delay, which results in horrendous latency.
If you transmit data at just slightly less than the maximum rate, say 49.9/sec, the buffer never fills up at all, and your latency is still the minimum. So it's not using your link that makes things unresponsive; it's overfilling the transmit buffer.
The problem: you don't actually know what your uplink rate is, so picking that 99% rate automatically isn't easy. That's why BitTorrent clients let you limit your uplink speed.
At NITI, we observed that latency creeps up right away when you exceed the maximum rate. So we ought to be able to detect that maximum rate by monitoring the latency and using that as feedback into a bandwidth limiter. Basically, a simple feedback control system.
This almost, but not quite, worked. It would in fact work great most of the time, but eventually it would always go into a crazy state in which it kept reducing the transmit rate without having any luck reducing the bandwidth... so it would reduce the transmit rate further out of desperation, and so on. The results made it basically unusable. Too bad. (We never had enough time to fully debug it... some other priority always got in the way.)
Moreover, it wasn't any use to you if you didn't have Nitix.
Anyway, all this is to say that the Bittorrent people have been thinking about the same problems lately, and have supposedly solved it as part of the uTorrent Transport Protocol (UTP). (There's also an IETF protocol called LEDBAT that seems to be related.)
Their approach is similar to what we were doing, but has a few changes that make it more likely to actually work.
First of all, they assume the "minimum achievable latency" is the lowest latency you've seen in the last 3 minutes. Rather than using averages, they observe that if the transmit buffer is always near-empty, then sooner or later you'll get a packet through without any buffer delay. The delay of that packet is the actual network latency; on top of that, anything extra is buffering delay.
Secondly, because they're coming up with a whole new protocol rather than throttling existing TCP sessions, they can add a timestamp to each packet. Basically, that means they can figure out the one-way latency without sending extra packets. Our system required sending out ping packets, which could only measure the full round-trip time (when really you need to measure each direction independently). They also know when they're transmitting at the maximum allowed rate and when they're mostly idle, so they can keep their statistics straight.
Furthermore, their approach focuses on the core of the problem: don't bother limiting overall upload throughput, just limit the rude part of the throughput. They've correctly noted that, almost always, when transmit buffers cause a problem, it's because of BitTorrent. Other than that, almost nobody uses much upload bandwidth at all. So they've limited their solution to only the BitTorrent protocol. That way they don't have to convince anyone else (router manufacturers, operating system kernels, etc) to support their standard.
Now, at last, BitTorrent can be polite. BitTorrent uploads are almost always the lowest-priority thing you could possibly be doing. So it's okay that it always loses out to the slightly-less-polite TCP. (Apparently TCP Vegas is a more polite version of TCP that would accomplish the same thing... if everybody used it. But it requires kernel support, and most kernels supposedly make you choose Vegas globally for all connections, not just for low-priority ones. Which you will never do, because it'll make your whole computer lower priority than everybody else's computers, and thus your personal Internet performance will suck.)
Negative latency and background transmissions
The ability to send data truly "in the background" without interfering with high-priority foreground communications is important. It allows you to implement what I call "negative latency" - transmission of data before anyone requests it.
Disks are getting bigger and bigger, and many computers spend a lot of time sitting idle on the Internet. During that time, they could be "pre-sending" data you might need later. If sending that data had no cost, then even if 99% of the data turned out to be useless, you'd still have a 1% improvement that is worthwhile. And personally, I think a much better than 1% success rate should be possible.
I'm looking forward to it.