Stuffing the stuff

without getting stuffy
Everything here is my personal opinion. I do not speak for my employer.

2009-10-06

Forgetting

    Thinking is based on selection and weeding out; remembering everything is strangely similar to forgetting everything. "Maybe most things that people do shouldn't be remembered," Jellinghaus says. "Maybe forgetting is good."

    -- Wired, The Curse of Xanadu [2]

Computers are all about control. 1950s sci-fi was all about the fear of artificial intelligence; about what would happen if every decision were based purely on logic. But the strangeness of computing is much deeper than that. In a computer, you can change the rules of the universe, just to see what happens. And those changes will reflect onto the outside world.

One of those rule changes is version control. Humans have an overwhelming tendency to forget things; mostly we remember just one version of a person, the last we saw of them, or at best a series of snapshots. Most things that happened to us, we forget altogether. In the short term, we remember a lot; in the long term, we remember less; and the last 10 seconds we can often replay verbatim in our heads.

Computers are different. If we want a computer to remember something, we have to tell it to remember. If we want it to forget something, we have to tell it to forget. Because we find that tedious, we define standard rules for remembering and forgetting. With the exception of a few projects, like Xanadu or the Plan 9 WORM filesystem, the standard rules are: remember the current version. Forget everything from before. Some programs, like wikis and banking systems, don't follow the standard rules. For each of those programs, someone wrote explicit code for what to remember and what to forget.

But the standard rules are on the verge of changing.

Cheap disks are now so unbelievably gigantic that most people can only do one of two things with them: pirate more full-frame video content than they will ever have the time to watch, or simply stop deleting stuff. Many people do both.

But another option is starting to emerge: storing old revisions. The technology is advancing fast, and for some sorts of files, systems like git can store their complete history in less than the space of the latest uncompressed data. People never used to think that was even possible; now it's typical.

For some sorts of files, the compression isn't good enough. For large files, you have to use tricks that haven't been finalized yet. And git cheats a little: it doesn't really store every revision. It only stores the revisions you tell it to. For a programmer, that's easy, but for normal people, it's too hard. If you really saved every revision, you'd use even more space, and you'd never be able to find anything.

Back at NITI, we invented (and then patented) a backup system with a clever expiry algorithm based on the human mind: roughly speaking, it backs up constantly, but keeps more of the recent versions and throws away more of the older ones. So you have one revision every few minutes for today and yesterday, but only one for the day before that, one for last week, one each for last month and the month before that, and so on. [1]
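To make the idea concrete, here is a minimal sketch of that kind of thinning policy in Python. It is not the patented algorithm (which, as the footnote says, was rather more complicated). The tier boundaries, the bucket widths, and the name select_backups_to_keep are all made up for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical tiers: (maximum age, bucket width). Within a tier, at most
# one backup survives per bucket. The numbers are illustrative only.
TIERS = [
    (timedelta(days=2), timedelta(minutes=5)),    # today and yesterday: dense
    (timedelta(days=7), timedelta(days=1)),       # past week: one per day
    (timedelta(days=30), timedelta(weeks=1)),     # past month: one per week
    (timedelta(days=365), timedelta(days=30)),    # past year: about one per month
]


def select_backups_to_keep(timestamps, now):
    """Return the backups worth remembering, newest first."""
    kept = []
    seen_buckets = set()
    for ts in sorted(timestamps, reverse=True):
        age = now - ts
        for tier_index, (max_age, width) in enumerate(TIERS):
            if age <= max_age:
                # Dividing two timedeltas with // yields an integer bucket number.
                bucket = (tier_index, age // width)
                if bucket not in seen_buckets:
                    seen_buckets.add(bucket)
                    kept.append(ts)
                break
        # Anything older than the last tier matches nothing and is forgotten.
    return kept
```

Because the list is walked newest first, the newest backup in each bucket is the one that survives. Everything not returned is a candidate for deletion, and how cheaply you can actually delete it is exactly where the storage format starts to matter.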

As it happens, the backup system we invented wasn't as smart as git. It duplicated quite a lot of data, thus wasting lots of disk space, in order to make it easier to forget old versions. Git's "object pack" scheme is much more clever, but git has a problem: it only knows how to add new items to the history. It doesn't know how to forget.

But as with so many things about git, that's not entirely true.

Git forgets things frequently. In fact, even when git is forgetting things, it's cleverer than most programs. Git is the only program I've ever seen that uses on-disk garbage collection. Whenever it generates a temporary object, it just writes it to its object store. Then it creates trees of those objects, and writes the tree indexes to the object store. And then it links those trees into a sequence of commits, and stores them in the object store. And if you created a temporary object that doesn't end up in a commit? Then the object sticks around until the next run of git gc, the garbage collector.
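If that sounds abstract, here is a toy mark-and-sweep collector over a content-addressed store, written in Python. It is only a sketch of the general technique: a dict stands in for the files on disk, the ObjectStore class and its methods are invented for this example, and real git has its own object formats plus extra safety nets that are ignored here.

```python
import hashlib


class ObjectStore:
    """Toy content-addressed store; a dict stands in for files on disk."""

    def __init__(self):
        self.objects = {}  # object id -> (payload, ids of referenced objects)
        self.refs = {}     # branch name -> commit id; these are the GC roots

    def put(self, payload, children=()):
        """Store an object and return its id, derived from its content."""
        children = tuple(children)
        data = payload + b"\0" + b"".join(c.encode() for c in children)
        oid = hashlib.sha1(data).hexdigest()
        self.objects[oid] = (payload, children)
        return oid

    def reachable(self):
        """Mark phase: walk from every ref and collect all ids we can reach."""
        seen = set()
        stack = list(self.refs.values())
        while stack:
            oid = stack.pop()
            if oid in seen:
                continue
            seen.add(oid)
            stack.extend(self.objects[oid][1])
        return seen

    def gc(self):
        """Sweep phase: delete whatever the mark phase never reached."""
        live = self.reachable()
        dead = [oid for oid in self.objects if oid not in live]
        for oid in dead:
            del self.objects[oid]
        return dead


store = ObjectStore()
blob = store.put(b"hello\n")
tree = store.put(b"tree", [blob])
commit = store.put(b"commit: first", [tree])
store.refs["master"] = commit
stray = store.put(b"temporary scratch data")  # never linked into a commit
removed = store.gc()                          # only the stray object is swept
```

The nice property is that writers never have to clean up after themselves. They just write objects, and anything that never gets linked to a ref is swept away later.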

When I wrote my earlier article about version control for huge files, some people commented that this is great, but it's not really useful as a backup system, because you can't afford to keep every single revision. This is true. The ideal backup system features not just remembering, but forgetting.

Git is actually capable of forgetting; there are tools like git subtree, for pulling out parts of the tree, and git filter-branch, for pulling out parts of your history.

Those tools are still too complicated for normal humans to operate. But someday, someone will write a git skiplist that indexes your commits in a way that lets you drop some out from the middle without breaking future merges. It's not that hard.

When git can handle large files, and git learns to forget, then it'll be time to revisit those standard rules of memory. What will we do then?

Footnotes

1 Actually it was rather more complicated than that, but that's the general idea. Apple's Time Machine, which came much later, seems to use almost exactly the same algorithm, so it might be a patent violation. But that's not my problem anymore, it's IBM's, and Apple and IBM surely have a patent cross-license deal by now.

2 By the way, I first read that Xanadu article a few years ago, and it's completely captivating. You should read it. Just watch out: it's long.

apenwarr-on-gmail.com