Git is the next Unix
When I first heard about git, I was suspicious that there could be anything
special about it, but after watching Linus' talk about it, I was... even
more suspicious. I tried it anyway.
When I tried it, I realized something right away: what made git awesome was
actually none of the things Linus had talked about, not really. Those
things were more like... symptoms of the underlying awesomeness. Yes, git
is fast. Yes, it is distributed. Yes, it is definitely not CVS. Those
things are all great, but they miss the point.
What actually matters is that git is a totally new way to operate on
data. It changes the game. git has been described as "concept-heavy",
because it does so many things so differently from everything else. After
some reflection, I realized that this is far truer than I could see at
first. git's concepts are not only unusual, they're revolutionary.
Come on, revolutionary? It's just a version control system!
Actually it's not. Git was originally not a version control
system; it was designed to be the infrastructure so that someone else
could build one on top. And they did; nowadays there are more than 100
git-* commands installed along with git. It's scary and confusing and
weird, but what that means is git is a platform. It's a new set of
nouns and verbs that we never had before. Having new nouns and verbs means
we can invent entirely new things that we previously couldn't do.
Git is a new kind of filesystem, and it's faster than any filesystem I've
ever seen: git checkout is faster than cp -a. It even has fsck.
Git stores revision history, and it stores it in less space than any system
I've ever seen or heard of. Often, in less space than the original objects
themselves!
Git uses rsync-style hash authentication on everything, as well as a
new "tree of hashes" arrangement I haven't seen before, to enforce security
and reliability in amazing ways that make the idea of "guaranteed identical
every time" not something to strive for, but something that's always
irrevocably built in.
Git names everything using globally unique identifiers that nobody else will
ever accidentally use, so that being distributed is suddenly trivial.
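
To make the hashing and naming ideas concrete, here is a minimal Python sketch. The blob_id function follows git's actual rule for naming file contents: SHA-1 over a "blob <size>" header, a NUL byte, and the content. The tree_id function is only a simplified stand-in for the "tree of hashes"; its name and its text listing layout are assumptions made for illustration, not git's real binary tree format.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the ID git gives to file contents (a "blob" object)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def tree_id(directory: dict) -> str:
    """Hash a directory by hashing the IDs of everything inside it.

    `directory` maps names to bytes (a file) or to a nested dict (a
    subdirectory). Simplified sketch: NOT byte-compatible with git trees.
    """
    lines = []
    for name in sorted(directory):  # sorted, so insertion order never matters
        entry = directory[name]
        if isinstance(entry, dict):
            lines.append(f"tree {tree_id(entry)} {name}")
        else:
            lines.append(f"blob {blob_id(entry)} {name}")
    return hashlib.sha1("\n".join(lines).encode()).hexdigest()


if __name__ == "__main__":
    project = {
        "README": b"hi\n",
        "src": {"main.c": b"int main(void) { return 0; }\n"},
    }
    print(blob_id(b""))      # the well-known ID of the empty blob
    print(tree_id(project))  # one ID that covers every file below it
```

Because an ID depends only on content, two people who have never talked to each other get the same name for the same file, and changing any file anywhere in a directory changes the top-level ID.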
Git is actually the missing link that has prevented me from building the
things I've wanted to build in the past.
I wanted to build a distributed filesystem, but it was too much work. Now
it's basically been done... in userspace, cross-platform.
At NITI we built a file backup system using what was a pretty clever data
structure to speed up file accesses. But we never got around to implementing
sub-file deltas, because we couldn't figure out a structure that would do it
both quickly and space-efficiently. Git's developers did. To build your own
backup system that's much better than ours, just store your files in git
instead.
On top of our backup system we made a protocol for synchronizing changes
up to a remote repository. Our protocol was sort of okay; git's is much
better, and it will surely improve a lot in the months ahead. (Currently
git requires you to sync *everything* if you want to sync *anything*, but
that's an implementation restriction, not a design or protocol restriction.
See shallow clones for just the beginning of this.)
Someone else I know built a hash-indexed backup system to
efficiently store incremental backups from a large number of systems on a
single set of disks. Git does the same, only even better, and supports
sub-file deltas too.
We made a diskless workstation platform called Expression Desktop (now very
dead). Knowing disks were cheap and getting cheaper,
we wanted to make it "diskful" eventually, automatically syncing itself from
a central server... but able to guarantee that it matched the server's files
exactly. We couldn't find a protocol to do it. git is that protocol.
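
The "match the server exactly" guarantee can be sketched in a few lines of Python. This is not git's wire protocol: object_id, snapshot_id, and sync are hypothetical helpers, and plain dictionaries stand in for the server's and the workstation's files. The point is only that comparing hashes tells you what to copy, and that one final hash comparison confirms the two sides are identical (barring hash collisions).

```python
import hashlib


def object_id(content: bytes) -> str:
    """Hypothetical helper: name a file by the SHA-1 of its content."""
    return hashlib.sha1(content).hexdigest()


def snapshot_id(files: dict) -> str:
    """One ID for a whole file set: hash of the sorted (ID, path) listing."""
    listing = "\n".join(
        f"{object_id(data)} {path}" for path, data in sorted(files.items())
    )
    return hashlib.sha1(listing.encode()).hexdigest()


def sync(server: dict, client: dict) -> list:
    """Make `client` match `server`, touching only paths that differ.

    Returns the sorted list of paths that were copied or deleted.
    """
    changed = []
    for path, data in server.items():
        # Copy a file only if it is missing or its ID differs.
        if path not in client or object_id(client[path]) != object_id(data):
            client[path] = data
            changed.append(path)
    for path in list(client):
        # Remove anything the server does not have.
        if path not in server:
            del client[path]
            changed.append(path)
    return sorted(changed)


if __name__ == "__main__":
    server = {"etc/motd": b"welcome\n", "bin/tool": b"new build"}
    client = {"etc/motd": b"welcome\n", "bin/tool": b"old build", "tmp/junk": b"x"}
    print(sync(server, client))
    print(snapshot_id(client) == snapshot_id(server))
```

In a real deployment the workstation would send only its IDs over the network, not the file contents, which is what makes the comparison cheap.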
I built a system on top of Nitix, called Versabox, that let you install
a Linux system on top of a Nitix system without virtualization. I wanted a
way to make it easy to install software into that Linux environment, then
repackage the entire thing as an all-in-one installer kit, but have the
archive contain both the original package and the new content; that way you
could upgrade either part without touching the other. To do that I invented
a new file format and tool, called versatar.
It works, and we use it at my new company. But git would do it much better,
and it includes digital signatures for free, too.
Numerous people have written diff and merge systems for wikis; TWiki even
uses RCS. If they used git instead,
the repository would be tiny, and you could make a personal copy of the
entire wiki to take on the plane with you, then sync your changes back when
you're done.
When Unix pipes were invented, suddenly it was trivially easy to do
something that used to be really hard: connect the output of one program to
the input of the next. Pipes were the fundamental insight that shaped the
face of Unix. Programs didn't have to be monolithic.
With git, we've invented a new world where revision history, checksums, and
branches don't make your filesystem slower: they make it faster.
They don't make your data bigger: they make it smaller. They don't
risk your data integrity; they guarantee integrity. They don't
centralize your data in a big database; they distribute it peer to
peer.
Much like Unix itself, git's actual software doesn't matter; it's the file
format, the concepts, that change everything.
Whether they're called git or not, some amazing things will come of this.
February 1, 2008 01:33