Fair

and "balanced"
Everything here is my opinion. I do not speak for your employer.
March 2008
April 2008

2008-03-27 »

A tale of five merges, part 5: darcs and patch theory

This is not really an article about darcs. Actually, I've never used darcs. It's not even really about patch theory, because I'm too lazy to read through the actual article on patch theory in enough detail.

What this really is is a nice way to finish off my series of articles on version control and merging concepts, by discussing some features that a program like darcs might or might not have, because of some theory that patch theory might or might not provide. In other words, here are some good ideas of unknown origin that I hope git will steal. But I'll call our imaginary system that might have these features darcs, because it ought to have a name. My apologies to the actual darcs maintainers.

So far, we've talked about branch merging in CVS, SVN, and git, both with git-merge and git-rebase. We've been addressing each one in terms of four main questions: (A) how automatic is it, (B) what happens to the change history, (C) what happens when you merge back and forth multiple times between two branches, and (D) what happens when you "cherry pick" individual changes, usually bugfixes, from one branch to another.

What we find is that with all the systems discussed so far, most of these answers end up being some kind of compromise. None of the options presented so far has really been entirely what we want. So what would the ideal solution be like?

A note on versioned storage

Before we can talk about the world of perfect version merging, we have to talk about how those versions are stored in the first place. There are two basic philosophies about that: (a) store the files, or (b) store the patches.

Speaking very roughly and inaccurately, SVN and CVS store one version and then a bunch of patches against that version. git stores just stores the files.(1) Systems like darcs, on the other hand, prefer to talk about groups of patches, rather than the files themselves.

The results of these philosophical decisions are very interesting. For example, if you branch and merge a lot in SVN, your repository gets a lot bigger, because you store the patches from A to B, and from B to C, and from A to A', and from B to B', and from C to C', and so on. All those patches will be slightly different, even if the resulting files are mostly the same, so it's impossible to eliminate any redundancy (except trivially with something like gzip). git, on the other hand, stores just the files, and when files are the same (which is very frequently), it doesn't store them over again. That means that in git, branching is basically always free.

darcs, on the third hand, mostly stores each version as a big pile of patches, and it stores them in a particularly efficient way: if it is possible to use the patch from A to A' to convert from B to B', even if the patch application is a bit "fuzzy", then we just add that patch to the pile for B'. Similarly if it will work to convert C to C'. So in darcs, branches are about as cheap as in git.

Now, regardless of which storage method you choose, you can obviously convert quite easily from files to patches and back. Version control systems do this all the time; if you do "git show" or "git diff", it's converting from files to patches; if you do "svn checkout", it's converting from patches to files. The only difference is that certain operations go faster than others depending on your storage format.

Why git is so fast

For example, "git clone" (the equivalent of "svn checkout") goes much faster than checking out of any of the other systems, because no conversion is necessary. On the other hand, "git diff" might be faster or slower than a diff on one of the other systems; if you're asking about a patch that darcs has already stored, darcs will probably be faster, because git has to generate a patch. But if you're asking about the difference between two arbitrary sets of files, darcs will first have to figure out what those files look like, then generate the patch, which is two steps to git's one, so git will be faster.

And this is where the fun starts. Because of git's storage system, some operations are just plain amazingly fast: such as creating a commit. The secret to git's speed is that making a checkin just involves copying the changed files verbatim into the repository; it doesn't think about deltas, or patches, or anything. And checking out any version, or creating a branch, involves simply copying a file directly out of the repository. Even finding the differences between any two versions simply requires reading the index to see which files have different SHA-1 hashes, and doing a diff between them. And the "git merge" operation does exactly that: it figures out the differences from BEFORE to AFTER, and applies them to TARGET, and checks in the result. All of these operations are about as fast as they get.

Where git gets slow is exactly where darcs gets fast. In darcs, every merge is like a "batch cherry picking" operation in git. When there are no patch conflicts - and darcs' "patch theory" will tell you in advance if there's a conflict - then darcs can cherry pick a batch of patches in the same time it takes for git to make an empty commit. For git, on the other hand, every single cherry you pick takes the same time as a git-merge: checkout, diff, patch, and checkin.

git-rebase is just a big batch of cherry picking, and it's slow in git. It would be a non-event in darcs.

git-svn uses git-rebase extensively, because it pulls patches out of svn and then rebases all your work on top of it. In darcs, the patches would just go straight into the repository, no questions asked, and the conflicts could be resolved later.

But much more importantly than the speed, cherry picking patches (and thus "rebasing" branches) in darcs retains the history of those patches; after all, they're just the same patches in another place, and all there is is patches. Where history in SVN is a line, and history in git is a helix, history in darcs... well, we don't like to talk about "history" in darcs. We talk about... Patch Commutation and Inverses.(2) Erm, okay, maybe another time.

The major difference in the way darcs thinks about these things is in what happens during a merge. In git, we noted that merging two branches in either "direction" always results in the same merged result, ie. set of merged files. But in darcs, you don't talk about files, you talk about patches, and the merged result is not the same: the two merges have the same set of patches, but the patches are in a different order.

The order of the patches is the missing information that SVN's simple-minded merge history emphasizes, and git's merge history hides. Moreover, git-rebase is all about changing the order of patches, and since git doesn't really know anything about patches, that's exactly why it makes git-merge so angry.

How does darcs do back-and-forth merges between two branches? Well, that's easy: you simply look at the list of patches in the source branch, and the list of patches in the destination branch, and you add the patches to the destination that weren't already there.

How about reordering and recombining patches, like in git's interactive rebase? Well, if "patch theory" says there aren't any conflicts, then feel free to reorder them; darcs already knows it'll work, so it doesn't actually need to calculate anything. It finishes instantly, and doesn't mess up future merges to or from other branches.(3) That's way better than git-rebase.

A perfect world

At last, we come to the end of our series! Let's put it all together. What would an ideal system look like?

My suggestion would be to base the system on git - because it's fast for everyday operations like checkout, checkin, diff, and merge - but add some elements of patch theory. Most importantly, it's not okay for a git-cherry-pick or git-rebase operation to forget the old commits that the new commits were based on. They need to store a "patch parent" field that basically says, "this commit is what happens when you take the difference from BEFORE my patch parent to AFTER my patch parent, and apply it to my PARENT." Then git-merge will have some extra work to do sometimes to unwind the criss-crossing patches, but it's not the end of the world: after the next git-merge between a given two branches, we're entirely caught up and everything is easy again. It seems a small price to pay in order to have git-svn, git-rebase, and git-filter-branch all magically work properly with git-merge and git-pull.

So how does our imaginary new system stack up with respect to our four key questions?

Our newfangled merging system: (A) allows you to rearrange patches as easily as git-rebase, and yet still merge as easily as git-merge; (B) allows you to shuffle and clean up change history at will, while still tracing (via patch-parent) into the detailed original patches if you want; (C) easily handles back-and-forth merges as well as git-merge or darcs, and even if you've been using git-rebase; (D) easily supports cherry picking without disrupting future back-and-forth merges.

Yeah, yeah, I know what you're going to say. I should go write it myself. Well, maybe someday I will. But don't hold your breath. :)

Footnotes

(1) This is really true, honest. The existence of "git repack", which rewrites groups of files using xdelta encodings, is really an optimization. It's not how git actually thinks about things.

(2) That's probably why most people don't use darcs.

(3) There are actually huge theoretical problems with doing this, so you wouldn't actually implement it that way. Imagine you had a patch to add a function declaration to foo.h, and another one to add the function body in foo.c, and another one to call the function and include the header from blue.c. If you check in the changes in that order, you'll have a compiling program at each step. And these three patches are "independent" according to patch theory, since they all change different files. But if you reorder them and change blue.c before foo.h and foo.c, your program won't actually work. So arbitrarily reordering patches produces software versions that never actually existed and were never tested and thus might not work, which is of limited use. That's why "patch theory" as I've described is mostly "theory" and not "practice", even in darcs.

I'm CEO at Tailscale, where we make network problems disappear.

Why would you follow me on twitter? Use RSS.

apenwarr on gmail.com