Stuffing the stuff

without getting stuffy
Everything here is my opinion. I do not speak for your employer.

2009-04-30 »

A new alternative to git submodules: git subtree

Since I first started using git, I've been annoyed by its support (or lack thereof) for "submodules" - the ability to embed the source code for a library, for example, into the source code for your project, and branch/merge them all as a group.

I do this all the time. As you might have guessed from reading this journal, I have worked on a huge number of different things, sometimes many of them in the same day. It turns out that a lot of these projects can benefit from sharing code with each other.

Now, git does have some rudimentary support for submodules through the git submodule command, but it's extremely complicated and error prone, particularly when it comes to switching branches and renaming directories. git checkout branchname is no longer enough to correctly change from one branch to another; git commit && git push is no longer enough to be sure all your changes have made it safely to a shared repository. Perhaps these problems will all be resolved someday, but it isn't looking like that day will be soon.

The problem is that git submodule is designed around a different workflow than the one I use. If you were building, for example, a "superproject" that contains tons of tools from other sources, and the source code of those tools was huge and rapidly changing, and the result was a repository 1GB in size or more, then it might make sense to slice and dice your repository and to update the sub-repositories as seldom as possible. Well, maybe - after all, your git history isn't very useful if you can't actually check out some of the old revisions - but apparently some people like it that way. I don't.

What I prefer is to have one big repository with all the stuff my application needs. Of course, I also need a bunch of additional small repositories, one for each subproject, so that all my other projects that depend on those subprojects can easily pull from them.

The first hint of a solution to this is what's called the "subtree merge" strategy in git. If you drop a copy of your library in the mylib/ folder in your project, then you can git pull -s subtree from your library in the future, and it'll "automagically" end up in that subdirectory where it belongs. So merging stuff into your superproject from a particular subproject, although it's a little-known feature, is basically a solved problem.
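To make that concrete, here's a rough sketch of a subtree merge done with plain git in throwaway repositories. The repository names (lib, app), the file names, and the demo identity are all made up for illustration, and I'm assuming a reasonably modern git (2.9 or newer) that knows --allow-unrelated-histories. The first import follows the usual merge -s ours / read-tree --prefix recipe; your real setup will differ.

```shell
#!/bin/sh
# Sketch only: pull a library into mylib/ using the subtree merge strategy.
# "lib", "app", and the file names are hypothetical demo names.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# A tiny stand-alone library repository.
git init -q lib
cd lib
echo 'v1' > lib.txt
git add lib.txt
git commit -q -m 'lib: first version'
cd ..

# The application ("superproject") repository.
git init -q app
cd app
echo 'app' > app.txt
git add app.txt
git commit -q -m 'app: first version'

# First import: record a merge with the library's history, but place the
# library's files under mylib/ instead of at the top level.
git fetch -q ../lib HEAD
git merge -q -s ours --no-commit --allow-unrelated-histories FETCH_HEAD
git read-tree --prefix=mylib/ -u FETCH_HEAD
git commit -q -m 'Merge lib into mylib/'

# The library moves on...
cd ../lib
echo 'v2' > lib.txt
git commit -q -a -m 'lib: second version'

# ...and a subtree pull brings the change into mylib/ in the application.
cd ../app
git pull -q --no-rebase --no-edit -s subtree ../lib HEAD
cat mylib/lib.txt
```

Nothing in the pull command names the mylib/ directory; the subtree strategy works out where the library lives by comparing trees.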

The real problem comes when you want to get those changes back out again. Say you make some changes to the library to support your project, and commit them into your project. (Most libraries I write are written at the same time as my applications, because I only develop features as I need them. YAGNI, after all.) Now you need to take those changes out of your application's project and put them in the primary library's project.

If you use git submodule, this is easy, because they were never really merged into your application project in the first place. You were forced to tediously maintain separate projects for each library all along. But if you use a subtree merge, you're in trouble! The two histories are all intermingled, and there's no way to extract one from the other.

...until now. My new git subtree command lets you easily split out the history of a subdirectory, and auto-join it with the original subproject's history so that you can easily push or pull just that piece into the subproject. I call this operation git subtree split, the companion to git merge -s subtree.

As an added convenience, there are add, merge, and pull subcommands that make the other common subtree operations a little easier to remember.
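Here's a minimal sketch of the round trip, assuming git-subtree is installed alongside your git. Again the repository names, file names, and the from-app branch are hypothetical. It imports a library with add, changes it from inside the application, splits the subdirectory history back out, and hands the result to the library with an ordinary git push.

```shell
#!/bin/sh
# Sketch only: git subtree add, then split changes back out to the library.
# "lib", "app", "from-app", and the file names are hypothetical demo names.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# The library, with one commit of its own.
git init -q lib
cd lib
echo 'v1' > lib.txt
git add lib.txt
git commit -q -m 'lib: first version'
lib_head=$(git rev-parse HEAD)
cd ..

# The application.
git init -q app
cd app
echo 'app' > app.txt
git add app.txt
git commit -q -m 'app: first version'

# Import the library, history and all, under mylib/.
git subtree add --prefix=mylib ../lib HEAD

# Change the library from inside the application, in an ordinary commit.
echo 'v2' > mylib/lib.txt
git commit -q -a -m 'mylib: second version'

# Extract just the mylib/ history. The command prints the id of the newest
# split commit, which builds on the library's own commits.
split=$(git subtree split --prefix=mylib)
git ls-tree --name-only "$split"

# Hand the new commit to the library on a separate branch, then fast-forward.
git push -q ../lib "$split:refs/heads/from-app"
cd ../lib
git merge -q --ff-only from-app
cat lib.txt
```

The fast-forward at the end only works because the split commit is a direct descendant of the library's existing history, which is the whole point of the exercise.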

Example

Here's an example you can try with the git.git repository. Once upon a time, the gitweb project was maintained separately from git in its own project. They merged it (commit 1130ef3) into git at one point (commit 0a8f4f0), from which time it was maintained as part of git. Now imagine we wanted to re-extract the gitweb changes from git.git back into the original gitweb project so it could be maintained separately again. Here's what we'd do (you can try this at home if you download git-subtree):

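    git clone git://git2.kernel.org/pub/scm/git/git.git
    cd git
    # Split out the gitweb/ history since the merge, on top of the old gitweb tip
    newtree=$(git subtree split --prefix=gitweb --annotate='(split) ' \
           0a8f4f0^.. --onto=1130ef3 --rejoin)
    # Give the result a name and take a look at it
    git branch latest_gitweb "$newtree"
    gitk latest_gitweb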

If gitweb had originally been merged using 'git subtree add' (or a previous split had been done with --rejoin specified), then you could have left out the weird "0a8f4f0^.." and "--onto=1130ef3" parameters. In fact, after the first split --rejoin, you can incrementally get new changes in the future without remembering any commit ids:

    git subtree split --prefix=gitweb --annotate='(split) ' --rejoin
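If you'd like to watch the incremental behaviour without cloning git.git, here's a small sketch in throwaway repositories. It assumes git-subtree is installed, and the repository and file names are invented for the demo. The library is brought in with git subtree add, so no commit ids are needed, and the second split picks up where the first one left off.

```shell
#!/bin/sh
# Sketch only: two incremental splits with --rejoin, no commit ids needed.
# "lib", "app", and the file names are hypothetical demo names.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

git init -q lib
cd lib
echo 'v1' > lib.txt
git add lib.txt
git commit -q -m 'lib: first version'
cd ..

git init -q app
cd app
echo 'app' > app.txt
git add app.txt
git commit -q -m 'app: first version'
git subtree add --prefix=mylib ../lib HEAD

# First round of changes, then split and record the join point in app.
echo 'v2' > mylib/lib.txt
git commit -q -a -m 'mylib: second version'
first=$(git subtree split --prefix=mylib --rejoin)

# Second round: the same command again, with nothing to remember.
echo 'v3' > mylib/lib.txt
git commit -q -a -m 'mylib: third version'
second=$(git subtree split --prefix=mylib --rejoin)

# The split history is linear: each split builds on the previous one.
git log --format='%h %s' "$second"
```

Each --rejoin leaves a merge commit in the application's history recording where the last split ended, which is how the next split knows where to start.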

Current Status

git-subtree is now working, and you can get it (standalone) from its repository on GitHub. It works fine with at least git 1.5.4 and newer, and possibly even earlier versions. I've been using it in some of my own projects. I've also submitted it to the git maintainers, so hopefully git-subtree (or something similar) will make it into future versions of git someday.

Please feel free to send me any questions/comments about git-subtree. It's my first major contribution to git (other than fixing some bugs in git-svn), and I'd like it to be extremely awesome, if possible.

Bonus

Note that, unlike git submodule, git subtree doesn't change the way people using your project need to work. As far as they're concerned, it's just one big project; nobody has to run (or install) git subtree unless they want to. It can just be the responsibility of a single person to extract the subproject history and upload it to the subproject repository, if you want.

I'm CEO at Tailscale, where we make network problems disappear.

apenwarr on gmail.com