2018-11-06 »

redo, buildroot, and serializing parallel logs

Lately I've been getting back to hacking on my djb redo implementation. We've fixed some problems with file attribute handling on NFS, obscure locking, and MacOS/FreeBSD portability. If you haven't tried redo in a while, you might want to give it another shot.

In case you haven't heard of redo before, here's the overview: it's like make, but with no special syntax (just sh scripts). The first time you "do" a build, it runs a set of recursive sh scripts, once per target. Those scripts can run a command called redo-ifchange, which declares dependencies on the given targets, checks if they are up to date, and if not, recurses into more scripts in order to build them. And that's it!
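
For concreteness, here's roughly what a couple of .do scripts might look like (the filenames and compiler commands here are invented; in a .do script, $1 is the target name, $2 is the target minus its extension, and $3 is a temporary file where you write your output):

  # default.o.do: how to build any .o file from the matching .c file
  redo-ifchange "$2.c"
  cc -o "$3" -c "$2.c"

  # prog.do: how to link the final program from its objects
  redo-ifchange a.o b.o
  cc -o "$3" a.o b.o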

redo combines the best parts of imperative systems with the best parts of functional systems. The build scripts are all imperative - it just runs commands, and declaring dependencies happens as a side effect of some of those commands (redo-ifchange). When it's done, you have a purely-functional data structure that you can use for extremely fast dependency calculation. (Theoretically as fast as ninja, but my code isn't as optimized.)

[Credit: redo was invented by Daniel J. Bernstein. I merely implemented it.]

Parallelism

Things get a little more complex on modern multicore computers, where you almost always want parallel builds, which means producing different parts of the tree all at once, so in principle, a sequential-imperative tree of sh scripts is no longer the perfect model. Luckily, redo can handle it: if your script does redo-ifchange on more than one target at a time, it'll try to build all those in parallel. Then, if more than one parallel target tries to build a given dependency, it uses inter-process file locking to make sure the dependency only builds once.
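
In other words, a hypothetical all.do like the one below gets parallel builds for free when you run something like redo -j8 all (redo accepts a make-style -j flag):

  # all.do: just declare the targets we want.
  # redo may build all three in parallel, and if they share a
  # dependency, the locking described above ensures it builds only once.
  redo-ifchange prog1 prog2 prog3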

But what's this about serializing logs?

I'm sure you already have your favourite build system and it builds things, and it almost certainly handles parallelism for whatever your use case is. Even make does parallelism.

Where things tend to fall down is in rendering the output of a parallel build. When we're running a lot of jobs all at once, and blasting them all to stdout/stderr, and one step deep in the tree gets an error, then you might get pages and pages of successful output from other tasks interspersed with your error, making it very hard to figure out what went wrong and why.

There are various approaches to solving that. Some people would argue that the Unix Way is for programs that didn't fail to just print nothing at all; that's how the Go compiler works, for example. If you like that philosophy but you're using tools that don't agree (such as make itself, which prints all kinds of stuff while it works), you could wrap every command in a script that withholds its output, printing it only if the command returns a nonzero exit code.
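
Such a wrapper is small enough to sketch here (the quiet.sh name is made up):

  #!/bin/sh
  # quiet.sh CMD [ARGS...]: run CMD, showing its output only on failure.
  tmp=$(mktemp) || exit 1
  "$@" >"$tmp" 2>&1
  rc=$?
  [ "$rc" -ne 0 ] && cat "$tmp"
  rm -f "$tmp"
  exit "$rc"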

That's all nice until you have to debug what went wrong. It's not a coincidence that make, which is made by Unix people, does not follow the Unix Way. Makefiles are just too complicated and hard to debug if you can't see what they're doing; and if step 10 goes wrong, you might be very curious about step 9, even though (nominally) it worked. It's not okay to throw away the successful log messages from step 9.

Fine. Parallel make output is flawed and gross. But everyone knows make is flawed and gross, so they switch to other systems. Most other popular build systems are tool-specific. Someone did a lot of work in cmake, for example, to make it print pretty messages during parallel builds of C/C++ programs. That works well. But if you're not building C/C++ programs, it can't help.

redo is a general purpose dependency system like make, so by definition it's going to run scripts which produce a lot of clutter, possibly including instances of make itself, and someone is going to have to debug them. What can we do to sanitize the logs?

Digression from 2012: loglinear

I've actually been thinking about this problem for more than six years already. Back in 2012, I added a log sanitizer script called loglinear (please pause for a moment to admire the pun) to our project's buildroot. (buildroot is a really handy all-in-one tool for building embedded Linux systems from scratch.)

loglinear worked like this: every time we would have run a sub-make command like make path/to/project, we instead ran loglinear make path/to/project. loglinear then prefixed each line of the output with a job name, say path/to/project, and buffered it. When one of the loglinear processes exited, the top-level loglinear process took the buffer from that instance and dumped it to the top-level stdout.

Let's do an example. Imagine we're parallel building A, which depends on J, which depends on all of X, Y, and Z. We launch loglinear make A, which starts loglinear make J, which itself starts (all in parallel) loglinear make X, loglinear make Y, and loglinear make Z. J cannot continue until X, Y, and Z are done, but those three might finish in any order, and loglinear will print the output of each one as soon as it's done. So, the output will look something like this:

  Z: ...some stuff...
  Z: exited with code 0

  X: ...
  X: exited with code 0

  Y: ...
  Y: exited with code 0

  J: ...stuff...
  J: make X
  J: make Y
  J: make Z
  J: ...
  J: exited with code 0

  A: make J
  A: exited with code 0

loglinear also had some magic in case one of the processes returned nonzero: in that case, we'd print the successful processes first and the unsuccessful processes last, in the hope that the "most interesting" messages would end up at the bottom.

This made debugging a lot easier, because build messages from entire packages (like, say, the Linux kernel and busybox) were separated out instead of interspersed, but it had some flaws. Most importantly, the output was very bursty: it waited until a given job was completely done before it printed anything. When busybox finished, you saw all the busybox logs; when the kernel finished, you saw all the kernel logs. Although useful, this is, frankly, not as fun as watching your 16-core workstation live-blast several screenfuls of compiler log messages per second. It feels slow.

There was also a secondary problem, which is that the messages, although linearized, were in the wrong order. Notice that, in the above, 'make J' (in A) happens after all the messages from J. This is because we print jobs in the order that they finish, and by definition, J must finish before the job that started J can finish. If we tried to print it in a more reasonable order (topmost job first, dependencies next, etc), then we couldn't print any incremental logs at all: A is guaranteed to finish last, but we want to print it first. This is all very logical when you think deeply about it, but trust me, it gets tedious explaining it to every new developer on your team, who just wants to know why the time is flowing backwards.

So we used loglinear, and it was a necessary evil, especially for viewing autobuilder logs, but nobody liked it. I dreamed of a better way.

Back to 2018: redo-log

I've had many years to contemplate my 2012 sins, and I have good news: I finally figured out how to do it right. Not only that, but instead of introducing a weird tool that you have to hack into your makefiles (and hack I did, oh boy did I ever, to make buildroot parallelize things the way I wanted), I've helpfully integrated the magic directly into redo. And not only that, but I've updated buildroot to use redo so that not only can you get linearized logs, but you can get faster buildroot startup time and faster/better buildroot dependencies too.

(Dear buildroot team: If you're reading this, I was going to send this patch to your mailing list, but it's not ready for prime time, or even code review, yet. I'd love to hear your feedback if you have any.)

redo-log takes a totally different approach from loglinear:

  • It saves the log for each target persistently to its own file, so you can look at it again later.
  • Rather than a flat list of log files, it tracks their tree order.
  • It prints log messages "depth first" instead of "breadth first," for less burstiness.
  • It prints output in the order dependencies were launched, instead of the order in which they were finished.
  • It can helpfully indent log messages based on their recursion level.
  • Since we persist logs anyway, we reserve the right to simply not print messages from some irrelevant targets when an error happens. You can always pull up the logs later if you care.

In other words, the logs from our earlier build now look like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     ...X stuff...
  X:     exit 0
  J:   redo Y
  Y:     ...Y stuff...
  Y:     exit 0
  J:   redo Z
  Z:     ...Z stuff...
  Z:     exit 0
  J:   ...more J stuff...
  J:   exit 0
  A: exit 0

The important realization - which is maybe obvious to you, but it wasn't obvious to me - is that, if you decide to do a depth-first traversal of log messages, the "deepest" one that is still running will continue producing incremental messages until it finishes. There's no need to buffer them!

During that time, other parallel branches of the tree will also be producing messages, which we do buffer until later. So Z might finish before X, but we just print the messages from X as they come out, until X is done. Then we go back to J, which sends us to Y, which we follow until it's done. When we get to Z, which is done already, we just print all its enqueued messages in one big blast, then move on.

An interesting invariant here is that it doesn't matter whether X, Y, or Z finishes first. If they each print their own messages (including launching their own subtasks) in a reproducible order, then no matter how the CPU schedules them, the total output will be in a reproducible order. This has the almost-impossible-sounding property that a set of "reproducible build" steps will produce a byte-for-byte reproducible log, even in the presence of unlimited parallelism.

The tricks go a little deeper. Let's say X, Y, and Z all depend on Q. Because of how .do scripts work, they will each run redo-ifchange Q at some undefined time in their respective build scripts. We only need to build Q once, but we don't know which of X, Y, or Z will be the one to do it. This is where the persistent logs come in; we don't actually care! Effectively the log is a DAG (directed acyclic graph, the same kind of structure used in git) with multiple links to Q. Its structure is like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Q:       ...build Q...
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Q:       ...build Q...
  Z:     ...Z stuff...
  J:   ...more J stuff...
  A: exit 0

Of course we only ran Q once, so it's silly to print its output more than once. Let's trim it:

  A: redo J
  J:   ...stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Z:     ...Z stuff...
  J:   ...stuff...
  A: exit 0

Because of our depth-first traversal rule, the log will always look exactly like that - even if job Q was "actually" launched by job Y and not X. redo-log prints logs in dependency order.

After the build finishes, though, you might want to investigate exactly how Z got built. To do that, you run redo-log Z, which prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

In this case, we can show the steps for job Q as a subtree of Z, even though Q was actually built by Y, because it's not redundant when we're not printing Y.

One more complication arises if one of Z's dependencies changes and we need to rebuild Z, but Q has not changed. If we do that, then the "honest" redo log for the incremental rebuild of Z looks like this:

  Z: redo-ifchange Q  [nothing happens]
  Z: ...Z stuff...

But depending on what you're doing - for example, if you want to see whether the "reproducible log" for an incremental build of your whole reproducible-build project matches a from-scratch build - it might make sense to show where Q came from. This is redo-log's -u option ("recurse into unchanged targets"), which then prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

...in other words, the exact same log as you got when you built Z the first time.

Conclusion

I'm sure almost everyone reading this thinks I'm hopelessly pedantic to care so much about the sequence of lines of output in my build logs. You're right! But you're getting off easy, because you didn't have to live through my obsessing over LED blink synchronization across a lab full of wifi routers. (Useless trivia: I found at least three bugs in openntpd by noticing the LEDs in our lab were not all blinking uniformly.)

And that, my friends, is why tree traversal algorithms are fair game in job interviews.

...

Uh, also, you should try redo. You may also want to see how I redo-ized buildroot. If you're interested, you can join the discussions on the redo-list mailing list.

2018-11-13 »

mtime comparison considered harmful

tl;dr: Rebuilding a target because its mtime is older than the mtimes of its dependencies, like make does, is very error prone. redo does it better, and so can you.

A recent twitter discussion (pro tip: never do those) led me to realize that I have studied the problem of mtime comparison in considerably more depth than most people. I don't know whether to be proud of this or very concerned, but nevertheless, here we are. Soon, you'll know everything I do about the topic. I expect you will regret that as much as I have.

What is an mtime, anyway?

mtime is the "modified time" for the content associated with a given file. Generally, if anyone writes bytes anywhere in a file, the mtime will be updated. If a file has more than one name (ie. it's hardlinked to more than one place), all the names share the same inode and content, and thus all share the same mtime.

Annoyingly, when you update the content of a file, the mtime of its containing directory is not changed. All sorts of very convenient tree traversals would be possible if the directory mtime were updated (recursively to the root) when contained files changed, but no. This is probably because of hardlinks: since the kernel doesn't, in general, know all the filenames of an open file, it literally cannot update all the containing directories because it doesn't know what they are either. And anyway, purists might argue that the "content" of a directory doesn't change when the files it points to change; the content is merely a list of filenames and inode numbers, after all, and those stay the same, no matter what happens inside those inodes. Purists make me sad.
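
You can see the asymmetry with a quick experiment (GNU stat shown; the exact timestamps will differ on your machine):

  $ mkdir d && touch d/file
  $ stat -c %y d           # note the directory's mtime
  $ sleep 1; echo hi >d/file
  $ stat -c %y d           # unchanged: writing to d/file doesn't touch d
  $ stat -c %y d/file      # the file's own mtime did change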

(Random side note: on MacOS, the kernel does know all the filenames of a hardlink, because hardlinks are secretly implemented as fancy symlink-like data structures. You normally don't see any symptoms of this except that hardlinks are suspiciously slow on MacOS. But in exchange for the slowness, the kernel actually can look up all filenames of a hardlink if it wants. I think this has something to do with Aliases and finding .app files even if they move around, or something.)

Related to mtime is the ctime, which most people would guess means "create time," but it absolutely does not. It means "attribute change time," which is different from the "modified time" because it updates whenever various inode fields change, not just the file contents. mtime is one of the inode fields, so whenever mtime changes, ctime also changes, but not vice versa. Among other things, ctime changes when file ownership, size, or link count change.

Link count is especially interesting: if you create or delete a hardlink to a given file, its ctime changes. Renaming is defined as creating a new hardlink and then removing another one, which means it updates the ctime (not the mtime), even though when it finishes, the link count is back to normal so the inode looks unchanged (other than the ctime). (Whether rename's create and unlink are supposed to be a single atomic transaction is a subject of much debate.)

So anyway, ctime changes much more sensitively than mtime. It turns out that mostly you don't care about the changes ctime measures, so it causes false positives, especially because of that pesky link count, but if you're paranoid, this can be helpful. Let's mostly talk about mtimes for now.

For completeness, there is also the atime, which means "access time." Originally, this would update whenever anyone "accessed" a file, usually defined as reading bytes from it. But this is unhelpful for two reasons: first, it means reading a filesystem causes writes to that filesystem, which greatly increases disk load (some people estimate by ~30%). Secondly, the definition of "access time" does not match what end users mean, which means various programs (especially backup software and search engines) try to avoid updating it. This workaround is so common that Linux added an O_NOATIME flag to open(2) to prevent updating atime. The default atime performance hit is so bad that many filesystems now have a relatime mount flag which decreases the precision of atime, thus reducing disk load. (Trivia: the Debian popularity-contest, which I started long ago, uses atime to figure out which installed packages you actually use.) (More trivia: if you mount your filesystem readonly, it is technically not POSIX compliant anymore because the atimes won't be updated.)

Popular misconceptions about mtime

  • How precise is it? It depends on the OS and filesystem. Originally, mtime had a precision of one second, which is all you can safely rely on. Nowadays most OSes have a stat(2) syscall that returns a struct timespec, which contains nanoseconds, but almost no filesystems provide that level of precision, and it depends on your kernel and disk format. For example, on my system (Debian Linux 4.9.0-7 with ext4), I get about 0.01s granularity. Stackoverflow has an explanation.

  • Is mtime monotonically increasing? No, it can go backwards. For example, the utimes(2) syscall, used by the touch command, can set the mtime to any value at all. (And tar might do this when extracting a tarball, for example.) If your system clock jumps from one time to another, it will set subsequent mtimes to match the new clock, even if the jump was backwards. And so on.

  • Does mtime get set to >= the current time? No, this depends on clock granularity. For example, gettimeofday() can return times in microseconds on my system, but ext4 rounds timestamps down to the previous ~10ms (but not exactly 10ms) increment, with the surprising result that a newly-created file is almost always created in the past:

      $ python -c "
      import os, time
      t0 = time.time()
      open('testfile', 'w').close()
      print os.stat('testfile').st_mtime - t0
      "
    
      -0.00234484672546
    
  • Does mtime get set to <= the current time? No, it might be set to a future time. For example, imagine you have an NFS server whose clock is set 5 seconds in the future relative to your client. The mtime is assigned by the server, so when you create the file, its mtime will be 5 seconds in the future. (Changing the standard so that mtime is set by the client doesn't really help: then programs running on the server will see a file 5 seconds in the past. And relying on ntpd isn't perfect either: it can only reduce clock skew between machines, not eliminate it.) For extra inconsistency, if a client uses utimes(2) to force the time to a particular value, this gets passed through to the server unchanged.

  • Is mtime always nonzero? No. Various cheaply-written virtual filesystems, like many fuse-based ones, don't bother setting mtime.

  • Does a changed mtime guarantee that a file has different content? No. Perhaps you wrote a block that happened to be identical to the block that already existed at that point in the file; the mtime changes anyway. Perhaps you wrote a block and then changed it back; the mtime changes twice.

  • Does changed content guarantee a changed mtime? No. Clock skew, low precision, or utimes(2) can cause an mtime to be the same as last time you checked. (This is also true for ctime, etc.)

  • Do version control systems like git save the mtime? No, not really. The tree and blob objects stored by git contain no timestamp information at all. (This is very good for deduplication.) commit objects contain various timestamps (commit time, author time, etc), and you could use that to reverse-engineer a guess for the mtime of a given file: the commit time of the most recent commit that changed that file's content, for example. But that's not what people do, mostly because it creates problems with make, which we'll get to shortly. (Git doesn't have the dangerous mtime-setting feature built in, but it does seem to exist in svn. You probably still shouldn't do it.)

    (This all creates interesting philosophical questions. Is the "last modified" time of a file the time when the new content itself was created, or when this particular instance of it was written to disk? If you had a sci-fi device that could make a perfect scan of my physical being and run me in a simulation, what would be the mtime of the input file? And so on.)

    (The bup project I started, which uses a git-formatted repo to back up your filesystem, does need to save mtime and other metadata. It stores metadata in separate hidden files in the git tree and reapplies it at restore time.)

  • Does switching branches in git screw up the mtime? No, not more than anything else. git just rewrites the changed files and lets the kernel update the mtime, so they look as if someone edited them with a text editor.

  • Does writing to a file via mmap() update the mtime? Hah. Well, maybe. See, POSIX guarantees that the mtime "will be marked for update at some point in the interval between a write reference to the mapped region and the next call to msync() ... If there is no such call, these fields may be marked for update at any time after a write reference." This definition actually leaves a lot of leeway for weirdness.

    I wrote a little test program (mmap_test.c) to check how this works nowadays, and, of course, it varies between OSes. On Linux (4.9.0, ext4), the mtime is updated at the first page dirty after an mmap() or msync(). On FreeBSD (11.2, ufs), it's updated at msync() or munmap() time. On MacOS (10.11.6), it updates only at msync() time, not at munmap() time. I even tried with the "WSL personality" (4.4.0-17134-Microsoft) on Windows 10, which had especially terrible results: mmaped writes never updated the mtime at all.

    I think the MacOS behaviour is allowed because the spec says "may" instead of "will" in that second sentence, but it's a stretch. The Linux behaviour may be illegal depending how you define "a write reference"; Linux seems to interpret it as "the first" or "a randomly selected" write reference, while I would expect to interpret it as "each" write reference (with the result that mtime must be updated at least once between the last reference and the msync(), which would be fine).

    Of all these, the only useful behaviour seems to be FreeBSD's; at minimum, we surely want mtime to be updated at least once after all changes to a file have been done. MacOS and Linux don't always do so, and WSL never does so. This lends credence to the claim that the .git/index file, which uses mmap, is synced incorrectly by file sync tools relying on mtime. Ironically, the faster and better the file sync tool, the more likely it is to hit the race condition. An easy fix would be to have git always write() a useless byte before closing the index file. But I'd prefer if the kernel were less dumb.

Okay! That's the introduction. Now let's move on to application.

mtimes and make

I've kinda ruined the surprise by listing the caveats above. But let's look at what all that means when we try to use mtime for something.

make dependencies work in a very simple way. Now that we, as an industry, have decades of experience learning all the above caveats, we might describe it as "naive" in the sense that, when make was first invented, nobody had heard of all these problems, so it would be unfair to expect the author to design around them. In the world where make was first written:

  • there was no NFS;
  • there was no mmap;
  • there was no version control;
  • there were no fuse filesystems;
  • computers and compilers were so slow that a one-second timestamp granularity was never a problem.

In that world, they made the seemingly obvious decision to rebuild any target if the mtime of any of its dependencies was > the mtime of the target. (If you want to be extra safe in the presence of granularity problems, rebuild if >= rather than >.) This was an exciting innovation at the time.
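
In shell terms, the entire algorithm is the -nt ("newer than") operator from test(1). Here's a sketch of what make effectively does for each target (the filenames are illustrative):

  # rebuild 'prog' if any dependency has a newer mtime (or prog is missing)
  for dep in foo.c bar.c; do
    if [ "$dep" -nt prog ]; then
      cc -o prog foo.c bar.c
      break
    fi
  done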

Unfortunately, we now know that this can lead to numerous mistakes:

  • with NFS and clock skew, if a source file is edited on one machine and you run make on another, the input file might have mtime < target mtime, so nothing will happen. Or, you might rebuild the target and its mtime will still be < source mtime, so it'll be rebuilt again later.

  • If you accidentally set your system clock ahead by a day and build some stuff, then set your clock back to the present, all the stuff you built during that time will show up as "in the future" and therefore newer than source files you edit today, preventing all rebuilds. (Eventually GNU make started detecting future-dated files and printing a warning.)

  • If you have files modified through mmap, the mtime might not be up to date. (Luckily mmap is a rarity when editing source files or building software. Usually you aren't directly using live databases as your source files.)

  • If you replace one source file with another, eg.

      mv foo.c foo.c.bak
      mv foo.c.new foo.c
    

    then the mtime is not updated, and make will see the old mtime of foo.c.new. That might be older than your foo binary, even though the binary does not yet contain the new foo.c. It won't be rebuilt.

  • If you have a dependency like

      foo.a: $(patsubst %.c,%.o,$(wildcard *.c))
    

    (ie. produce foo.a from all the .o files built from all the C source files), then if one of the source files is deleted, it will no longer be one of the dependencies at all. But all the remaining dependencies are still older than foo.a, so foo.a will not be rebuilt.

  • If you put automake/autoconf-generated files (like ./configure and Makefile) in version control, you can get surprising results. Let's say automake has a Makefile rule to regenerate Makefile whenever the automake input files (eg. Makefile.am) change. In a tarball, which preserves mtimes, this will work, because Makefile will be newer than Makefile.am. But in a version control system, which uses the default kernel-assigned mtime when writing the files, it's undefined whether Makefile or Makefile.am is written first. If your timestamps are high precision (or they're low precision and you get unlucky), then Makefile could be "older" than Makefile.am, and automake will try to run anyway. Or if not, then it won't. So different people checking out the same source code will get different results based on random luck.

  • Computers are now so fast that you can save foo.c in your editor, and then produce foo.o, and then compile foo, all in the same one-second time period. If you do this and, say, save foo.c twice in the same second (and you have one-second granularity mtimes), then make can't tell if foo.o and foo are up to date or not. (As above, make can work around this by assuming if source mtime == target mtime, the target still needs to be rebuilt. This could cause spurious rebuilds, but is less dangerous than missing rebuilds.)

    (This often happens if you're using one of those fancy new inotify-based tools that fires off a compile immediately, every time you hit save in your editor. Typescript does something like this, for example, as do auto-reloaders for various modern web languages. Symptom: needing to save your source file twice before the autocompiler catches it. And it happens more on MacOS, which has 1-second mtime granularity, than on Linux, which has 0.01-second mtimes.)

  • If your source files are in a virtual filesystem where mtime is always 0, then make will always think your source files have not changed and the target will never rebuild.

While we're here, there are some other common problems that aren't really the fault of mtime, but are common dependency problems with make:

  • If you upgrade your toolchain (eg. your C compiler), make doesn't know to rebuild your source files, unless you declare an explicit dependency on the toolchain files, which nobody does because it's hard to write that system-dependent stuff as a Makefile dependency rule. (This is one reason autoconf needs to be a ./configure script that generates a Makefile, instead of just a dependency executed by your Makefile.)

    For that matter, when you update your toolchain, it's often from a distro-provided package (basically a tarball) with timestamps helpfully in the past, which are probably older than all your output files. So make won't see it as updated anyway!

  • If you pass variables on the make command line, like CFLAGS=-O2, they will usually not be part of a dependency and so won't cause a rebuild, and you'll end up with programs built halfway with the old flags, and halfway with the new ones. You can fix this by writing CFLAGS to a file, atomically replacing it only if the content differs, and depending on that file (a sketch follows this list). But nobody does.

  • If you modify the Makefile, make will not by default rebuild any targets. You can fix this by adding an explicit dependency on Makefile, but this is a giant pain during development, because Makefile contains all your build rules; you don't want to recompile every source file just because you changed the linker command line, for example. (Some nowadays-rare versions of make actually tried to track Makefile changes, per rule, and cause rebuilds for these cases.)
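
The CFLAGS trick mentioned above is worth spelling out. A sketch: regenerate the flags file on every build, but only replace it (and thus bump its mtime) when the content actually changed:

  # refresh cflags.txt without disturbing its mtime if nothing changed
  echo "$CFLAGS" >cflags.txt.new
  if cmp -s cflags.txt.new cflags.txt; then
    rm -f cflags.txt.new           # same content: keep the old mtime
  else
    mv cflags.txt.new cflags.txt   # new content: dependents will rebuild
  fi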

make is not the only program that is affected by naive use of mtime. It's fairly common. For example, Go had so much trouble that they recently changed the Go compiler to just read and hash all the input files every time it runs. (Thanks to bradfitz for this link.)

redo: mtime dependencies done right

I happened to be aware of all these problems (well, not the mmap() madness; bleah!) when I set out to write redo so many years ago. I was also influenced by djb's design for redo, in which he writes, "When redo is asked to create a file that it hasn't heard of before, it presumes that the file is a source file if it exists, or a target file otherwise. In the second case (new target), redo immediately saves this decision to disk."

In other words, redo's design fundamentally depends on keeping a database of targets, if only to remember which files were produced by redo and which were not. From there, it's easy enough to extend that database to include mtime information about sources. And from there, we can add a bit more metadata to make the timestamp even more reliable.

My implementation of redo remembers the following information about each source and target:

  • mtime
  • size
  • inode number
  • file mode
  • owner uid and gid
  • (targets only) the sequence number of the last time it was built
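
Most of those fields come from a single stat(2) call, so collecting them is cheap. As a rough sketch, GNU stat can print an equivalent stamp in one line (my implementation records the fields in its database rather than as a string):

  # mtime, size, inode, raw mode, uid, gid: if any differs from what was
  # recorded at build time, the dependency is considered changed
  stat -c '%Y %s %i %f %u %g' somefile.c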

redo considers a dependency dirty if any of those attributes changed since the last time a target was built. Notice how this dodges the various problems of mtime skew:

  • NFS client/server time skew doesn't matter; as long as the mtime changes in any direction, it's fine.

  • mmap() weirdness is reduced, because we notice changes in file size, as well as source mtimes that changed but are still older than the target.

  • If you mv a file to replace another, it will have a different inode number, which we notice. It also probably has a different size and (even if not newer than the target) mtime, any of which are sufficient.

  • Because redo has a database of all the dependencies used to produce a given target, if one of those inputs disappears, the target needs to be rebuilt. make doesn't remember the dependencies used last time, it only remembers the dependencies declared this time, so it can miss important changes in the list of dependencies.

    (More generally, it's an interesting mathematical phenomenon that to correctly build software, we need to know not only the dependencies as they are now, but as they were before. Those two lists are used very differently. I don't think most build systems are designed with this realization, and it leads to subtle failures.)

  • If you put autoconf/automake generated files in your source repo, redo will "presume that the file is a source file," make a note of that, and not rebuild it. (It's still probably not a great idea to check those into version control. But at least now your build system won't go crazy.) If you then delete them, redo will consider them targets to be built.

  • redo has special treatment of source files whose mtime == the target mtime, so it can correct for overlaps even when your filesystem has very coarse timestamp granularity. Also, if you continue editing a source file, it will usually end up with a changed size, which also marks it as changed.

  • If your source files are in a braindead fuse filesystem, redo can use inode number and size to detect changes (although it still sucks and you should fix your fuse filesystem).

We can also fix the non-mtime-related missing dependencies:

  • It's easy to declare dependencies on your toolchain, because the rule for each target can track which parts of the toolchain were used while building, then retroactively declare a dependency on those (see the sketch after this list). And we still notice a change if the new mtimes are in the past.

  • redo doesn't allow you to set variables on the command line; you have to write them to a file instead. This lets you easily declare dependencies on the file.

  • Since rules are written in separate .do files instead of one big Makefile, it's reasonable for redo to auto-declare a dependency on the .do file it used for a given target. When you edit a rule, the affected targets are automatically rebuilt.
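
For the toolchain case (item one above), the retroactive declaration can be a single extra line at the end of a compile rule. A sketch (paths vary by system):

  cc -o "$3" -c "$2.c"
  # we just used cc, so declare a dependency on the actual compiler binary;
  # upgrading the toolchain then triggers a rebuild, whatever its mtime
  redo-ifchange "$(command -v cc)"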

I mentioned above that the Go compiler had problems with naive mtime-based dependency checking. I don't expect Go to switch to redo, but they could solve their problems in a similar way: generate a "database" (which might just be a text file) at build time. In the database, list the source files and their stamp information (mtime, inode, etc). Also list the toolchain version and relevant command line flags. Next time, read the database and compare against the new list of source files, the new stamps, and the new flags. If any are different, run the build. (Of course, all this is just a performance optimization that allows the compiler to avoid opening and reading files unnecessarily. The Go developers might reasonably continue to opt for the slower choice with fewer edge cases.)

Why not use checksums instead of mtimes?

Inevitably when the discussion of build dependencies comes up, someone who has heard part of the above story (usually some of the make problems caused by mtime comparisons) suggests throwing away mtimes entirely and always doing dependencies based on file checksums.

This can work, sometimes. And wow, I love checksums a lot (I wrote bup after all). But it isn't perfect for every situation.

As a clue to how complicated this can get: most people talking about this option suggest checksums as a way to avoid false negatives, ie., failing to rebuild when a source file has changed. But inode attributes change, in theory, at least as often as the content hash changes. Checksums are more useful for reducing false positives (ie. to avoid rebuilding in situations where we know the output will be identical). If someone is talking to you about rebuilding based on checksums, ask if they have thought about that difference.

Anyway, here are some specific problems with checksum-based dependencies:

  • Sometimes building a target has side effects. For example, imagine you have a redo rule for deploying a container to AWS. This does not really produce a "file" locally that you can checksum; it usually produces just log messages, or blank output, and the checksum of that will usually not change. Now, imagine you have a second container that you want to deploy only if the first container gets deployed correctly. If the checksum of the first container deployment is unchanged, the second one will think all its dependencies are unchanged, and not run, which might be incorrect. There are numerous other examples of side effects where this always-use-checksums behaviour is undesirable.

    (On the other hand, some systems out there, like blaze/bazel, specialize in build systems without side effects. In that case a pure-checksum system is more appropriate. But then you have to escape from such systems if you want to do fun stuff like deploying containers. You end up punting the dependency problem elsewhere.)

  • Checksumming every output after building it is somewhat slow. This requires the build system to read the whole content of the file and do some math on it. Mostly this is not too serious: the file is probably already in disk cache (since you just wrote it a moment ago!) and calculating a checksum is almost always much faster than generating the file in the first place. And it only happens when a build was needed, which is expensive anyway. But it does add time to every build step.

  • Checksumming every input file before building is very slow. If you're considering whether to rebuild foo.a, and foo.a depends on *.o, and each .o depends on each .c, then you have to checksum the full content of every .c file every time you consider making an incremental build of foo.a. In large projects, this could be thousands, or tens of thousands, of files, each of which we have to open(), read(), checksum, and close(), possibly over a network filesystem. For small projects this is fine, but for large projects, this sucks a lot.

    blaze/bazel come from a world where source files are stored in a virtual filesystem, which happens to have the ability to tell you a precalculated checksum for every source file (except the ones you've changed locally). If you only have to checksum your locally-changed files on each build, that'll be very fast. But you need filesystem support to make this possible, and we can't assume that everywhere.

redo does support checksum-based dependencies, but it avoids the above problems as much as possible:

  • If you do nothing, redo uses database-mtime-based dependency checking, which is extremely fast on all operating systems. It's even reasonably fast on NFS.

  • redo-stamp lets you provide, after building a target, the data used to calculate that target's checksum (which might differ from the target itself, if you want).

  • redo-stamp records the checksum in its database after building a target. Any downstream target remembers that checksum in its list of dependencies; if it changes later, then the downstream target needs to be rebuilt. There is no need to actually recalculate any checksums when checking dependencies in the future. No special filesystem support is needed.

So you can use redo-stamp, in appropriate places, to reduce false positives in a way that causes overhead only at build time (not for checking dependencies later), and only for targets that need it.
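
In a .do script, using it is typically one extra line. For the container-deploy example from earlier, something like this (deploy-container and get-deployed-version are hypothetical commands; $3 is redo's temporary output file, and redo-stamp reads the stamp data from stdin):

  # deploy.do: deploy, then stamp on the deployed version identifier
  # instead of on our noisy, ever-changing log output
  deploy-container mycontainer
  get-deployed-version mycontainer >"$3"
  redo-stamp <"$3"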

That mmap() behaviour though. Seriously.

2018-11-17 »

The good feeling from bad feedback

I'm working on (statistically speaking) my least interesting work in years.

Previously I spent time doing wifi drivers, boot scripts, logfile processors, payment systems, and project estimation, so one might reasonably have assumed I can't get much more boring. But hah! We have standards to exceed.

Here's what's weird though: I'm enjoying it.

There's this Feeling I get, very rarely, when I'm sure I'm on the right track. Over the last several years, I've almost had the Feeling occasionally, but not very often. It's been so long since I felt it that I actually forgot what it feels like. Talking to software people, I strongly suspect some have never felt it at all.

A common question, when recent graduates ask me for advice after landing their dream job, is of the form, "Is it... always like this? When I started to learn coding, I really liked it. I kind of assumed work would be... better." What they mean isn't that coding sucks, but that their project is unfulfilling. They suspect the Feeling exists. Maybe they remember having it as a kid. They thought they were almost there again. Life is all lined up: great school, great grades, great employer at a great salary, great co-workers. And then... nothing. Is it always like... this?

No, it's not. Not always.

But a lot.

...

Yesterday, someone on a mailing list told me (politely) [1] that my documentation sucked. A few days before, someone called my code spaghetti. And the thing is... YES. Yes, I am a pretty good writer, but that documentation is not my best work. That code is some of the best I've written, but it's also the most incomprehensible, and I've known for years how to make it better, but I haven't gotten around to it. It's refreshing to have a project with no stakes, with volunteers who care about elegance, where I can hear stuff like that. I live on that. It's what makes me go, and when I finally do get it right, it means the positive feedback is real.

It's been years since someone told me my work sucked. And that was one thing, I can live with that, it's West Coast People, but in the last year or so, the compliments have been egregious. People's acceptance of my work and my opinions had a lot more to do with my reputation (among a certain small group) than about my quality. It's easy to get addicted to that, to let it take the place of the Feeling, but in the end it's just candy. The sugar high fades, and too soon, you need more sugar.

That's why I had to change gears. The withdrawal process has been a bit painful, but it's worth it. Maybe my project won't amount to anything. But if it dies, at least it'll die beautiful.

Footnote

[1] Just to be clear, constructive negative feedback is important, but being a jerk is not. There's no need to go on a swearing/ranting angry rampage when things go wrong. It doesn't help anyway. But pretending bad things are good things doesn't help either.

2018-11-25 »

Xnest, Xephyr, ChromeOS, synergy, and syncing some clipboards

I recently decided to switch my laptop from a Macbook to a Chromebook, partly because Apple's keyboards are so terrible lately, and partly because ChromeOS is suddenly useful now that they invented Crostini.

(Some people ask why I, a person who actually knows how to use Linux and has debugged wifi drivers and XF86Config files, would want to use a locked-down desktop Linux variant instead of just installing Debian or something. And I do install Debian, on desktop hardware. But on a laptop, hardware support is paramount: external monitors (eg. for presentations), bluetooth audio (for music while travelling), long battery life, and rapid, non-crashy suspend/resume, are all really important to me. ChromeOS actually does all that stuff reliably nowadays, because they design the OS and the hardware at the same time. Debian can't compete with that.)

One showstopper for me when I'm trying to do software development, however, is having a proper window manager, which is to say, one that I can run without resorting to a mouse or touchpad. Because I am old and crusty and unreasonably opinionated, the one I want to run is ion1. I had it working on MacOS, but I wanted it on ChromeOS.

Now, modern ChromeOS uses Wayland as its display protocol, not X11, which is of some concern because ion1 stopped evolving more than a decade ago (which is how I like it) and therefore only understands X11. Also, ChromeOS provides a Wayland compositor and doesn't let you replace it from Crostini, even though they do let you securely launch Wayland and X11 windows from Crostini (which is pretty cool).

Do we give up? No! The "obvious" "solution" is Xnest, an "X proxy" from before the dawn of time, which puts all your windows inside one big window. So I created one big full-screen Xnest window (managed by Wayland), and then ran ion1 and a bunch of rxvt terminals inside.

This actually worked almost right, except: ion1 doesn't support these fancypants client-rendered fonts. No. It's old and crusty, like me. It expects the X server to render its fonts. And unfortunately, ChromeOS contains only about four fonts in its X server, all of which are hopelessly microscopic on the 200dpi screen in my Chromebook. Oops.

Luckily, about 11 years ago, slightly after the dawn of time, someone else didn't like some Xnest limitations and made Xephyr, which apparently is more of a framebuffer and less of a proxy, the upshot of which is that it renders its own "server side" fonts (which from Wayland's point of view are on the client side, but from ion1's point of view are definitely on the server side). As a bonus, thanks to xrandr, it understands the idea of having its window resized, so I can use the handy ChromeOS "full screen" key and have it do something nice. Or drag it to an external monitor and get decent results.
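
The whole arrangement boils down to something like a couple of commands (the display number is arbitrary, and your exact launcher scripts will vary):

  Xephyr :1 -fullscreen &
  DISPLAY=:1 ion &
  DISPLAY=:1 rxvt &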

I didn't try to do anything with 3D, but nominally Xephyr can do that too. But Crostini supposedly can't. I don't know and I don't really care, I'm just trying to run some terminals here.

Xephyr and its DPI calculation

One weird problem I've had with Xephyr is that whatever it's doing to calculate "dots per inch" (as reported by xdpyinfo) is completely insane. It starts off with a value that is definitely not the same as its host display, and then if you resize the window, it just changes the value and gets more and more confused. This is bad for people who want to specify their font size in points so that fonts will be roughly the same size no matter what size the display is.

As far as I can tell, this is just a bug in Xephyr. But if anybody knows what's going on or (especially) how to fix it, I'd love to know. Meanwhile, I learned to specify my font sizes in pixels instead of points.

Cut and paste

An even more annoying problem, which deserves its own section, is the question of how to deal with cut-and-paste between my Xephyr+ion1+rxvt session and the toplevel Wayland session. This is needed for two reasons:

  1. I want to copy URLs and text between my web browser and my terminals.

  2. I want to be able to run multiple Xephyr sessions and share text between them.

By default, Xephyr appears to have no clipboard sync at all between its internal clipboard and its host server's clipboard. That's no fun. (To their credit, ChromeOS does seem to manage to sync the clipboard between Wayland and its toplevel XWayland session, which is essential if we want anything to work. It's just Xnest and Xephyr that break the chain.)

Now, there are a few things you should know about X11 clipboards. The canonical explanation is jwz's X Selections, Cut Buffers, and Kill Rings, which is quite excellent and gives some background on how it's not the X11 clipboard that's crazy, it's all the apps using it.

So anyway, with that background in mind, all we need to do is magically keep the clipboard in sync between :0 (the toplevel XWayland server, which is synced with Wayland and Chrome), and :1 (inside my Xephyr server), and ideally :2 .. :n (inside other nested Xephyr servers). How hard can it be?

Well, apparently it can be hard. The best answers I could find on the Internet (which I won't link to, because they suck) are:

  1. Run a script that uses xclip to periodically grab the clipboard content from each server. If one server has different clipboard content than the currently-expected content, then copy it to the other server. This method has a few problems: First, you have to choose a periodicity for the sync process, which is inevitably either annoyingly long or battery-killingly short. Second, the "content based" sync decision is rather error prone and results in potentially unstable race conditions, especially with multi-way sync. And third, typical implementations are a little too pushy about copying the clipboard data to all screens: jwz's article talks about this problem in reference to the "X cut buffer" support before the new-style support was added. If you highlight/copy text frequently or in large volumes, it's pretty wasteful to copy it to other screens before it's needed for pasting.

  2. Run synergy, a tool that lets you seamlessly extend your mouse/keyboard/clipboard across multiple displays on multiple computers. This was very tempting, despite being severe overkill (I don't want to extend my mouse and keyboard, just my clipboard). Unfortunately, it didn't work. It almost worked. But it didn't.

Luckily(?) for you, I spent quite some time diagnosing why synergy didn't work for me in my use case. The symptom was that it would sync the clipboard in only one direction (say A->B), and only the first time I copied something. If I copied another thing on A, the clipboard on B would not be updated. To make it update, I had to copy something on B (which always fails to sync to A), and then copy something on A (which would work).

How hard can it be? I thought to myself, again, foolishly, and decided to read the source code.

Now, the synergy source code is actually pretty good. It has a nice abstraction layer for the various clipboard types in X, MacOS, and Windows. It's pretty easy to follow. It has a bit too few debug trace messages, but okay, those are easy enough to add as we go.

Unfortunately, synergy's clipboard support has two fatal design flaws:

  1. Like the periodic xclip case above, it grabs a copy of clipboard data right away when the clipboard ownership changes. It's better than a naive xclip script, because it actually gets a notification when the clipboard ownership changes, rather than polling periodically. Unfortunately, those notifications are also its downfall. See, in X11, there is only a clipboard notification when the clipboard owner changes, not when the content changes. If I copy text from rxvt, it will grab the clipboard. Synergy will notice this and read the clipboard. But if I then copy different text in rxvt, the owner doesn't change, so there is no notification, so Synergy doesn't re-copy it. That explains why it only worked the first time. (It also explains why copying on B and then on A causes it to work again exactly once: the clipboard ownership changes.) (This bug may not be visible on all terminals. If rxvt would give up the clipboard, then take it back, every time I made a copy, it would work around this problem.) (I think VNC's clipboard sync has/had the same problem.)

  2. Synergy, because it's mostly a keyboard/mouse sharing app, maintains the concept of a "current screen." That is, it watches which screen currently has the mouse pointer, and only replicates the clipboard to that screen. This is a performance optimization: since it (like the poorly designed "x cut buffer" mentioned by jwz) takes a copy every time the clipboard changes, it doesn't want to replicate this to screens where you're not using it. Unfortunately, since my screens are nested, I had to disable the keyboard/mouse sharing feature, which also leaves the "current screen" incorrect exactly half the time, which is why the clipboard fails to replicate from B->A and only works from A->B.

I was willing to try to fix some minor clipboard bugs in synergy, but I gave up when I realized this design (grab and replicate the content as soon as clipboard owner changes) was never going to work well with rxvt. That's when I gave up and decided to write my own trivial clipboard syncing tool, based on all the otherwise-useless trivia I had acquired while investigating the above.

The result is xclipsync, and, other than omitting non-text clipboard formats, I think I did it right.

  1. It starts up by taking ownership of the clipboard on display A.
  2. When it loses clipboard ownership on A, it takes ownership of the clipboard on display B.
  3. When it loses clipboard ownership on B, it goes back to step 1.
  4. When it receives a request for clipboard contents (which should be someone requesting a paste), it then reads the clipboard content from the display that it doesn't currently own, and forwards it along.

And that's it!

This avoids the problem of a single owner changing their clipboard content (since it grabs content only on demand). It doesn't do extra work if you copy content without pasting. It actually does nothing at all if you do a lot of work on one display: it loses the clipboard content on that display, which means it does nothing at all until you use the clipboard on the other display. It doesn't ever poll anything, so there are no arbitrary delays or race conditions.

And best of all, this algorithm works even for multi-way sync. You can run parallel instances of xclipsync between any two displays, and as long as you don't create any bridging loops, it will do the right thing across all of them. That is, exactly one display will "own" the clipboard, and all the other ones will copy from it. This works because every time you copy something from a new display, exactly one xclipsync instance will lose ownership, which causes it to assert ownership on exactly one display. If another xclipsync is syncing with that display, it will then lose ownership, and assert ownership on exactly one other display, and so on. As long as there are no loops, this process will terminate, and it'll do so very efficiently.

Things that could be better

There's no particular reason xclipsync can't support non-plaintext clip formats. I just didn't implement it, because I didn't need anything but text, since my Xephyr session is just terminals anyway.

xclipsync currently uses tcl/tk to take clipboard ownership (yes!). This would have been unnecessary if xclip had just one more feature: the ability to run a command at paste time, rather than always reading clipboard content from stdin at startup time. Then xclipsync would have been just a couple of (foregrounded) alternating xclip calls in a loop.

Xephyr probably should just implement this exact clipboard syncing protocol internally.

Note that it appears ChromeOS+Wayland is actually implementing some other kind of clipboard sync between Wayland and the toplevel XWayland server. When xclipsync tries to take ownership of the XWayland clipboard, it immediately experiences one "paste" operation and then loses ownership. This might be related to Wayland's inter-process security isolation features. In any case, xclipsync reacts as usual (giving clipboard ownership to XWayland and proxying requests to XWayland from other displays that want to paste) and all is well.

I really wish Alt-Tab would work even when the Xephyr instance is fullscreened in ChromeOS. I understand why they want to let me capture Alt-Tab in my full-screen X apps, but also... I don't want to.

The marketing-driven "Assistant key" on the Pixelbook is a user-hostile disaster in its current form. On the other hand, if they would let me remap it to, say, Meta, it would instantly redeem itself.
