redo, buildroot, and serializing parallel logs

redo, buildroot, and serializing parallel logs

Lately I've been getting back to hacking on my djb redo implementation. We've fixed some problems with file attribute handling on NFS, obscure locking, and MacOS/FreeBSD portability. If you haven't tried redo in a while, you might want to give it another shot.

In case you haven't heard of redo before, here's the overview: it's like make, but with no special syntax (just sh scripts). The first time you "do" a build, it runs a set of recursive sh scripts, once per target. Those scripts can run a command called redo-ifchange, which declares dependencies on the given targets, checks if they are up to date, and if not, recurses into more scripts in order to build them. And that's it!

redo combines the best parts of imperative systems with the best parts of functional systems. The build scripts are all imperative - it just runs commands, and declaring dependencies happens as a side effect of some of those commands (redo-ifchange). When it's done, you have a purely-functional data structure that you can use for extremely fast dependency calculation. (Theoretically as fast as ninja, but my code isn't as optimized.)

[Credit: redo was invented by Daniel J. Bernstein. I merely implemented it.]

Parallelism

Things get a little more complex on modern multicore computers, where you almost always want parallel builds, which means producing different parts of the tree all at once, so in principle, a sequential-imperative tree of sh scripts is no longer the perfect model. Luckily, redo can handle it: if your script does redo-ifchange on more than one target at a time, it'll try to build all those in parallel. Then, if more than one parallel target tries to build a given dependency, it uses inter-process file locking to make sure the dependency only builds once.

But what's this about serializing logs?

I'm sure you already have your favourite build system and it builds things, and it almost certainly handles parallelism for whatever your use case. Even make does parallelism.

Where things tend to fall down is in rendering the output of a parallel build. When we're running a lot of jobs all at once, and blasting them all to stdout/stderr, and one step deep in the tree gets an error, then you might get pages and pages of successful output from other tasks interspersed with your error, making it very hard to figure out what went wrong and why.

There are various approaches to solving that. Some people would argue that the Unix Way is for programs that didn't fail to just print nothing at all; that's how the Go compiler works, for example. If you like that philosophy but you're using tools that don't agree (such as make itself, which prints all kinds of stuff while it works), you could wrap every command in a script that withholds its output, printing it only if the command returns a nonzero exit code.

That's all nice until you have to debug what went wrong. It's not a coincidence that make, which is made by Unix people, does not follow the Unix Way. Makefiles are just too complicated and hard to debug if you can't see what they're doing; and if step 10 goes wrong, you might be very curious about step 9, even though (nominally) it worked. It's not okay to throw away the successful log messages from step 9.

Fine. Parallel make output is flawed and gross. But everyone knows make is flawed and gross, so they switch to other systems. Most other popular build systems are tool-specific. Someone did a lot of work in cmake, for example, to make it print pretty messages during parallel builds of C/C++ programs. That works well. But if you're not building C/C++ programs, it can't help.

redo is a general purpose dependency system like make, so by definition it's going to run scripts which produce a lot of clutter, possibly including instances of make itself, and someone is going to have to debug them. What can we do to sanitize the logs?

Digression from 2012: loglinear

I've actually been thinking about this problem for more than six years already. Back in 2012, I added a log sanitizer script called loglinear (please pause for a moment to admire the pun) to our project's buildroot. (buildroot is a really handy all-in-one tool for building embedded Linux systems from scratch.)

loglinear worked like this: every time we ran a sub-make command like make path/to/project, we'd instead replace it with loglinear make path/to/project. loglinear then prefixes each line of the output with a job name, say path/to/project, and buffers it. When one of the loglinear processes exits, the top-level loglinear process then takes the buffer from that instance and dumps it to the top-level stdout.

Let's do an example. Imagine we're parallel building A, which depends on J, which depends on all of X, Y, and Z. We launch loglinear make A, which starts loglinear make J, which itself starts (all in parallel) loglinear make X, loglinear make Y, and loglinear make Z. J cannot continue until X, Y, and Z are done, but those three might finish in any order, and loglinear will print the output of each one as soon as it's done. So, the output will look something like this:

  Z: ...some stuff...
  Z: exited with code 0

  X: ...
  X: exited with code 0

  Y: ...
  Y: exited with code 0

  J: ...stuff...
  J: make X
  J: make Y
  J: make Z
  J: ...
  J: exited with code 0

  A: make J
  A: exited with code 0

loglinear also had some magic in case one of the processes returned nonzero: in that case, we'd print the successful processes first and the unsuccessful processes last, in the hope that the "most interesting" messages would end up at the bottom.

This made debugging a lot easier, because build messages from entire packages (like, say, the Linux kernel and busybox) separated out instead of interspersed, but it had some flaws. Most importantly, the output was very bursty: it waited until a given job was completely done before it printed anything. When busybox finished, you saw all the busybox logs; when the kernel finished, you saw all the kernel logs. Although useful, this is, frankly, not as fun as watching your 16-core workstation live-blast several screenfuls of compiler log messages per second. It feels slow.

There was also a secondary problem, which is that the messages, although linearized, were in the wrong order. Notice that, in the above, 'make J' (in A) happens after all the messages from J. This is because we print jobs in the order that they finish, and by definition, J must finish before the job that started J can finish. If we tried to print it in a more reasonable order (topmost job first, dependencies next, etc), then we couldn't print any incremental logs at all: A is guaranteed to finish last, but we want to print it first. This is all very logical when you think deeply about it, but trust me, it gets tedious explaining it to every new developer on your team, who just wants to know why the time is flowing backwards.

So we used loglinear, and it was a necessary evil, especially for viewing autobuilder logs, but nobody liked it. I dreamed of a better way.

Back to 2018: redo-log

I've had many years to contemplate my 2012 sins, and I have good news: I finally figured out how to do it right. Not only that, but instead of introducing a weird tool that you have to hack into your makefiles (and hack I did, oh boy did I ever, to make buildroot parallelize things the way I wanted), I've helpfully integrated the magic directly into redo. And not only that, but I've updated buildroot to use redo so that not only can you get linearized logs, but you can get faster buildroot startup time and faster/better buildroot dependencies too.

(Dear buildroot team: If you're reading this, I was going to send this patch to your mailing list, but it's not ready for prime time, or even code review, yet. I'd love to hear your feedback if you have any.)

redo-log takes a totally different approach from loglinear:

It saves the log for each target persistently to its own file, so you can look at it again later.
Rather than a flat list of log files, it tracks their tree order.
It prints log messages "depth first" instead of "breadth first," for less burstiness.
It prints output in the order dependencies were launched, instead of the order in which they were finished.
It can helpfully indent log messages based on their recursion level.
Since we persist logs anyway, we reserve the right to simply not print messages from some irrelevant targets when an error happens. You can always pull up the logs later if you care.

In other words, the logs from our earlier build now look like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     ...X stuff...
  X:     exit 0
  J:   redo Y
  Y:     ...Y stuff...
  Y:     exit 0
  J:   redo Z
  Z:     ...Z stuff...
  Z:     exit 0
  J:   ...more J stuff...
  J:   exit 0
  A: exit 0

The important realization - which is maybe obvious to you, but it wasn't obvious to me - is that, if you decide to do a depth-first traversal of log messages, the "deepest" one that is still running will continue producing incremental messages until it finishes. There's no need to buffer them!

During that time, other parallel branches of the tree will also be producing messages, which we do buffer until later. So Z might finish before X, but we just print the messages from X as they come out, until X is done. Then we go back to J, which sends us to Y, which we follow until it's done. When we get to Z, which is done already, we just print all its enqueued messages in one big blast, then move on.

An interesting invariant here is that it doesn't matter whether X, Y, or Z finishes first. If they each print their own messages (including launching their own subtasks) in a reproducible order, then no matter how the CPU schedules them, the total output will be in a reproducible order. This has the almost-impossible-sounding property that a set of "reproducible build" steps will produce a byte-for-byte reproducible log, even in the presence of unlimited parallelism.

The tricks go a little deeper. Let's say X, Y, and Z all depend on Q. Because of how .do scripts work, they will each run redo-ifchange Q at some undefined time in their respective build scripts. We only need to build Q once, but we don't know which of X, Y, or Z will be the one to do it. This is where the persistent logs come in; we don't actually care! Effectively the log is a DAG (directed acyclic graph, the same kind of structure used in git) with multiple links to Q. Its structure is like this:

  A: redo J
  J:   ...J stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Q:       ...build Q...
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Q:       ...build Q...
  Z:     ...Z stuff...
  J:   ...more J stuff...
  A: exit 0

Of course we only ran Q once, so it's silly to print its output more than once. Let's trim it:

  A: redo J
  J:   ...stuff...
  J:   redo X
  X:     redo Q
  Q:       ...build Q...
  X:     ...X stuff...
  J:   redo Y
  Y:     redo Q
  Y:     ...Y stuff...
  J:   redo Z
  Z:     redo Q
  Z:     ...Z stuff...
  J:   ...stuff...
  A: exit 0

Because of our depth-first traversal rule, the log will always look exactly like that - even if job Q was "actually" launched by job Y and not X. redo-log prints logs in dependency order.

After the build finishes, though, you might want to investigate exactly how Z got built. To do that, you run redo-log Z, which prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

In this case, we can show the steps for job Q as a subtree of Z, even though Q was actually built by Y, because it's not redundant when we're not printing Y.

One more complication arises if one of Z's dependencies changes and we need to rebuild Z, but Q has not changed. If we do that, then the "honest" redo log for the incremental rebuild of Z looks like this:

  Z: redo-ifchange Q  [nothing happens]
  Z: ...Z stuff...

But depending what you're doing - for example, if you want to see if the "reproducible log" for an incremental build of your whole reproducible build project matches a from-scratch build - it might make sense to show where Q came from. This is redo-log's -u option ("recurse into unchanged targets"), which then prints this:

  Z: redo Q
  Q:   ...build Q...
  Z: ...Z stuff...
  Z: exit 0

...in other words, the exact same log as you got when you built Z the first time.

Conclusion

I'm sure almost everyone reading this thinks I'm hopelessly pedantic to care so much about the sequence of lines of output in my build logs. You're right! But you're getting off easy, because you didn't have to live through my obsessing over LED blink synchronization across a lab full of wifi routers. (Useless trivia: I found at least three bugs in openntpd by noticing the LEDs in our lab were not all blinking uniformly.)

And that, my friends, is why tree traversal algorithms are fair game in job interviews.

...

Uh, also, you should try redo. You may also want to see how I redo-ized buildroot. If you're interested, you can join the discussions on the redo-list mailing list.

2018-11-06 »