Misadventures in process containment

I've been working on a series of tutorials using redo for various use cases. (One of the most common user requests is more examples of how to solve real-world problems with redo. The problem with extremely flexible tools is that it can be hard for people to figure out how to start.)

The most recent in the series is a tutorial on building docker and kvm containers from scratch using redo. I think it turned out pretty well. Maybe the tutorial deserves a disclaimer, though: is this really something you should do?

Good question. I don't know.

You see, there are a few "standard" ways to build containers nowadays, most commonly using Dockerfiles and the docker build command. It's not at all obvious that we need another way to do it. Then again, it's not obvious that we don't.

Wait, let's back up a bit. Where are we and how did we get here?

The idea of isolated chroot-based containers has been around for a very long time. In my first startup, we even had a commercial version of the concept as early as 2005, first implemented by Patrick Patterson, called Nitix Virtual Server (NVS). The original idea was to take our very-locked-down Linux-based server appliance and let you install apps on it, without removing the (very useful for security and reliability) locked-down-ness of the operating system. Nitix was a very stripped down, minimal Linux install, but NVS was a full install of a CentOS-based system in userspace, so you could do basically anything that Linux could do. (You might recognize the same concepts in ChromeOS's Crostini.) We could start, stop, and snapshot the "inner" NVS operating system and make backups. And most importantly, if you were an app developer, you could take one of these pre-installed and customized snapshots, package it up, and sell it to your customers. Hopefully with one of our appliances!

Eventually one of these packaged apps became appealing enough that the app maker, who was much larger than us, decided to acquire our company, then (as often happens) accidentally killed it with love, and that was sadly the end of that branch of the container evolutionary tree.

Our containers were installed on physical appliance hardware - another branch of evolution that seems to have died. Nobody believes anymore that you can have a zero-maintenance, self configuring, virus free appliance server that runs on your office network without needing a professional sysadmin. Come to think of it, most people didn't believe us back then either, at least not until after seeing a demo. But whatever, the product doesn't exist anymore, so nowadays they're right.

In any case, the modern solution to this is for everybody to host everything in the Cloud. The Cloud has its own problems, but at least those problems are fairly well understood, and most importantly, you can pay by the minute for a crack team of experts, the ones who own the servers, to fix problems for you. For most people, this works pretty well.

But back to containers. The way we made them, long ago, was a bit ad-hoc: a person installed a fresh NVS, then installed the app, then wrote a few scripts to take system configuration data (user accounts, remember when we used those? and hostnames, IP addresses, and so on) and put them in the right app-specific config files. Then they'd grab a snapshot of the whole NVS and distribute it. Making new versions of an app container involved either making additional tweaks to the live image (a little risky) and re-snapshotting, or having a human start over from scratch and re-run all the manual steps.

Those were simpler times.

Nowadays, people care a lot more about automated builds and automated testing than they did back in 2005, and this is a big improvement. They also collaborate a lot more. Docker containers share almost the same basic concepts: take a base filesystem, do some stuff to it, take another snapshot, share the snapshot. But it's more automated, and there are better ways to automate the "do some stuff" part. And each step is a "layer", and you can share your layers, so that one person can publish a base OS install, another person can install the Go compiler, another person can build and install their app, and another person can customize the app configuration, and all those people can work at different companies or live in different countries.

As a sign of how out of touch I am with the young'uns, I would never have thought you could trust a random unauthenticated person on the Internet to provide a big binary image for the OS platform your company uses to distribute its app. And maybe you can't. But people do, and surprisingly it almost never results in a horrible, widespread security exploit. I guess most people are surprisingly non-evil.

Anyway, one thing that bothered me a lot, both in the old NVS days and with today's Dockerfiles, was all the extra crud that ends up in your image when you install it this way. It's a whole operating system! For example, I looked at what might be the official Dockerfile for building a MySQL server image (although I'm not sure how one defines "official") and it involves installing a whole C++ compiler toolchain, then copying in the source code and building and installing the binary. The resulting end-user container still has all that stuff in it, soaking up disk space and download time and potentially adding security holes and definitely adding a complete lack of auditability.

I realize nobody cares. I care, though, because I'm weird and I care about boring things, and then I write about them.

Anyway, there are at least two other ways to do it. One way endorsed by Dockerfiles is to "skip" intermediate layers: after building your package, uninstall the compiler and extra crap, and then don't package the "install compiler" and "install source code" layers at all. Just package one layer for the basic operating system, and one more for all the diffs between the operating system and your final product. I call this the "blacklist" approach: you're explicitly excluding things you don't want from your image. Dockerfiles make this approach relatively easy, once you get the hang of it.

A more obsessive approach is a "whitelist": only include the exact files you want to include. The trick here is to first construct the things you want in your final container, and then at the end, to copy only the interesting things into a new, fresh, empty container. Docker doesn't really make this harder than anything else, but Docker doesn't really help here, either. The problem is that Docker fundamentally runs on the concept of "dive into the container, execute some commands, make a snapshot" and we don't even have a container to start with.

So that's the direction I went with my redo tutorial; I built some scripts that actually construct a complete, multi-layered container image without using Docker at all. (To Docker's credit, this is pretty easy, because their container format is simple and pretty well-defined.) Using those scripts, it's easy to just copy some files into a subdirectory, poke around to add in the right supporting files (like libc), and then package it up into a file that can be loaded into docker and executed. As a bonus, we can do all this without being the 'root' user, having any docker permissions, or worrying about cluttering the local docker container cache.
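To make the "whitelist" idea concrete, here is a minimal Python sketch of the core step: packing only an explicit list of files into a layer tarball and hashing it. This is not the code from the redo tutorial; the function name build_layer and the example paths are made up for illustration, and a loadable image would also need the manifest and config metadata around the layer, which this sketch leaves out.

```python
import hashlib
import io
import os
import tarfile


def build_layer(root, whitelist):
    """Pack only the whitelisted paths under root into an uncompressed tar.

    Returns (tar_bytes, sha256_hex). Ownership and timestamps are
    normalized so that the same inputs always give the same digest.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for rel in sorted(whitelist):
            path = os.path.join(root, rel)
            info = tar.gettarinfo(path, arcname=rel)
            # Pretend everything belongs to root and was made at time 0;
            # no real root privileges are needed to write these fields.
            info.uid = info.gid = 0
            info.uname = info.gname = "root"
            info.mtime = 0
            if info.isreg():
                with open(path, "rb") as f:
                    tar.addfile(info, f)
            else:
                tar.addfile(info)  # directories, symlinks, etc.
    data = buf.getvalue()
    return data, hashlib.sha256(data).hexdigest()
```

Something like build_layer("stage", ["bin/app", "lib/libc.so.6"]) would give a layer containing those two files and nothing else, no matter how much compiler and source code is lying around elsewhere under the (hypothetical) stage directory.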

I like it, because I'm that kind of person. And it was a fun exercise. But I'm probably living in the past; nobody cares anymore if it takes a few gigabytes to distribute an app that should be a few megabytes at most. Gigabytes are cheap.

Side note: incremental image downloads

While we're here, I would like to complain about how people distribute incremental changes to containers. Basically, the idea is to build your containers in layers, so that most of the time, you're only replacing the topmost layers (ie. your app binaries) and not the bottommost layers (ie. the OS). And you can share, say, the OS layer across multiple containers, so that if you're deploying many containers to a single machine, it only has to download the OS once.

This is generally okay, but I'm a bit offended¹ that if I rebuild the OS with only a few changes - eg. a couple of Debian packages updated to fix a security hole - then the client has to re-download the whole container. First of all, appropriate use of rsync could make this go a lot more smoothly.

But secondly, I already invented a solution, eight years ago, and open sourced it, and then promptly failed to document or advertise it so that (of course) nobody knew it exists. Oops.

The solution is something I call bupdate (a mix of "bup" and "update"), a little-known branch of my bup incremental backup software, which I've written about previously.

Unlike bup, the bupdate client works with any dumb (static files only) http server. bupdate takes any group of files - in this case, tarballs, .iso images, or VM disk images - runs the bupsplit algorithm to divide them into chunks, and writes their file offsets to files ending in .fidx (file index, similar to git's .idx packfile indexes and bup's .midx multi-pack indexes), which you then publish along with the original files. The client downloads the .fidx files, generates its own index of all the local files it already has lying around (eg. old containers, in this case), and constructs exact replicas of the new files out of the old chunks and any necessary newly-downloaded chunks. It requests the new chunks using a series of simple HTTP byterange requests from the image files sitting on the server.
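The flow described above can be sketched in a few lines of Python. To be clear about what is assumed: this is not the real bupsplit algorithm or the real .fidx format. The rolling checksum, the WINDOW and MASK values, the use of SHA-256, and every function name here are stand-ins chosen for illustration. The shape is the same, though: split on content-defined boundaries, publish (offset, length, hash) per chunk, and fetch only the chunks the client cannot find locally.

```python
import hashlib

WINDOW = 64    # bytes covered by the rolling checksum (assumed value)
MASK = 0x3FF   # split when the low 10 bits are all ones: ~1 KB chunks


def split_chunks(data):
    """Yield (offset, length) for content-defined chunks of data."""
    s1 = s2 = 0
    start = 0
    for i, byte in enumerate(data):
        drop = data[i - WINDOW] if i >= WINDOW else 0
        s1 += byte - drop          # plain sum of the last WINDOW bytes
        s2 += s1 - WINDOW * drop   # position-weighted sum of the same bytes
        if (s2 & MASK) == MASK:    # boundary depends only on nearby content
            yield start, i + 1 - start
            start = i + 1
    if start < len(data):
        yield start, len(data) - start


def make_index(data):
    """Stand-in for a .fidx file: one (offset, length, digest) per chunk."""
    return [(off, n, hashlib.sha256(data[off:off + n]).hexdigest())
            for off, n in split_chunks(data)]


def range_header(offset, length):
    """Value of the HTTP Range header for one chunk (end is inclusive)."""
    return "bytes=%d-%d" % (offset, offset + length - 1)


def rebuild(new_index, local_blobs, fetch):
    """Reassemble the new file from local chunks plus fetched ones.

    fetch(offset, length) must return that byte range of the new file,
    for example by sending range_header(offset, length) to the server.
    """
    have = {}
    for blob in local_blobs:
        for off, n, digest in make_index(blob):
            have[digest] = blob[off:off + n]
    out = []
    for off, n, digest in new_index:
        chunk = have.get(digest)
        if chunk is None:
            chunk = fetch(off, n)
        out.append(chunk)
    return b"".join(out)
```

Because the chunk boundaries come from the content rather than from fixed offsets, inserting a few bytes in the middle of a big file only changes the chunk or two around the insertion; everything after it lines up again and is found in the local copy.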

It's pretty neat. There's even an NSIS plugin so that you can have NSIS do the download and reassembly for you when installing a big blob on Windows (which I implemented for one of our clients at one point), like for updating big video game WAD files.

(By the way, this same technique would help a lot with, say, apt-get update's process for retrieving its Packages files. All we'd need to do is upload a Packages.fidx alongside the Packages file itself, and a new client which understood bupdate could use that to retrieve only the parts of the Packages file that have changed since last time. This could reduce incremental Packages downloads from several megabytes to tens of kilobytes. Old clients would ignore the .fidx and just download the whole Packages file as before.)

bupdate is pretty old (8 years now!), and relies on an even older C++ library that probably doesn't work with modern compilers, but it wouldn't be too hard to rejuvenate. Somebody really ought to start using it for updating container layers or frequently-updated large lists. Contact me if you think this might be useful to you, and maybe I'll find time to bring bupdate back to life.

gzip --rsyncable

On that note, if you haven't heard of it already, you really should know about gzip --rsyncable.

It's widely known that gzip'd files don't work well with rsync, because if even one byte changes near the beginning of the file, that'll change the compression for the entire rest of the file, so you have to re-download the whole thing. And bupdate, which is, at its core, really just a one-sided rsync, suffers from the same problem.

But gzip with --rsyncable is different. It carefully changes the compression rules so the dictionary is flushed periodically, such that if you change a few bytes early on, it'll only disrupt the next few kilobytes rather than the entire rest of the file. If you compress your files with --rsyncable, then bupdate will work a lot better.
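Here is a small Python sketch of the underlying trick, using zlib's Z_FULL_FLUSH to reset the compressor at regular points. It is a simplification, not gzip's implementation: as far as I understand it, gzip --rsyncable picks its resync points from the content itself, while this sketch just flushes every block_size input bytes, so it only survives in-place changes, not insertions. The function name and block size are invented for the example.

```python
import zlib


def compress_with_resync(data, block_size=4096):
    """Raw-deflate data, fully flushing after every block_size input bytes.

    Returns the compressed output as a list of segments, one per input
    block plus a final terminator, so callers can see which parts of
    the output a change to the input actually touched.
    """
    # wbits=-15 selects raw deflate: no header, and no trailing checksum
    # that would differ whenever any input byte differs.
    comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    segments = []
    for start in range(0, len(data), block_size):
        piece = comp.compress(data[start:start + block_size])
        # A full flush byte-aligns the output and forgets all history,
        # so later blocks cannot refer back to earlier ones.
        piece += comp.flush(zlib.Z_FULL_FLUSH)
        segments.append(piece)
    segments.append(comp.flush())  # Z_FINISH: close the deflate stream
    return segments
```

Compress a file and a copy with one byte changed near the start, and only the first segment should differ; the rest can be reused as-is by rsync or a bupdate-style client. Joining the segments gives an ordinary raw deflate stream that zlib.decompress(joined, -15) can read. The cost is a slightly worse compression ratio, since each block starts with an empty dictionary.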

Alternatively, if you're using a web server that supports on-the-fly compression, you can serve an uncompressed file and let the web server compress the blocks you're requesting. This will be more byte-efficient than gzip --rsyncable (since you don't have to download the entire block, up to the next resync point), but costs more CPU time on the server. Nowadays, CPU time is pretty cheap and gzip is pretty fast, so that might be a good tradeoff.

Footnote

¹ When I say I'm offended by the process used to update containers, it's not so much that I'm offended by people failing to adopt my idea - which, to be fair, I neglected to tell anyone about. Mostly I'm offended that nobody else managed to invent a better idea than bupdate, or even a comparably good idea²,³, in the intervening 8 years. Truly, there is no point worrying about people stealing my ideas. Rather the opposite.

² Edit 2019-01-13: Eric Anderson pointed me to casync, which is something like bup and bupdate, and references bup as one of its influences. So I guess someone did invent at least a "comparably good idea." I think the bupdate file format is slightly cuter, since it sits alongside and reuses the original static files, which allows for cheap backward compatibility with plain downloads or rsyncs. But I'm biased.

³ Edit 2019-01-13: JNRowe points out a program called zsync, which sounds very similar to bupdate and shares the same goal of not disturbing your original file set. In fact, it's even more clever, because you can publish a zsync index on one web site that refers to chunks on another web site, allowing you to encourage zsync use even if the upstream maintainer doesn't play along. And it can look inside .gz files even if you don't use gzip --rsyncable! Maybe use that instead of bupdate. (Disclaimer: I haven't tried it yet.)

2019-01-11