201012 - apenwarr

2010-12-10 »

Two things about programming

Nobody really knows how to do it
If you think you have a reliable system for doing it, you're probably doing the computer's job

2010-12-13 »

Everything you never wanted to know about file locking

(Foreshadowing: I found a bug in MacOS X 10.6's fcntl(F_SETLK) locking that could cause corruption of sqlite databases. To see if your system has the bug, compile and run locky.c.)

I've never previously had the (mis)fortune of writing a program that relied on file locking. Well, I've used databases and gdbm and whatnot, and they "use" file locking, but they've always hidden it behind an API, so I've never had to actually lock my own files.

I heard file locking is terribly messy and arcane and varies wildly between operating systems, even between different Unix systems; even between different versions of the same system (like Linux). After some experimentation, I can confirm: it really is that arcane, and it really is that variable between systems, and it really is untrustworthy. I'm normally a pessimist when it comes to computers, but Unix file locking APIs have actually outdone even my own pessimism: they're worse than I ever imagined.

Other than simple lockfiles, which I won't go into (but which you might just want to use from now on, after reading this article :)), there are three Unix file locking APIs: flock(), fcntl(), and lockf().

flock() locking

flock() is the simplest sort of lock. According to my Linux man page, it dates back to BSD. It is not standardized by POSIX, which means that some Unix systems (probably SysV-related ones, I guess) don't support flock().

flock() locks an entire file at a time. It supports shared locks (LOCK_SH: multiple people can have the file locked for read at the same time) and exclusive locks (LOCK_EX: only one person can make an exclusive lock on the file; shared and exclusive locks are not allowed to coexist). If you learned about concurrency in textbooks, flock() locks are "reader/writer" locks. A shared lock is a reader lock, and an exclusive lock is a writer lock.

According to the Linux man page for flock(), flock() does not work over NFS.

Upgrading from a shared flock() lock to an exclusive one is racy. If you own a shared lock, then trying to upgrade it to an exclusive lock, behind the scenes, actually involves releasing your lock and acquiring a new one. Thus, you can't be guaranteed that someone else hasn't acquired the exclusive lock, written to the file, and unlocked it before your attempt at upgrading the lock returns. Moreover, if you try to upgrade from shared to exclusive in a non-blocking way (LOCK_NB), you might lose your shared lock entirely.

Supposedly, flock() locks persist across fork() and (unlike fcntl locks, see below) won't disappear if you close unrelated files. HOWEVER, you can't depend on this, because some systems - notably earlier versions of Linux - emulated flock() using fcntl(), which has totally different semantics. If you fork() or if you open the same file more than once, you should assume the results with flock() are undefined.

fcntl() locking

POSIX standardized the fcntl(F_SETLK) style of locks, so you should theoretically find them on just about any Unix system. The Linux man page claims that they've been around since 4.3BSD and SVr4.

fcntl() locks are much more powerful than flock() locks. Like flock(), there are both shared and exclusive locks. However, unlike flock(), each lock has a byte range associated with it: different byte ranges are completely independent. One process can have an exclusive lock on byte 7, and another process can have an exclusive lock on byte 8, and several processes can have a shared lock on byte 9, and that's all okay.

Note that calling them "byte ranges" is a bit meaningless; they're really just numbers. You can lock bytes past the end of the file, for example. You could lock bytes 9-12 and that might mean "records 9-12" if you want, even if records have variable length. The meaning of a fcntl() byte range is up to whoever defines the file format you're locking.

As with all the locks discussed in this article, these byterange locks are "advisory" - that is, you can read and write the file all day long even if someone other than you has an exclusive lock. You're just not supposed to do that. A properly written program will try to acquire the lock first, at which time it will be "advised" by the kernel whether everything is good or not.

The locks are advisory, which is why the byte ranges don't have to refer to actual bytes. The person acquiring the lock can interpret it however you want.

fcntl() locks are supposedly supported by NFSv3 and higher, but different kernels do different random things with it. Some kernels just lock the file locally, and don't notify the server. Some notify the server, but do it wrong. So you probably can't trust it.

According to various pages on the web that I've seen, fcntl() locks don't work on SMB (Windows file sharing) filesystems mounted on MacOS X. I don't know if this is still true in the latest versions of MacOS; I also don't know if it's true on Linux. Note that since flock() doesn't work on NFS, and fcntl() doesn't work on SMB fs, that there is no locking method that works reliably on all remote filesystems.

It doesn't seem to be explicitly stated anywhere, but it seems that fcntl() shared locks can be upgraded atomically to fcntl() exclusive locks. That is, if you have a shared lock and you try to upgrade it to an exclusive lock, you can do that without first releasing your shared lock. If you request a non-blocking upgrade and it fails, you still have your shared lock.

(Trivia: sqlite3 uses fcntl() locks, but it never uses shared fcntl() locks. Instead, it exclusively locks a random byte inside a byterange. This is apparently because some versions of Windows don't understand shared locks. As a bonus, it also doesn't have to care whether upgrading a lock from shared to exclusive is atomic or not. Update 2010/12/13: specifically, pre-NT versions of Windows had LockFile, but not LockFileEx.)

fcntl() locks have the handy feature of being able to tell you which pid owns a lock, using F_GETLK. That's pretty cool - and potentially useful for debugging - although it might be meaningless on NFS, where the pid could be on another computer. I don't know what would happen in that case.

fcntl() locks have two very strange behaviours. The first is merely an inconvenience: unlike nearly everything else about a process, fcntl() locks are not shared across fork(). That means if you open a file, lock some byte ranges, and fork(), the child process will still have the file, but it won't own any locks. The parent will still own all the locks. This is weird, but manageable, once you know about it. It also makes sense, in a perverse sort of way: this makes sure that no two processes have an exclusive lock on the same byterange of the same file. If you think about it, exclusively locking a byte range, then doing fork(), would mean that two processes have the same exclusive lock, so it's not all that exclusive any more, is it?

Maybe you don't care about these word games, but one advantage of this absolute exclusivity guarantee is that fcntl() locks can detect deadlocks. If process A has a lock on byte 5 and tries to lock byte 6, and process B has a lock on byte 6 and tries to lock byte 5, the kernel can give you EDEADLK, which is kind of cool. If it were possible for more than one process to own the same exclusive locks, the algorithm for this would be much harder, which is probably why flock() locks can't do it.

The second strange behaviour of fcntl() locks is this: the lock doesn't belong to a file descriptor, it belongs to a (pid,inode) pair. Worse, if you close any fd referring to that inode, it releases all your locks on that inode. For example, let's say you open /etc/passwd and lock some bytes, and then call getpwent() in libc, which usually opens /etc/passwd, reads some stuff, and closes it again. That process of opening /etc/passwd and closing it - which has nothing to do with your existing fd open on /etc/passwd - will cause you to lose your locks!

That behaviour is certifiably insane, and there's no possible justification for why it should work that way. But it's how it's always worked, and POSIX standardized it, and now you're stuck with it.

An even worse example: let's say you have two sqlite databases, db1 and db2. But let's say you're being mean, and you actually make db1 a hardlink to db2, so they're actually the same inode. If you open both databases in sqlite at the same time, then close the second one, all your open sqlite locks on the first one will be lost! Oops! Except, actually, the sqlite people have already thought of this, and it does the right thing. But if you're writing your own file locking routines, beware.

(Update 2014/06/09: the primary sqlite developer emailed me to say that they recently found a problem with the technique sqlite uses. They use global variables to make sure lock objects are shared across all sqlite databases, even if you have two open on the same inode, even if that inode has more than one filename, so that they can bypass this insanity. But this mechanism is defeated if you manage to link your program with two copies of sqlite which can't see each other (eg. in two shared libraries with no sqlite symbols exported) and which access the same database. Don't do that.)

So anyway, beware of that insane behaviour. Also beware of flock(), which on some systems is implemented as a call to fcntl(), and thus inherits the same insane behaviour.

Bonus insanity feature: the struct you use to talk to fcntl() locks is called 'struct flock', even though it has nothing to do with flock(). Ha ha!

lockf() locking

lockf() is also standardized by POSIX. The Linux man page also mentions SVr4, but it doesn't mention BSD, which presumably means that some versions of BSD don't do lockf().

POSIX is also, apparently, unclear on whether lockf() locks are the same thing as fcntl() locks or not. On Linux and MacOS, they are documented to be the same thing. (In fact, lockf() is a libc call on Linux, not a system call, so we can assume it makes the same system calls as fcntl().)

The API of lockf() is a little easier than fcntl(), because you don't have to mess around with a struct. However, there is no way to query a lock to find out who owns it.

Moreover, lockf() may not be supported by pre-POSIX BSD systems, it seems, so this little bit of convenience also costs you in portability. I recommend you avoid lockf().

Interaction between different lock types

...is basically undefined. Don't use multiple types of locks - flock(), fcntl(), lockf() - on the same file.

The MacOS man pages for the three functions proudly proclaim that on MacOS (and maybe on whatever BSD MacOS is derived from), the three types of locks are handled by a unified locking implementation, so in fact, you can mix and match different lock types on the same file. That's great, but on other systems, they aren't unified, so doing so will just make your program fail strangely on other systems. It's non-portable, and furthermore, there's no reason to do it. So don't.

When you define a new file format that uses locking, be sure to document exactly which kind of locking you mean: flock(), fcntl(), or lockf(). And don't use lockf().

Mandatory locking

Mandatory locks are the ones that are self-enforcing: if I have an exclusive lock on (a section of) the file, you can't read or write it until I release the lock, even if your program doesn't understand locks. It sounds good, but it's bad.

Stay far, far away, for total insanity lies in wait.

Seriously, don't do it. Advisory locks are the only thing that makes any sense whatsoever. In any application. I mean it.

Need another reason? The docs say that mandatory locking in Linux is "unreliable." In other words, they're not as mandatory as they're documented to be. "Almost mandatory" locking? Look. Just stay away.

Still not convinced? Man, you really must like punishment. Look, imagine someone is holding a mandatory lock on a file, so you try to read() from it and get blocked. Then the other person releases their lock, and your read() finishes, but some third person reacquires the lock. You fiddle with your block, modify it, and try to write() it back, but you get held up for a bit, because the third person holding the lock isn't done yet. They do their own write() to that section of the file, and releases their lock, so your write() promptly resumes and overwrites what they just did.

What good does that do anyone? Come on. If you want locking to do you any good whatsoever, you're just going to have to acquire and release your own locks. Just do it. If you don't, you might as well not be using locks at all, because your program will be prone to horrible race conditions and they'll be extra hard to find, because mandatory locks will make it mostly seem to work right. If there's one thing I've learned about programming, it's that "mostly right" programs are way worse than "entirely wrong" programs. You don't want to be mostly right. Don't use mandatory locks.

Bonus feature: file locking in python

python has a module called "fcntl" that actually includes - or rather, seems to include - all three kinds of locks: flock(), fcntl(), and lockf(). If you like, follow along in the python source code to see how it works.

However, all is not as it seems. First of all, flock() doesn't exist on all systems, apparently. If you're on a system without flock(), python will still provide a fcntl.flock() function... by calling fcntl() for you. So you have no idea if you're actually getting fcntl() locks or flock() locks. Bad idea. Don't do it.

Next is fcntl.fcntl(). Although it pains me to say it, you can't use this one either. That's because it takes a binary data structure as a parameter. You have to create that data structure using struct.pack(), and parse it using struct.unpack(). No problem, right? Wrong. How do you know what the data structure looks like? The python fcntl module docs outright lie to you here, by providing an example of how to build the struct flock... but they just made assumptions about what it looks like. Those assumptions are definitely wrong if your system has 64-bit file offset support, which most of them do nowadays, so trying to use the example will just give an EINVAL. Moreover, POSIX doesn't guarantee that struct flock won't have other fields before/after the documented ones, or that the fields will be in a particular order, so even without 64-bit file offsets, your program is completely non-portable. And python doesn't offer any other option for generating that struct flock, so the whole function is useless. Don't do it. (You can still safely use fcntl.fcntl() for non-locking-related features, like F_SETFD.)

The only one left is fcntl.lockf(). This is the one you should use. Yeah, I know, up above I said you should avoid lockf(), because BSD systems might not have it, right? Well yeah, but that's C lockf(), not python's fcntl.lockf(). The python fcntl module documentation says of fcntl.lockf(), "This is essentially a wrapper around the fcntl() locking calls." But looking at the source, that's not quite true: in fact, it is exactly a wrapper around the fcntl() locking calls. fcntl.lockf() doesn't call C lockf() at all! It just fills in a struct flock and then calls fcntl()!

And that's exactly what you want. In short:

in C, use fcntl(), but avoid lockf(), which is not necessarily the same thing.
in python, use fcntl.lockf(), which is the same thing as fcntl() in C.

(Unfortunately, although calling fnctl.lockf() actually uses fcntl() locks, there is no way to run F_GETLK, so you can't find out which pid owns the lock.)

Bonus insanity feature: instead of using the C lockf() constants (F_LOCK, F_TLOCK, F_ULOCK, F_TEST), fcntl.lockf() actually uses the C flock() constants (LOCK_SH, LOCK_EX, LOCK_UN, LOCK_NB). There is no conceivable reason for this; it literally just takes in the wrong contants, and converts them to the right ones before calling fcntl().

So that means python gives you three locks in one! The constants from flock(), the functionality of fcntl(), and the name lockf(). Thanks, python, for making my programming world so simple and easy to unravel.

Epilogue

I learned all this while writing a program (in python, did you guess?) that uses file locking to control concurrent access to files. Naturally, I wanted to pick exactly the right kind of locks to solve my problem. Using the logic above, I settled on fcntl() locks, which in my python program means calling fcntl.lockf().

So far, so good. After several days of work - darn it, I really hate concurrent programming - I got it all working perfectly. On Linux.

Then I tried to port my program to MacOS. It's python, so porting was pretty easy - ie. I didn't change anything - but the tests failed. Huh? Digging deeper, it seems that some subprocesses were acquiring a lock, and sometime later, they just didn't own that lock anymore. I thought it might be one of the well-known fcntl() weirdnesses. Maybe I fork()ed, or maybe I opened/closed the file without realizing it. But no, it only happens when other processes are locking byteranges on the same file. It appears the MacOS X (10.6.5 in my test) kernel is missing a mutex somewhere.

I wrote a minimal test case and filed a bug with Apple. If you work at Apple, you can find my bug report as number 8760769.

Dear Apple: please fix it. As far as I know, with this bug in place, any program that uses fcntl() locks is prone to silent file corruption. That includes anything using sqlite.

Super Short Summary

don't use flock() (python: fcntl.flock()) because it's not in POSIX and it doesn't work over NFS.
don't use lockf() (python: does not exist) because it's not in BSD, and probably just wraps fcntl().
don't use fcntl() (python: fcntl.lockf()) because it doesn't work over SMB on MacOS, and actually, on MacOS it doesn't work right at all.

Oh, and by the way, none of this applies on win32.

Are we having fun yet? I guess lockfiles are the answer after all.

I bet you're really glad you read this all the way to the end.

Update 2010/12/15: Via cpirate, an interesting article by Jeremy Allison of the samba project that discusses a bit of how Unix's insane locking came to be standardized.

Update 2019/01/02: Progress! In 2015, Linux gained a new kind of fcntl() lock, F_OFD_SETLK ("Open File Description" locks), which work the way you might have expected fcntl() locks to work in the first place. Of course, for backward compatibility, they can never change how the old way worked, so they had to add a new API, which unfortunately means every program will have to choose between a new+sane API and an old+portable API. Hopefully other OSes will copy this feature eventually. You can read about it here: https://lwn.net/Articles/640404/.

Meanwhile, Windows 10 WSL ("Windows Subsystem for Linux") has come out, and it "implements" fcntl() locks by just always returning success, leading to file corruption everywhere. The above locky.c can detect this failure as well.

And, some time after this article was written, MacOS was fixed, so now fcntl() locks seem to work as expected. Yay!

2010-12-14 »

The only build system that might someday replace make...

...is djb redo.

There are only two problems. In order of increasing difficulty:

you've never heard of it.
it doesn't exist.

Well, I just solved problem #1. Progress!

Update 2019-11-24:

building containers in redo

producing multiple outputs with one redo target

Problem #2 is a lot tougher. You see, the page linked above talks about a theoretically great build system, which perhaps D. J. Bernstein (djb), maker of qmail and djbdns, has implemented part of. But he doesn't link to an implementation, probably because his own code isn't up to his (rather exacting) standards yet, or he's gotten busy with other things and hasn't finished thinking through all the details.

I would like to tell you, in a concise way, exactly why redo is literally the most amazingly groundbreaking build system since the original invention of 'make', but I don't know how. So lacking conciseness, I'll just make a few grandiose claims:

it can do everything make can do;
with no baked-in assumptions about what you're building;
with much less code;
with much greater parallelism;
with finer-grained dependencies;
with much less syntax (actually nothing but /bin/sh);
while supporting recursion and full dependency information simultaneously (no Recursive Make Considered Harmful crap);
yet build scripts are highly modular and readable;
and you can checksum your targets instead of using timestamps;
and your build scripts run linearly instead of an orderless "ruleset";
with no implicit rules required;
and implementing C header autodependencies is completely sane;
and dependency checks involve no forking or parsing so it's crazy fast;
and you can incrementally convert parts of your project;
because it can play well with other build systems;
including jobserver compatibility with make -j;
oh, and you can write a plug-compatible toy
implementation in 100 lines of shell.

Sounds awesome? You bet it is. Well, except for the not existing part.

So I wrote it

Yes. djb's design is so simple that I thought I could do it in a couple of days. Actually I did have a working version in a couple of days. And normally, I'm a big fan of "release early, release often." But when it comes to build systems, well, you know, it's kind of a crowded market. I figured I'm only going to get one chance to convince you that this is the most fabulous thing ever.

That's why I've spent more than a month - days, evenings, and weekends - of my so-called vacation¹ working on this stupid thing. It has man pages. It has a ridiculously huge README+FAQ. It has an installer. It has parallelism. It has unit tests, oh god, the unit tests. It has locking. In fact, it has so much parallelism and locking that it uncovered a race condition in MacOS X's fcntl(). It has a GNU make compatible jobserver, so that 'redo -j5' and 'make -j5' can call each other recursively and "just work." It has a few extensions that djb didn't talk about, like checksum-based dependencies. For testing, I converted a pretty huge build system (one that builds the Linux kernel and a bunch of libraries for two different target platforms), with some of my targets depending on more than 12000 files. redo can check every single fine-grained dependency in the whole project in about 4 seconds.

Version 0.01 actually delivers on every one of those grandiose claims.

As I'm writing this, it's 1574 lines of python. That's a lot smaller than the source for GNU make.

And for fun, I also wrote a super-stripped-down "forwards compatible" implementation (without support for actual dependencies; it just assumes everything is always dirty) in 100 lines of shell. That's less than 2 kbytes, and it works with any input file that works with full-sized redo (modulo any as-yet-undiscovered bugs), because djb's design is that awesome.

And you know what? I almost certainly still, after all that effort, screwed up and it probably still won't work for you. Dammit. That's why I paranoidly called it version 0.01 instead of 1.00. But please give it a try:

apenwarr's redo on github

(The super oversized README is visible at that link. Or you can read the man pages.)

Disclaimer

Because redo is currently written in python (and it does a lot of fork-exec calls), it runs a little slower than I would like. Someday rewriting it in C would make it faster - probably at least 10x faster. But it's reasonably fast already, and the increased parallelism and faster dependency checking often makes up for the slowness.

The non-optimal performance is the only significant flaw I'm aware of at the moment... but I'm sure you'll try it out and tell me otherwise, right?

You should join the redo mailing list

You can find the mailing list on Google Groups at http://groups.google.com/group/redo-list.

Of course the mailing list archives are currently empty, because this is the first time I've announced it.

It might not look like it, but yes, you can subscribe without having a Google Account. Just send a message here: redo-list+subscribe@googlegroups.com

Footnote

¹ I told the guard at the U.S./Canada border that I was "between jobs." He offered his sympathies. I said, "No, I mean, literally." He said, "Oh, normally people use that as a euphemism." Anyway, it's now painfully clear that I suck at vacations.

2010-12-22 »

git vs. professionalism

I followed a random link today that led me to the complete email archive of the discussion, following Theo de Raadt's expulsion from NetBSD in 1994, that led to the creation of OpenBSD.

Check out this quote:

I want _EVERYONE_ working on NetBSD to behave in a professional manner, in public _AND_ in private. Obviously, the latter can't be enforced, but if people complain about "unprofessional" behaviour, then it doesn't rightly matter _where_ it occurred.

-- a NetBSD core team member

The entire email chain - from both sides - is totally pointless politics. But it's educational, because it reminds us of a few things programmers have learned since then:

enforced professionalism sounds scarily draconian;
the best programmers are not the best company representatives, and expecting them to be is silly;
withholding technical stuff (CVS access) in exchange for political stuff (an apology) just results in angst and wasted time;
accepting/refusing code based on anything other than technical merits (and valid licensing, I suppose) is stupid;
when you give a "core team" special powers, they abuse those powers;
you can hold emotional power over a programmer by controlling whether their work is used or not used, and that power needs to be used responsibly.

When I say we've learned these things, maybe I'm just being optimistic. Probably there are plenty of teams with equally stupid politics going on. I hear the Java Community Process is similarly a waste of time.

What would happen if you tried to kick Linus Torvalds off the "Linux core team?" He would laugh at you. There is no core team. There is nothing to kick him off of. There is no CVS commit access to revoke. He can be as much of a bastard as he wants, and a bunch of self-appointed "project representatives" can't stop him from giving us all the awesome code he wants to.

Of course, Linus is synonymous with Linux; if anyone was going to be "kicked out", it would be anyone or everyone but him. So it's even more funny that he's the guy who wrote git, the tool that makes it impossible to block anybody. Now he can be even more of a bastard, because even if people hate him and quit, they'll still share their code with the world, including him.

As far as I know, all the BSDs still use CVS. Of course they do. Switching away from that would be like admitting their whole world view is wrong.

that is what the above is. if it has anger in it, then that is what is in it. it is an honest and open sharing of ideas and feelings. those feelings include anger, hurt, apprehension, loss. but i'm trying to find a way to forget about that past stuff, and you're not making it easy.

-- Theo de Raadt