Tasty, nutritious

...part of this complete breakfast
Everything here is my opinion. I do not speak for your employer.
December 2010
January 2011

2010-12-13 »

Everything you never wanted to know about file locking

(Foreshadowing: I found a bug in MacOS X 10.6's fcntl(F_SETLK) locking that could cause corruption of sqlite databases. To see if your system has the bug, compile and run locky.c.)

I've never previously had the (mis)fortune of writing a program that relied on file locking. Well, I've used databases and gdbm and whatnot, and they "use" file locking, but they've always hidden it behind an API, so I've never had to actually lock my own files.

I heard file locking is terribly messy and arcane and varies wildly between operating systems, even between different Unix systems; even between different versions of the same system (like Linux). After some experimentation, I can confirm: it really is that arcane, and it really is that variable between systems, and it really is untrustworthy. I'm normally a pessimist when it comes to computers, but Unix file locking APIs have actually outdone even my own pessimism: they're worse than I ever imagined.

Other than simple lockfiles, which I won't go into (but which you might just want to use from now on, after reading this article :)), there are three Unix file locking APIs: flock(), fcntl(), and lockf().

flock() locking

flock() is the simplest sort of lock. According to my Linux man page, it dates back to BSD. It is not standardized by POSIX, which means that some Unix systems (probably SysV-related ones, I guess) don't support flock().

flock() locks an entire file at a time. It supports shared locks (LOCK_SH: multiple people can have the file locked for read at the same time) and exclusive locks (LOCK_EX: only one person can make an exclusive lock on the file; shared and exclusive locks are not allowed to coexist). If you learned about concurrency in textbooks, flock() locks are "reader/writer" locks. A shared lock is a reader lock, and an exclusive lock is a writer lock.

According to the Linux man page for flock(), flock() does not work over NFS.

Upgrading from a shared flock() lock to an exclusive one is racy. If you own a shared lock, then trying to upgrade it to an exclusive lock, behind the scenes, actually involves releasing your lock and acquiring a new one. Thus, you can't be guaranteed that someone else hasn't acquired the exclusive lock, written to the file, and unlocked it before your attempt at upgrading the lock returns. Moreover, if you try to upgrade from shared to exclusive in a non-blocking way (LOCK_NB), you might lose your shared lock entirely.

Supposedly, flock() locks persist across fork() and (unlike fcntl locks, see below) won't disappear if you close unrelated files. HOWEVER, you can't depend on this, because some systems - notably earlier versions of Linux - emulated flock() using fcntl(), which has totally different semantics. If you fork() or if you open the same file more than once, you should assume the results with flock() are undefined.

fcntl() locking

POSIX standardized the fcntl(F_SETLK) style of locks, so you should theoretically find them on just about any Unix system. The Linux man page claims that they've been around since 4.3BSD and SVr4.

fcntl() locks are much more powerful than flock() locks. Like flock(), there are both shared and exclusive locks. However, unlike flock(), each lock has a byte range associated with it: different byte ranges are completely independent. One process can have an exclusive lock on byte 7, and another process can have an exclusive lock on byte 8, and several processes can have a shared lock on byte 9, and that's all okay.

Note that calling them "byte ranges" is a bit meaningless; they're really just numbers. You can lock bytes past the end of the file, for example. You could lock bytes 9-12 and that might mean "records 9-12" if you want, even if records have variable length. The meaning of a fcntl() byte range is up to whoever defines the file format you're locking.

As with all the locks discussed in this article, these byterange locks are "advisory" - that is, you can read and write the file all day long even if someone other than you has an exclusive lock. You're just not supposed to do that. A properly written program will try to acquire the lock first, at which time it will be "advised" by the kernel whether everything is good or not.

The locks are advisory, which is why the byte ranges don't have to refer to actual bytes. The person acquiring the lock can interpret it however you want.

fcntl() locks are supposedly supported by NFSv3 and higher, but different kernels do different random things with it. Some kernels just lock the file locally, and don't notify the server. Some notify the server, but do it wrong. So you probably can't trust it.

According to various pages on the web that I've seen, fcntl() locks don't work on SMB (Windows file sharing) filesystems mounted on MacOS X. I don't know if this is still true in the latest versions of MacOS; I also don't know if it's true on Linux. Note that since flock() doesn't work on NFS, and fcntl() doesn't work on SMB fs, that there is no locking method that works reliably on all remote filesystems.

It doesn't seem to be explicitly stated anywhere, but it seems that fcntl() shared locks can be upgraded atomically to fcntl() exclusive locks. That is, if you have a shared lock and you try to upgrade it to an exclusive lock, you can do that without first releasing your shared lock. If you request a non-blocking upgrade and it fails, you still have your shared lock.

(Trivia: sqlite3 uses fcntl() locks, but it never uses shared fcntl() locks. Instead, it exclusively locks a random byte inside a byterange. This is apparently because some versions of Windows don't understand shared locks. As a bonus, it also doesn't have to care whether upgrading a lock from shared to exclusive is atomic or not. Update 2010/12/13: specifically, pre-NT versions of Windows had LockFile, but not LockFileEx.)

fcntl() locks have the handy feature of being able to tell you which pid owns a lock, using F_GETLK. That's pretty cool - and potentially useful for debugging - although it might be meaningless on NFS, where the pid could be on another computer. I don't know what would happen in that case.

fcntl() locks have two very strange behaviours. The first is merely an inconvenience: unlike nearly everything else about a process, fcntl() locks are not shared across fork(). That means if you open a file, lock some byte ranges, and fork(), the child process will still have the file, but it won't own any locks. The parent will still own all the locks. This is weird, but manageable, once you know about it. It also makes sense, in a perverse sort of way: this makes sure that no two processes have an exclusive lock on the same byterange of the same file. If you think about it, exclusively locking a byte range, then doing fork(), would mean that two processes have the same exclusive lock, so it's not all that exclusive any more, is it?

Maybe you don't care about these word games, but one advantage of this absolute exclusivity guarantee is that fcntl() locks can detect deadlocks. If process A has a lock on byte 5 and tries to lock byte 6, and process B has a lock on byte 6 and tries to lock byte 5, the kernel can give you EDEADLK, which is kind of cool. If it were possible for more than one process to own the same exclusive locks, the algorithm for this would be much harder, which is probably why flock() locks can't do it.

The second strange behaviour of fcntl() locks is this: the lock doesn't belong to a file descriptor, it belongs to a (pid,inode) pair. Worse, if you close any fd referring to that inode, it releases all your locks on that inode. For example, let's say you open /etc/passwd and lock some bytes, and then call getpwent() in libc, which usually opens /etc/passwd, reads some stuff, and closes it again. That process of opening /etc/passwd and closing it - which has nothing to do with your existing fd open on /etc/passwd - will cause you to lose your locks!

That behaviour is certifiably insane, and there's no possible justification for why it should work that way. But it's how it's always worked, and POSIX standardized it, and now you're stuck with it.

An even worse example: let's say you have two sqlite databases, db1 and db2. But let's say you're being mean, and you actually make db1 a hardlink to db2, so they're actually the same inode. If you open both databases in sqlite at the same time, then close the second one, all your open sqlite locks on the first one will be lost! Oops! Except, actually, the sqlite people have already thought of this, and it does the right thing. But if you're writing your own file locking routines, beware.

(Update 2014/06/09: the primary sqlite developer emailed me to say that they recently found a problem with the technique sqlite uses. They use global variables to make sure lock objects are shared across all sqlite databases, even if you have two open on the same inode, even if that inode has more than one filename, so that they can bypass this insanity. But this mechanism is defeated if you manage to link your program with two copies of sqlite which can't see each other (eg. in two shared libraries with no sqlite symbols exported) and which access the same database. Don't do that.)

So anyway, beware of that insane behaviour. Also beware of flock(), which on some systems is implemented as a call to fcntl(), and thus inherits the same insane behaviour.

Bonus insanity feature: the struct you use to talk to fcntl() locks is called 'struct flock', even though it has nothing to do with flock(). Ha ha!

lockf() locking

lockf() is also standardized by POSIX. The Linux man page also mentions SVr4, but it doesn't mention BSD, which presumably means that some versions of BSD don't do lockf().

POSIX is also, apparently, unclear on whether lockf() locks are the same thing as fcntl() locks or not. On Linux and MacOS, they are documented to be the same thing. (In fact, lockf() is a libc call on Linux, not a system call, so we can assume it makes the same system calls as fcntl().)

The API of lockf() is a little easier than fcntl(), because you don't have to mess around with a struct. However, there is no way to query a lock to find out who owns it.

Moreover, lockf() may not be supported by pre-POSIX BSD systems, it seems, so this little bit of convenience also costs you in portability. I recommend you avoid lockf().

Interaction between different lock types

...is basically undefined. Don't use multiple types of locks - flock(), fcntl(), lockf() - on the same file.

The MacOS man pages for the three functions proudly proclaim that on MacOS (and maybe on whatever BSD MacOS is derived from), the three types of locks are handled by a unified locking implementation, so in fact, you can mix and match different lock types on the same file. That's great, but on other systems, they aren't unified, so doing so will just make your program fail strangely on other systems. It's non-portable, and furthermore, there's no reason to do it. So don't.

When you define a new file format that uses locking, be sure to document exactly which kind of locking you mean: flock(), fcntl(), or lockf(). And don't use lockf().

Mandatory locking

Mandatory locks are the ones that are self-enforcing: if I have an exclusive lock on (a section of) the file, you can't read or write it until I release the lock, even if your program doesn't understand locks. It sounds good, but it's bad.

Stay far, far away, for total insanity lies in wait.

Seriously, don't do it. Advisory locks are the only thing that makes any sense whatsoever. In any application. I mean it.

Need another reason? The docs say that mandatory locking in Linux is "unreliable." In other words, they're not as mandatory as they're documented to be. "Almost mandatory" locking? Look. Just stay away.

Still not convinced? Man, you really must like punishment. Look, imagine someone is holding a mandatory lock on a file, so you try to read() from it and get blocked. Then the other person releases their lock, and your read() finishes, but some third person reacquires the lock. You fiddle with your block, modify it, and try to write() it back, but you get held up for a bit, because the third person holding the lock isn't done yet. They do their own write() to that section of the file, and releases their lock, so your write() promptly resumes and overwrites what they just did.

What good does that do anyone? Come on. If you want locking to do you any good whatsoever, you're just going to have to acquire and release your own locks. Just do it. If you don't, you might as well not be using locks at all, because your program will be prone to horrible race conditions and they'll be extra hard to find, because mandatory locks will make it mostly seem to work right. If there's one thing I've learned about programming, it's that "mostly right" programs are way worse than "entirely wrong" programs. You don't want to be mostly right. Don't use mandatory locks.

Bonus feature: file locking in python

python has a module called "fcntl" that actually includes - or rather, seems to include - all three kinds of locks: flock(), fcntl(), and lockf(). If you like, follow along in the python source code to see how it works.

However, all is not as it seems. First of all, flock() doesn't exist on all systems, apparently. If you're on a system without flock(), python will still provide a fcntl.flock() function... by calling fcntl() for you. So you have no idea if you're actually getting fcntl() locks or flock() locks. Bad idea. Don't do it.

Next is fcntl.fcntl(). Although it pains me to say it, you can't use this one either. That's because it takes a binary data structure as a parameter. You have to create that data structure using struct.pack(), and parse it using struct.unpack(). No problem, right? Wrong. How do you know what the data structure looks like? The python fcntl module docs outright lie to you here, by providing an example of how to build the struct flock... but they just made assumptions about what it looks like. Those assumptions are definitely wrong if your system has 64-bit file offset support, which most of them do nowadays, so trying to use the example will just give an EINVAL. Moreover, POSIX doesn't guarantee that struct flock won't have other fields before/after the documented ones, or that the fields will be in a particular order, so even without 64-bit file offsets, your program is completely non-portable. And python doesn't offer any other option for generating that struct flock, so the whole function is useless. Don't do it. (You can still safely use fcntl.fcntl() for non-locking-related features, like F_SETFD.)

The only one left is fcntl.lockf(). This is the one you should use. Yeah, I know, up above I said you should avoid lockf(), because BSD systems might not have it, right? Well yeah, but that's C lockf(), not python's fcntl.lockf(). The python fcntl module documentation says of fcntl.lockf(), "This is essentially a wrapper around the fcntl() locking calls." But looking at the source, that's not quite true: in fact, it is exactly a wrapper around the fcntl() locking calls. fcntl.lockf() doesn't call C lockf() at all! It just fills in a struct flock and then calls fcntl()!

And that's exactly what you want. In short:

  • in C, use fcntl(), but avoid lockf(), which is not necessarily the same thing.
  • in python, use fcntl.lockf(), which is the same thing as fcntl() in C.

(Unfortunately, although calling fnctl.lockf() actually uses fcntl() locks, there is no way to run F_GETLK, so you can't find out which pid owns the lock.)

Bonus insanity feature: instead of using the C lockf() constants (F_LOCK, F_TLOCK, F_ULOCK, F_TEST), fcntl.lockf() actually uses the C flock() constants (LOCK_SH, LOCK_EX, LOCK_UN, LOCK_NB). There is no conceivable reason for this; it literally just takes in the wrong contants, and converts them to the right ones before calling fcntl().

So that means python gives you three locks in one! The constants from flock(), the functionality of fcntl(), and the name lockf(). Thanks, python, for making my programming world so simple and easy to unravel.

Epilogue

I learned all this while writing a program (in python, did you guess?) that uses file locking to control concurrent access to files. Naturally, I wanted to pick exactly the right kind of locks to solve my problem. Using the logic above, I settled on fcntl() locks, which in my python program means calling fcntl.lockf().

So far, so good. After several days of work - darn it, I really hate concurrent programming - I got it all working perfectly. On Linux.

Then I tried to port my program to MacOS. It's python, so porting was pretty easy - ie. I didn't change anything - but the tests failed. Huh? Digging deeper, it seems that some subprocesses were acquiring a lock, and sometime later, they just didn't own that lock anymore. I thought it might be one of the well-known fcntl() weirdnesses. Maybe I fork()ed, or maybe I opened/closed the file without realizing it. But no, it only happens when other processes are locking byteranges on the same file. It appears the MacOS X (10.6.5 in my test) kernel is missing a mutex somewhere.

I wrote a minimal test case and filed a bug with Apple. If you work at Apple, you can find my bug report as number 8760769.

Dear Apple: please fix it. As far as I know, with this bug in place, any program that uses fcntl() locks is prone to silent file corruption. That includes anything using sqlite.

Super Short Summary

  • don't use flock() (python: fcntl.flock()) because it's not in POSIX and it doesn't work over NFS.
  • don't use lockf() (python: does not exist) because it's not in BSD, and probably just wraps fcntl().
  • don't use fcntl() (python: fcntl.lockf()) because it doesn't work over SMB on MacOS, and actually, on MacOS it doesn't work right at all.

Oh, and by the way, none of this applies on win32.

Are we having fun yet? I guess lockfiles are the answer after all.

I bet you're really glad you read this all the way to the end.

Update 2010/12/15: Via cpirate, an interesting article by Jeremy Allison of the samba project that discusses a bit of how Unix's insane locking came to be standardized.

Update 2019/01/02: Progress! In 2015, Linux gained a new kind of fcntl() lock, F_OFD_SETLK ("Open File Description" locks), which work the way you might have expected fcntl() locks to work in the first place. Of course, for backward compatibility, they can never change how the old way worked, so they had to add a new API, which unfortunately means every program will have to choose between a new+sane API and an old+portable API. Hopefully other OSes will copy this feature eventually. You can read about it here: https://lwn.net/Articles/640404/.

Meanwhile, Windows 10 WSL ("Windows Subsystem for Linux") has come out, and it "implements" fcntl() locks by just always returning success, leading to file corruption everywhere. The above locky.c can detect this failure as well.

And, some time after this article was written, MacOS was fixed, so now fcntl() locks seem to work as expected. Yay!

I'm CEO at Tailscale, where we make network problems disappear.

Why would you follow me on twitter? Use RSS.

apenwarr on gmail.com