runlock

My recent posting of some "controversial" source code seems to have piqued people's interest, what with its coverage on YCombinator and Reddit. In the interests of calming things down a bit, here's some hopefully non-controversial code that I've found very useful.

runlock is a simple perl script that creates and locks a lockfile, then runs whatever command line you give it. If the lockfile is already locked, it doesn't run the program and exits immediately. We can use this in a frequently-running cron job, for example, to ensure that if the job occasionally takes a long time to run, we don't accidentally cause a backlog by starting it over and over again.

Sounds simple? Well, it's short, but the details are pretty tricky. Let's go through the source code and look at some of its interesting parts.

    #!/usr/bin/perl -w
    use strict;
    use LockFile::Simple;

You probably want to know about the LockFile::Simple perl module. It does pretty much everything you want to do with lockfiles. Unfortunately, its defaults are insane and will get you into a ton of trouble if you use them. It's pretty obvious that the author of this module has learned a lot over time.

    if (@ARGV < 2) {
        print STDERR "Usage: $0 <lockfile> <command line...>\n";
        exit 127;
    }

Above we check to make sure the argument list is okay. Nothing too special here, except for one thing: we return 127 in case of an error, because the more common error codes might be returned by the subprogram we're running. runlock is intended to be used whenever you might normally run the given program directly, so it's important not to eat the return code of the subprogram.

    my $lm = LockFile::Simple->make(-stale=>1, -hold=>0) 
        or die("makelock: $!\n");

Here's the first tricky bit: the correct options to LockFile::Simple. "-stale=>1" means that we should "automatically detect stale locks." Now, this sounds like it's obviously a good thing, but is for some reason not the default.

The way this sort of lockfile works is that you use a set of atomic operations to write your pid (process id) to the lockfile. Then, other programs that want to check if the lock is valid first check if the file exists, then open it and read the pid, then "kill -0 $pid" (send a no-op signal to the process) to see if it's still running. If the process is dead, they delete the lockfile and try to create a new one.
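
To make that concrete, here's a minimal sketch of the pid-file dance in plain perl. It is not LockFile::Simple's actual code. The function names (create_pid_lock, read_lock_pid, pid_alive, lock_is_stale) are made up for illustration, and a real library has to be more careful than this about the window between creating the file and writing the pid into it.

```perl
use strict;
use warnings;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Hypothetical helper: create the lockfile atomically. O_EXCL makes
# sysopen fail if the file already exists. Returns 1 on success, 0 if not.
sub create_pid_lock {
    my ($path) = @_;
    sysopen(my $fh, $path, O_WRONLY | O_CREAT | O_EXCL, 0644) or return 0;
    print {$fh} "$$\n";          # record our own pid
    close($fh) or return 0;
    return 1;
}

# Hypothetical helper: read the pid stored in the lockfile, or undef
# if the file is missing, empty, or doesn't start with a number.
sub read_lock_pid {
    my ($path) = @_;
    open(my $fh, '<', $path) or return undef;
    my $line = <$fh>;
    close($fh);
    return undef unless defined($line) && $line =~ /^(\d+)/;
    return $1;
}

# The "kill -0" test: signal 0 delivers nothing, it only checks whether
# the process exists. EPERM means it exists but belongs to someone else.
sub pid_alive {
    my ($pid) = @_;
    return 1 if kill(0, $pid);
    return $!{EPERM} ? 1 : 0;
}

# A lock counts as stale only when its recorded owner is gone. If we
# can't read a pid at all, we play it safe and say "not stale".
sub lock_is_stale {
    my ($path) = @_;
    my $pid = read_lock_pid($path);
    return 0 unless defined($pid);
    return pid_alive($pid) ? 0 : 1;
}
```

A caller would try create_pid_lock() first and, if that fails, check lock_is_stale() before unlinking the file and trying again. That unlink-and-retry step is itself racy if two processes do it at the same moment, which is one more reason to let a library handle it.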

If you don't enable "-stale=>1", LockFile::Simple will just fail to take the lock whenever the file exists at all. This means your system will require manual intervention if the locking process ever dies suddenly (e.g. by "kill -9" or if your system crashes), which is no fun.

The next option, "-hold=>0", disables an extremely evil trojan-horse option that is enabled automatically when you set "-stale=>1". The "-hold" option sets the maximum time a lock can be held before being considered stale. The default is 3600 seconds (one hour). Now, this sounds like it might be a useful feature: after all, you don't want to let a lock file just hang around forever, right?

No! No! It's a terrible idea! If the "kill -0 $pid" test works, then you know the guy who created the lock is still around. Why on earth would you then consider it stale, forcibly remove the lock, and start doing your own thing? That's a course that's pretty much guaranteed to get you into trouble, if you consider that you've probably created the lockfile for a reason.

So we set "-hold=>0" to disable this amazing feature. The only way we want to break a stale lock is if its $pid is dead, and in that case, we can happily break the lock immediately, not after an arbitrary time limit.

    my $filename = shift @ARGV;
    my $lock = $lm->trylock($filename);
    if (defined($lock)) {

Instead of using $lm->lock(), we use $lm->trylock(), because we want to exit right away if the file is already locked. We could have waited for the lock instead using $lm->lock(), but that isn't what runlock is for; in the above cronjob example, you'd then end up enqueuing the job to run over and over, when (in the case of cron) once is usually enough.
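
The same try-versus-wait distinction shows up in other locking schemes too. Here's a small sketch using perl's built-in flock() instead of LockFile::Simple. That's a different mechanism from pid lockfiles, so treat it only as an illustration of the two behaviours. The helper names try_flock and wait_flock are invented for this example.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Non-blocking attempt, like trylock(): returns the open handle if we
# got the lock, or undef right away if someone else is holding it.
sub try_flock {
    my ($path) = @_;
    open(my $fh, '>>', $path) or die("open $path: $!\n");
    return flock($fh, LOCK_EX | LOCK_NB) ? $fh : undef;
}

# Blocking attempt, like lock(): sleeps inside flock() until the
# current holder lets go, then returns the open handle.
sub wait_flock {
    my ($path) = @_;
    open(my $fh, '>>', $path) or die("open $path: $!\n");
    flock($fh, LOCK_EX) or die("flock $path: $!\n");
    return $fh;
}
```

With flock() the lock lasts only as long as the returned handle stays open, so the caller has to keep it around for the duration of the job. A cron job built on wait_flock would pile up waiting processes, which is exactly the backlog runlock is trying to avoid.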

        my $pid = fork();
        defined($pid) or die("fork: $!\n");
        if ($pid) {
            # parent
            local $SIG{INT} = sub { kill 2, $pid; };
            local $SIG{TERM} = sub { kill 15, $pid; };
            my $newpid = waitpid($pid, 0);
            if ($newpid != $pid) {
                die("waitpid returned '$newpid', expected '$pid'\n");
            }        
            my $ret = $?;
            $lock->release;
            exit $ret >> 8;
        } else {
            # child
            exec(@ARGV) or die("exec: $!\n");
        }

        # NOTREACHED
    }

The above is the part where we run the subprocess, wait for it to finish, and then unlock the lockfile.

Why is it so complicated? Can't we just use system(@ARGV) and be done with it? (Perl has a multi-argument version of system() that isn't insecure, unlike in C.)

Unfortunately not. The problem is signal handling. If someone kills the runlock program, we need to guarantee that the subprocess is killed correctly, and we can't do that unless we know the subprocess's pid. The only way to get the pid is to call fork() yourself, with all the mess that entails. We then have to capture the appropriate signals and pass them along when we receive them.
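
One related wrinkle: when a forwarded signal kills the child, waitpid reports that in the low bits of $? rather than in the exit code, so "$? >> 8" comes out as 0. Here's a hedged sketch of a helper that folds both cases into one number. The name exit_code_from_status is made up, and the "128 + signal" mapping is an assumption borrowed from the convention most shells use, not something runlock itself does.

```perl
use strict;
use warnings;

# Turn a wait status (the value perl leaves in $?) into a single exit code.
# Assumption: a child killed by signal N is reported as 128 + N.
sub exit_code_from_status {
    my ($status) = @_;
    my $signal = $status & 127;    # low 7 bits: terminating signal, if any
    return 128 + $signal if $signal;
    return $status >> 8;           # high byte: the child's own exit code
}
```

If you wanted runlock to report a killed subprocess as a failure, one option would be to exit with exit_code_from_status($ret) in the parent instead of shifting $ret directly.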

The "# NOTREACHED" comment simply indicates that that part of the code will never run, because both branches of the above if statement terminate the process. It's an interesting historical point, however: the comment "NOTREACHED" has been used in programs for years to indicate this. The practice started in C, but seems to have migrated to perl and other languages. I think it used to be a signal to the ancient "lint" program in C that it should shut up and not give you a warning.

    print STDERR "Still locked.\n";
    exit 0;

Finally, the very last part of the program exits and returns a success code. We only get here if we didn't manage to take the lock.

It seems a little weird to return success in such a case, but it works: the primary use of runlock is in a cron job, and cron sends you annoying emails if the job returns non-zero. Since the fact that the previous run is still running is not considered an error, it works much better to return zero here.

If you use cron2rss, your captured output will include the "Still locked" message anyway.

runlock was originally written for my gitbuilder project.

2009-02-26