Tuesday, August 31, 2010

Locking daemons in bash

NOTE: Scripts in this post were tested on CentOS release 5.3 and Ubuntu 14.04 and may not work in other Unix dialects. For example, the flock(1) utility is absent from OS X 10.6.

TL;DR

Shell scripts designed to have only one copy running often require manual clean-up after an ungraceful shutdown or process crash. This solution provides a "self-cleaning" alternative. The full example script is posted on the GitHub page. Run two instances of the script to see it work. As the script writer, you are still responsible for cleaning up any other resources to achieve a proper startup.

Details

When writing a daemon shell script we traditionally use *.pid files to store the process ID for later use and to avoid multiple copies running simultaneously and potentially colliding. The common pattern of such use is:

self=$(basename "$0")
lock="/tmp/${self}.pid"
if [ -f "${lock}" ]
then
    echo "Another copy of ${self} potentially running." >&2
    echo "Check the ${lock} lock and remove if necessary." >&2
    exit 1
else
    echo $$ >"${lock}"
    trap 'rm -f "${lock}"' EXIT
fi

The problem with this approach is that the EXIT trap does not run when the process is killed with SIGKILL or when the system crashes. The lock file is then left behind, stale, and a new copy of the daemon will not start without manual cleanup.

Another approach, documented in the flock(1) man page and long used by system programmers, is to take an exclusive lock on the “pid” file through a file descriptor owned by the process. In this case, the pid file becomes the bearer of the lock instead of being the lock itself. Such a lock exists only as long as the file descriptor stays open. When the processes that own the descriptor die, the descriptor is destroyed together with the file locks associated with it. In essence, the process holds the lock in its own state, not in the file, and the operating system cleans the lock up when the process quits or dies. That solves the crash problem of the first approach.
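A small experiment can make the self-cleaning behaviour concrete. The sketch below is not part of the original script: it assumes a Linux system with flock(1) from util-linux, and the function names (lock_is_free, hold_lock, demo_self_cleaning) and the temporary file are made up for the illustration. It starts a background process that holds the lock, kills it with SIGKILL so that no trap can run, and probes the lock before and after.

```shell
#!/bin/bash
# Sketch: a lock held through a file descriptor disappears with its owner.
# Function names and the temporary file are illustrative only.

# Return 0 if an exclusive lock on file $1 can be taken right now.
lock_is_free() {
    flock --exclusive --nonblock "$1" true
}

# Start a background process that holds an exclusive lock on file $1.
hold_lock() {
    (
        exec 221>>"$1"              # open without truncating
        flock --exclusive 221       # blocking, so a concurrent probe cannot make it fail
        exec sleep 60               # the subshell becomes sleep(1), descriptor 221 stays open
    ) &
}

demo_self_cleaning() {
    local lockfile holder i
    lockfile=$(mktemp)
    hold_lock "$lockfile"
    holder=$!

    # Give the holder up to five seconds to take the lock.
    i=0
    while [ "$i" -lt 50 ] && lock_is_free "$lockfile"; do
        sleep 0.1
        i=$((i + 1))
    done

    if lock_is_free "$lockfile"; then echo "before kill: open"; else echo "before kill: locked"; fi

    kill -KILL "$holder"            # simulate a crash: no trap handler gets to run
    wait "$holder" 2>/dev/null || true

    if lock_is_free "$lockfile"; then echo "after kill: open"; else echo "after kill: locked"; fi
    rm -f "$lockfile"
}

demo_self_cleaning
```

On a system where flock(1) behaves as described above, the run should report the lock as locked before the kill and open after it, with no cleanup code involved.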

Let’s look at the code:

1:   pidf=/tmp/${self}.pid                 # Define lock file name
2:   exec 221>>${pidf}                     # Open with descriptor 221; append mode, so a losing instance cannot truncate the file
3:   flock --exclusive --nonblock 221 ||   # Attempt to acquire the lock
4:   {
5:       # Lock acquisition failure code   # Your custom error handler here
6:       exit 1                            # optionally exit the script
7:   }
8:   echo $$ >${pidf}                      # Lock held: overwrite the lock file with our PID

After this block of code, the .pid file is locked by the current process and by any children it starts from now on. Children share the lock through the file descriptors they inherit from the parent.
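Inheritance cuts both ways: a long-running child keeps the lock alive even after the parent has let go of it. If that is not wanted for a particular helper, one option is to close the descriptor for that child only, with a 221>&- redirection on its command line. The following is a sketch under that assumption, not code from the original script; probe_wait, demo_inheritance and the sleep helpers are placeholders, and the --timeout option is assumed to be available in the installed flock(1).

```shell
#!/bin/bash
# Sketch: a child started with "221>&-" does not share the parent's lock.
# All names here are illustrative.

# Print "open" if the lock on file $1 becomes free within one second, else "locked".
probe_wait() {
    if flock --exclusive --timeout 1 "$1" true; then echo open; else echo locked; fi
}

demo_inheritance() {
    local pidf child
    pidf=$(mktemp)

    exec 221>>"$pidf"
    flock --exclusive --nonblock 221 || return 1
    sleep 10 &                       # inherits descriptor 221 and, with it, the lock
    child=$!
    exec 221>&-                      # the parent lets go of its descriptor
    echo "child with descriptor: $(probe_wait "$pidf")"
    kill "$child"
    wait "$child" 2>/dev/null || true

    exec 221>>"$pidf"
    flock --exclusive --nonblock 221 || return 1
    sleep 10 221>&- &                # descriptor closed for this child only
    child=$!
    exec 221>&-
    echo "child without descriptor: $(probe_wait "$pidf")"
    kill "$child"
    wait "$child" 2>/dev/null || true

    rm -f "$pidf"
}

demo_inheritance
```

In the first half the lock should still be reported as locked although the parent has closed its descriptor, because the child holds a copy. In the second half it should be reported as open.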

To release the lock, one can either close all file descriptors holding the lock, or use the /sbin/flock --unlock ... command to explicitly release the lock.
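Both ways of releasing the lock can be tried in a few lines. This is a minimal sketch assuming flock(1) from util-linux; the names probe_now and demo_release and the temporary file are invented for the example. Note that an explicit unlock leaves descriptor 221 open, so the same descriptor can be locked again later.

```shell
#!/bin/bash
# Sketch: the two ways of releasing the lock, with a probe after each step.
# Names are illustrative.

# Print "open" if an exclusive lock on file $1 can be taken right now, else "locked".
probe_now() {
    if flock --exclusive --nonblock "$1" true; then echo open; else echo locked; fi
}

demo_release() {
    local pidf
    pidf=$(mktemp)

    exec 221>>"$pidf"
    flock --exclusive --nonblock 221 || return 1
    echo "held: $(probe_now "$pidf")"

    flock --unlock 221               # drop the lock, descriptor 221 stays open
    echo "after --unlock: $(probe_now "$pidf")"

    flock --exclusive --nonblock 221 || return 1   # the open descriptor can be locked again
    exec 221>&-                      # closing the last descriptor drops the lock too
    echo "after close: $(probe_now "$pidf")"

    rm -f "$pidf"
}

demo_release
```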

A user may check which processes have the lock file open, and therefore may be holding the lock, using the fuser(1) utility:

$ /sbin/fuser -v monitord.pid

                     USER       PID ACCESS COMMAND
monitord.pid:        auser    28576 F....  monitord
                     auser    28579 F....  ping

Note that line 8 of the locking code, which stores the PID of the locking process in the .pid file, is redundant, as that information can be retrieved from the lock itself. The PID is stored in the file only for convenience later in the code, when we need it to query or manage the process. It is very important to note that the mere presence of the .pid file in this arrangement does not mean that the lock is active. The file only records the PID of the process that currently holds the lock or was the last one to hold it.

To programmatically test for the presence of the lock we need to attempt to grab the lock. If we fail, then there was another lock already on the file. If we succeed, then there was no lock on the file. In any case we need to close the file descriptor to avoid inadvertently holding the lock ourselves.

exec 232<"${pidf}"                       # open the pid file on descriptor 232
if flock --exclusive --nonblock 232
then
    echo "Open"
else
    echo "Locked"
fi
exec 232<&-                              # close descriptor 232, dropping any lock we took
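The same test can be wrapped in a function so that the rest of the script can trust the PID in the file only while the lock is held. The sketch below is one possible shape, not the original author's code: is_locked and locked_pid are hypothetical names, and it relies on flock(1) returning exit status 1 when a non-blocking attempt fails. It uses the flock FILE COMMAND form, so the lock is taken and dropped inside the flock(1) process and no descriptor needs closing afterwards.

```shell
#!/bin/bash
# Sketch: test for an active lock and read the PID only while it is held.
# Function names are illustrative.

# Return 0 if some process holds a lock on file $1, non-zero otherwise.
is_locked() {
    local rc=0
    [ -e "$1" ] || return 1                               # no file, no lock
    flock --exclusive --nonblock "$1" true || rc=$?       # try to grab the lock, run true(1), let go
    [ "$rc" -eq 1 ]                                       # status 1: the lock was already taken
}

# Print the PID recorded in file $1, but only while the lock is active.
locked_pid() {
    is_locked "$1" && cat "$1"
}
```

A caller could then write something like: pid=$(locked_pid "${pidf}") && kill -TERM "${pid}". Keep in mind that the probe itself holds the lock for an instant, so a daemon starting at exactly that moment with --nonblock could lose the race.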

This technique only tells us whether a lock is present. To send KILL or other signals to the locking processes, it is more convenient to use the fuser(1) utility, like this:

/sbin/fuser -k ${pidf}

By default, fuser -k sends SIGKILL to every process that has the file open, which should take care of the parent and children processes, if any.
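Since SIGKILL gives the daemon no chance to clean up after itself, a gentler variant is to ask first and insist later. The sketch below assumes the psmisc version of fuser(1), which accepts a signal name after -k; stop_daemon and its grace-period argument are hypothetical and not part of the original script.

```shell
#!/bin/bash
# Sketch: stop every process that has the lock file open, politely first.
# stop_daemon is an illustrative name; fuser(1) from psmisc is assumed.

# Usage: stop_daemon FILE [GRACE_SECONDS]
stop_daemon() {
    local pidf=$1 grace=${2:-5}

    # Send SIGTERM; a non-zero status means nobody had the file open.
    fuser -k -TERM "$pidf" >/dev/null 2>&1 || return 0

    # Wait up to $grace seconds for the processes to go away.
    while [ "$grace" -gt 0 ]; do
        fuser "$pidf" >/dev/null 2>&1 || return 0
        sleep 1
        grace=$((grace - 1))
    done

    # Still there: fall back to SIGKILL.
    fuser -k -KILL "$pidf" >/dev/null 2>&1
    return 0
}
```

It would be called as stop_daemon "${pidf}" 10. The calling script must not have the lock file open itself, otherwise it would receive the signal as well.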

There is more to graceful daemon writing in bash. Other topics are logging with rotation, sleeping, and configuration. Be mindful that these issues do not magically disappear when using bash - you need to address them just like you would when writing any other program.