Thursday, October 29, 2009

Spin Locks

In my post Multithreading And Atomic Integers I briefly introduced atomic integer operations and GCC's built-in functions for them. In this post and the next I want to discuss spin locks that can be implemented using the GCC built-ins.

First, consider a very simple implementation of a spin lock:

class SpinLock
    SpinLock() : mWord(0) {}
    void Lock() { while (!__sync_bool_compare_and_swap(&mWord, 0, 1)); }
    void Unlock() { mWord = 0; }
    int mWord;

To use this SpinLock, allocate one that is visible to all threads, and then in each thread do something like this:

extern SpinLock gLock;
void Foo()
    ... critical section here

This SpinLock class allows only one thread into the critical section at a time. It does it using just one bit of an atomic integer -- the code would work fine if mWord was declared as a char. If more than one thread wants into the critical section, one will be allowed access, and the others will spin in a very tight loop. Each spinning thread will burn CPU cycles (and memory bandwidth) until it acquires the lock.

It's worth spending a moment here to consider when spin locks are appropriate. One rule of thumb is that you should only use spin locks if you are willing to limit the number of active threads to the number of cores on your machine. Futhermore you should know that your application is going to be the only process consuming significant CPU cycles on the machine. If you use spin locks, you want it to be highly improbable that the kernel will suspend a thread while it holds a spin lock.

If you are willing to limit the number of active threads to the number of cores, you probably have threads that are primarily CPU bound. If the threads are primarily I/O bound, you are probably better off using mutexes and creating more threads than cores. But note that I am not saying that your application can't do I/O. I have an application that reads a large volume from disk yet is clearly CPU bound. It uses one thread per core and all threads add data into shared data structures. The application performs much better (as measured by the rate that it consumes the data) when using spin locks instead of mutexes.

As a final consideration, if you are going to be able to use spin locks, you probably are using lots of them. If you have only one global mutex, you don't want to replace it with a spin lock. Spin locks work well when they are fine grained and dispersed throughout a large data structure. Perhaps you have a large graph data structure and need a critical section whenever you update any link of this data structure. Each link would then need its own mutex or spin lock. If you have millions of these links then spin locks can be a great choice. The probability of contention on any one link is probably very small and it is likely that the critical sections are short in term of CPU cycles. If you tried to use a Posix mutex for every link, you might find that the memory footprint is prohibitive -- one Posix mutex requires 40 bytes on the Linux platform I use. A spin lock can use as little as one byte.

Let's return to the implementation above. There is one subtle problem with this implementation: if multiple threads contend for the same lock, there is nothing to ensure fairness. Suppose thread A obtains the lock and starts doing the work of its critical section. While it is doing the work, first thread B, then thread C attempt to obtain the lock. Since B was 'first in line', it would be unfair for C to be the thread granted the lock next, but that quite possibly will happen.

If this kind of fairness is a serious requirement for a given application then that application probably shouldn't be using spin locks. But we can implement a spin lock class that ensures fairness. Here is one possibility:

class FairSpinLock
    typedef unsigned char Byte;
    FairSpinLock() : mReserved(0), mReleased(0) {}
    void Lock() { Byte resv=Add(&mReserved); while (Add(&mReleased,0)!=resv); }
    void Unlock() { Add(&mReleased); }
    static Byte Add(Byte* val, Byte x=1) { return __sync_fetch_and_add(val, x); }
    Byte mReserved;
    Byte mReleased;

Here we use two different atomic integers. Both are single bytes and we only use the Add operation, which is defined using __sync_fetch_and_add(). The first byte mReserved is a running count of how many reservations have been made. The second byte mReleased is a running count of how many acquired locks have been released. A thread holds the lock when it finds that all prior reservations have been released. It holds the lock until it releases its reservation.

There is beauty in the simplicity of this implementation. Note that both counters wrap around. But wrap around is not a problem since we only check for equality. Using bytes means we are limited to no more than 255 pending reservations. Just make sure you don't use this code as is on your machine with 256 or more cores.

So far we have only considered simple mutual exclusion. But there are cases where you want to allow multiple readers as long as write operations are excluded. These are called read/write locks. A write lock is a mutual exclusion lock as we have discussed so far. While a thread holds a write lock, no other thread can obtain either a read or a write lock. A read lock excludes other threads from obtaining a write lock, but allows other threads to obtain concurrent read locks.

So, can we implement read/write spin locks? Yes, we can, but I'm going to save that for a future post. It is a fun exercise to try to implement a read/write spin lock, so you might want to try before reading my follow-up post. First start with the FairSpinLock implementation and extend it, while keeping the fairness feature. Then see if you can implement a read/write lock using a single atomic integer instead of two. I don't know how to do this while still ensuring fairness, but I have a working implementation with a single atomic integer that does not ensure fairness. I'll discuss both implementation in the follow-up post.

No comments:

Post a Comment