Friday, October 23, 2009

Multithreading and atomic integers

One of the challenges for programmers today is taking advantage of machines with many CPU cores. Two-core machines are so common now that odds are good that your laptop computer has two cores. Most server-class machines already have at least four cores. Machines with 16 or 32 cores are being widely deployed and it is likely that in the near future we'll see machines with considerably more. Consider that Intel demonstrated 80-core processors two years ago!

Writing applications that take full advantage of many cores is difficult. If you can write your applications such that an N-core server runs N CPU-bound processes with no interprocess communication required then you can probably keep all N cores busy. There are applications like this -- probably anything that runs on BOINC or comparable loosely coupled grid services is in this category -- but I'm more interested in applications that aren't as easily decomposed into many independent processes. Instead, let's assume we are developing applications that run as a single process with at least one thread per core and the threads must work with shared resources, in particular shared data structures in memory. Here we need various mechanisms to assure the integrity of the shared data structures.

The most common mechanism for assuring data integrity is the mutex. In POSIX there is the pthreads API, which includes functions for working with mutexes. With this API you can guard a critical section of code like this:
pthread_mutex_lock(&mutex);
... // critical section here
pthread_mutex_unlock(&mutex);

If two or more threads attempt to execute this code simultaneously, the kernel will ensure that only one thread executes the critical section at a time. The first thread to acquire the lock is allowed to execute the critical section, and while it holds the lock, any other thread that attempts to acquire the lock will be blocked, i.e. the scheduler will suspend that thread and allow some other thread that is ready to run to continue.

Suppose your critical section takes only on the order of a hundred CPU cycles to execute, and it is executed on the order of thousands of times per second. Let's assume the cost of acquiring an unlocked mutex is negligible, but that there will be significant contention for the mutex: perhaps a 10% chance that when a thread tries to acquire the mutex it will block and the kernel will perform a context switch. In that case, the overhead introduced by the mutex can be huge. It's even possible that an application running on multiple cores might not run much faster than if it were rewritten to run single-threaded without mutexes.

Consider the case where the only operation performed in the critical section is to increment a global counter, but that counter may be incremented hundreds of thousands of times per second. In this case, the overhead of using a mutex is quite possibly unacceptable.
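To make the pattern concrete, here is a minimal sketch of a mutex-guarded counter of the kind discussed above. The function names and thread counts are mine, purely for illustration; this is the correct-but-potentially-slow baseline, not a recommendation.

```c
#include <pthread.h>

/* Illustrative names, not from the post: a shared counter guarded
   by a mutex, incremented from several threads. */
static long gCounter = 0;
static pthread_mutex_t gMutex = PTHREAD_MUTEX_INITIALIZER;

static void *increment_worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; ++i) {
        pthread_mutex_lock(&gMutex);   /* enter the critical section */
        ++gCounter;                    /* the entire critical section */
        pthread_mutex_unlock(&gMutex); /* leave the critical section */
    }
    return NULL;
}

/* Spawn nthreads workers, each doing `iterations` increments,
   then return the final counter value. */
long run_counter_test(int nthreads, long iterations)
{
    pthread_t threads[64];
    gCounter = 0;
    for (int i = 0; i < nthreads; ++i)
        pthread_create(&threads[i], NULL, increment_worker, &iterations);
    for (int i = 0; i < nthreads; ++i)
        pthread_join(threads[i], NULL);
    return gCounter;
}
```

Because every increment holds the lock, the result is always exact; the question the rest of the post explores is what that lock costs under contention.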

But perhaps you are wondering whether just incrementing an integer requires any kind of synchronization at all. The quick answer: if you are programming in C or C++ and only have POSIX APIs available to you, then yes, you absolutely do need a mutex. Suppose we have this function:
extern int gCounter;

void foo()
{
    ++gCounter;
}

The code your compiler will generate is most likely three separate operations:
  1. Load from RAM into CPU register
  2. Increment CPU register
  3. Store from CPU register into RAM.
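The three steps above can be written out as C code that mirrors the machine operations the compiler emits. This is illustrative only (the global here is defined locally so the snippet is self-contained, where the text uses an extern):

```c
/* Stands in for the extern global from the text. */
int gCounter = 0;

/* What ++gCounter compiles to, spelled out as three separate,
   non-atomic steps. Another thread can run between any two of them. */
int increment_nonatomic(void)
{
    int reg = gCounter;  /* 1. load from RAM into a CPU register */
    reg = reg + 1;       /* 2. increment the CPU register        */
    gCounter = reg;      /* 3. store the register back to RAM    */
    return reg;
}
```

If two threads both execute step 1 before either reaches step 3, both load the same value, and one increment is lost.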

Suppose two threads running on 2 cores of a multicore processor both perform these operations simultaneously. What happens? We want the integer to be incremented twice, but it is entirely possible that when both threads have completed these three operations, the value in memory is only one count larger than it was before.

The problem is that the three operations are not performed atomically. It is certainly possible for a CPU designer to provide a single instruction that performs the three steps atomically (and as we'll see, they have), but such instructions will be more expensive to execute than the three primitive operations. So, modern CPUs typically provide both kinds of instructions. However, the C and C++ language definitions don't provide a standard way to access the atomic operations, and POSIX does not define an API for atomic integer operations.

Of course, most compilers provide various extensions, such as the ability to write inline assembler, but these extensions are nonportable. One ramification of the non-portability is that they aren't documented in common sources of documentation. Consider that your humble 4 Dan only just recently stumbled upon the atomic builtins introduced in GCC 4.1.1. For example, there is this operation:
type __sync_add_and_fetch(type* ptr, type value);
For type, GCC will allow any integral scalar or pointer type that is 1, 2, 4 or 8 bytes in length, though the documentation is not entirely clear on the restrictions that might apply on different CPU architectures. I can confirm that the built-in performs correctly on 64-bit Intel Xeon and AMD Opteron processors using the long integer (64-bit) type. I wrote a relatively simple stress test to confirm this, which might be the subject of an upcoming post.
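A stress test of the sort described might look like the sketch below. This is my reconstruction under stated assumptions (thread counts and function names are invented), not the author's actual test: several threads hammer one counter with __sync_add_and_fetch, and we check that no increments were lost.

```c
#include <pthread.h>

/* Illustrative names: a counter incremented only via the GCC
   atomic builtin, with no mutex anywhere. */
static long gAtomicCounter = 0;

static void *atomic_worker(void *arg)
{
    long iterations = *(long *)arg;
    for (long i = 0; i < iterations; ++i)
        __sync_add_and_fetch(&gAtomicCounter, 1L);  /* atomic increment */
    return NULL;
}

/* Run nthreads workers of `iterations` increments each and
   return the final value; it should equal nthreads * iterations. */
long run_atomic_stress(int nthreads, long iterations)
{
    pthread_t threads[64];
    gAtomicCounter = 0;
    for (int i = 0; i < nthreads; ++i)
        pthread_create(&threads[i], NULL, atomic_worker, &iterations);
    for (int i = 0; i < nthreads; ++i)
        pthread_join(threads[i], NULL);
    return gAtomicCounter;
}
```

With the builtin, no increments are lost even without a mutex; replace the builtin with a plain ++ and a multicore run will usually come up short.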

So, to be clear, the builtin behaves as if there are functions declared as
inline char __sync_add_and_fetch(char* ptr, char value);
inline short __sync_add_and_fetch(short* ptr, short value);
inline int __sync_add_and_fetch(int* ptr, int value);
inline long __sync_add_and_fetch(long* ptr, long value);
Or if you are comfortable with C++ templates, like this:
template <typename T>
inline T __sync_add_and_fetch(T* ptr, T value);

So what does __sync_add_and_fetch() do? Well, as you probably guessed, we can use it to do an atomic increment:
int result = __sync_add_and_fetch(&gCounter, 1);
This is an operation we can safely do from several threads on a multicore machine, and know that gCounter will accurately reflect the actual number of operations performed. Note that __sync_add_and_fetch() returns the result of the increment. If the code you are executing in your thread needs to know the current count after the increment, it should do as I've shown here, and not simply access gCounter again. To illustrate why, suppose we need to take some action every thousand increments. Assume we write code that looks like this:
__sync_add_and_fetch(&gCounter, 1);
if ((gCounter % 1000) == 0)
    DoActionOnRollover();

This code won't behave as we intend. The function DoActionOnRollover() might not get called when it is supposed to, and it might get called more than once, because other threads can increment gCounter between our increment and our read of it. But let's rewrite it this way:
if ((__sync_add_and_fetch(&gCounter, 1) % 1000) == 0)
    DoActionOnRollover();

Now we can be sure that DoActionOnRollover() will get called once and only once per 1000 increments of gCounter.

There is quite a bit more interesting material here, but this post is already long enough. Future posts will cover some performance measurements and other things you can do with the GCC Atomic Builtins.


Mohammed Shahid said...

Very helpful, 4 Dan. I was working on a few lightweight mutual exclusion algorithms and wanted to implement them in working systems. Is it possible to know how the mutual exclusion in GCC 4.1.1 is implemented? Any updates on the topic here would also be gladly appreciated.

Jim said...

Hi Mohammed. GCC doesn't provide mutual exclusion. You'd need to use the POSIX mutex API, or perhaps some other API. You ask "is it possible to know how mutual exclusion ... is implemented?". If you want to know how it is implemented in the GNU C Library's implementation of POSIX, then the source code is available. But perhaps you really want to know how to implement mutual exclusion using GCC atomic primitives? My post on spinlocks shows one way, but pay careful attention in that article to my notes on when a spinlock is appropriate. In most cases, you want a mutex that will suspend the thread if it can't immediately acquire the lock.

Anonymous said...

Interesting post. It doesn't help fix my 'undefined symbol __sync_add_and_fetch' problem, but it does explain the background of what's going on and why it is needed. Thanks!

Post a Comment