Thursday, December 7, 2006

Function Static Variables in Multi-Threaded Environments

Consider the following function:

int calcSomething(); // defined elsewhere

void foo()
{
    static int x = calcSomething();
}

It seems simple enough, and it is. The static variable will be initialized once, based on the result of the calcSomething function. With constant initializers, the compiler can simply bake the value into the static's storage and do no work at run time. In this case, where the initializer is a call to a function (a function we know nothing about, might I add), it doesn't have that luxury and must generate code to do the initialization lazily, on the first call. Looking at the generated assembly code, you'll see something like this:

mov eax,1
test byte ptr [$S1],al
jne foo+1Dh
or dword ptr [$S1],eax
call calcSomething
mov dword ptr [x],eax

Loosely translated to pseudo C++, this will be:

void foo()
{
    static bool x_set = false;
    static int x;
    if(!x_set)
    {
        x_set = true;
        x = calcSomething();
    }
}

As you can see, there's no interlocking code here, which means the function is anything but thread safe. One thread may pass the if(!x_set) check but get swapped out before it executes x_set = true, and another thread can then do the same. The result is that calcSomething is executed two or more times, which is likely to be a bad thing.
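
To make the problem concrete, here's a rough, self-contained sketch of how the race might be provoked. The calcSomething body, the Sleep, and the thread count are all made up for illustration; the point is simply that with the guard code shown above, and no other synchronization, both threads may end up running the initializer.

#include <windows.h>
#include <stdio.h>

// Hypothetical stand-in for calcSomething; the Sleep widens the race window.
int calcSomething()
{
    printf("calcSomething called by thread %lu\n", GetCurrentThreadId());
    Sleep(100);
    return 42;
}

void foo()
{
    static int x = calcSomething();
}

DWORD WINAPI worker(LPVOID)
{
    foo();
    return 0;
}

int main()
{
    HANDLE threads[2];
    threads[0] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    threads[1] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);
    // With the guard code shown above, "calcSomething called" may be printed twice.
    return 0;
}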

That's it for the recap. Now, if you'd like to fix this problem, what comes to mind? Interlocking, obviously. One very simple way to do this interlocking is to use the InterlockedIncrement function, which is guaranteed to be an atomic operation, like this:

#include <windows.h>

int calcSomething(); // defined elsewhere

void foo()
{
    volatile static long initialized = 0;
    static int x;

    // Only the very first call will get 1 back from InterlockedIncrement.
    if(InterlockedIncrement(&initialized) == 1)
    {
        x = calcSomething();
    }
    initialized = 1;

    // ... Do whatever with x
}

The interlocked increment assures that no two threads or calls can get the same value back, so only one of them can ever see 1. This, of course, means that as long as initialized is never reset to 0, calcSomething will be called only once. And that's what we wanted, right?
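
If you want to convince yourself of that guarantee, here's a quick sketch (the thread count and the printout are just for illustration) in which every concurrent caller gets a distinct value back, so at most one of them can ever see 1:

#include <windows.h>
#include <stdio.h>

volatile LONG counter = 0;

DWORD WINAPI worker(LPVOID)
{
    // Each call returns the new value of the counter; no two calls see the same one.
    LONG mine = InterlockedIncrement(&counter);
    printf("thread %lu got %ld\n", GetCurrentThreadId(), mine);
    return 0;
}

int main()
{
    HANDLE threads[4];
    for(int i = 0; i < 4; ++i)
        threads[i] = CreateThread(NULL, 0, worker, NULL, 0, NULL);
    WaitForMultipleObjects(4, threads, TRUE, INFINITE);
    return 0;
}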

And here's the disassembly:

push offset initialized
call dword ptr [__imp__InterlockedIncrement@4]
cmp eax,1
jne foo+1Ah
call calcSomething
mov dword ptr [x],eax
mov dword ptr [initialized],1

First, the return value of InterlockedIncrement is compared to 1. If it's equal, which means this was the first run, calcSomething is called, and x is initialized to its return value. The assignment of 1 to initialized at the end is there to make sure the counter never overflows the long. If you don't expect the function to be called more than 2147483648 times, feel free to drop it.

The approach shown here can also be used if you want a function with its own local critical section. Instead of the static int x, you could have a static CRITICAL_SECTION variable. Swap x = calcSomething() with InitializeCriticalSection(&cs), and you'd be home free.
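
Here's a rough sketch of what that could look like. The function name and body are made up, and it relies on the same assumption as the pattern above: the first caller gets through InitializeCriticalSection before anyone else needs the critical section.

#include <windows.h>

void doProtectedWork()
{
    volatile static long initialized = 0;
    static CRITICAL_SECTION cs;

    // Same trick as before: only the first caller gets 1 back and initializes cs.
    if(InterlockedIncrement(&initialized) == 1)
    {
        InitializeCriticalSection(&cs);
    }
    initialized = 1;

    EnterCriticalSection(&cs);
    // ... Do the work that needs serializing
    LeaveCriticalSection(&cs);
}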

As a final exploration: why does InterlockedIncrement work the way it does? It's all very simple. Consider the following disassembly:

mov ecx,dword ptr [esp+4]
mov eax,1
lock xadd dword ptr [ecx],eax
inc eax
ret 4

The interesting part here is the locked xadd. The xadd operation exchanges the content of the destination (whatever ecx points to) with that of the source (eax), and then adds the two together and places the result in the destination.
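
If you prefer to see that in C++-ish terms, here's a rough model of the lock xadd plus the final inc. The helper names are mine, and the real thing is of course a single atomic instruction, not a function call.

// Conceptual model only: thanks to the lock prefix, the hardware performs the
// body of atomicXadd as one indivisible step.
long atomicXadd(volatile long* dest, long value)
{
    long old = *dest;      // eax and the destination exchange: eax gets the old value
    *dest = old + value;   // the destination gets the sum
    return old;
}

long interlockedIncrementModel(volatile long* dest)
{
    return atomicXadd(dest, 1) + 1; // the trailing "inc eax" before ret
}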

When this function is run in the example above, ecx will point to the static variable called initialized. On the first run, initialized will be 0. When the xadd is executed, the 0 ecx points to and the 1 in eax change places, their sum, 1, is stored back where ecx points, and eax is left holding 0. Finally, eax is incremented to 1, and the function returns. If you look back at the previous disassembly, namely cmp eax,1, you will see that this comparison succeeds on the first run, and so calcSomething will be called. For the second InterlockedIncrement, the value ecx points to will be 1. When this is added to the 1 in eax, the memory ends up holding 2, and after the final inc that is also what's returned.

In a single processor or core environment, the xadd alone would be enough to do a "synchronized" increment. Only one instruction is needed to increment what ecx points to, as well as take a snapshot of the old value. If one thread manages to execute the xadd and is then swapped out, and another thread also does the xadd, the value ecx points to will have been incremented twice; the first thread will have a 1 in its eax (after the inc), and the second thread will have a 2. Under no circumstances can the two threads end up with the same value in eax, and that's exactly what's needed to make sure that calcSomething is called only once.

In a multi-processor environment, you also need the lock prefix on the xadd above. This prefix makes the memory access itself synchronized. Without the lock, two processors, or cores, may very well execute the xadd at the exact same time, which could lead to two threads ending up with the same eax. With the lock in place, two or more processors or cores may not access that memory at the same time. Only one of them will be allowed to perform the xadd instruction on the given piece of destination memory (such as the initialized variable) at any one time. Thus, you are assured that eax will have a unique value for each call to InterlockedIncrement. And again, that's what you want.