Threads Basics

Author: Hans-J. Boehm,

First Published: Nov. 17, 2008; Last revised: January 16, 2011

Sequential consistency: A simple model
The problem with pure sequential consistency
Data races and a more restrictive model
Locks
Why we haven't lost anything important
Avoiding data races at higher levels of abstraction
A few Java specifics
A few C++0x specifics
Memory locations and C / C++ bit-fields
A few pthread specifics
Some more examples
A note on implementation of synchronization operations (optional)
Omissions and extensions
History and acknowledgments

A multithreaded program allows multiple separate programs, or threads to run concurrently, in a way that allows them all to update and access the same shared variables. Each thread has its own local variables. But all threads see the same global variables or "static" class members. And all threads can reference data accessible via those global variables (or via any references supplied to the thread through other means, such as arguments passed when the thread is started).

Threads have long been used to structure programs that logically deal with multiple independent input sources at the same time. A program that must provide both a responsive graphical user interface, and perform a computationally intensive calculation at the same time, typically chooses to perform the computationally intensive calculation in one thread, and drive the user interface from another, so that it can respond to user actions while the computation is progressing.

Separate threads can be executed on a single processor, by "interleaving" execution of the individual threads. The processor performs some steps corresponding to one thread, then performs some steps corresponding to another, and so on. Alternatively, each thread can be executed by its own processor core. Or each hardware processor could support multiple threads in hardware. For present purposes hardware threads are similar enough to multiple cores, which are similar enough to multiple processors on separate chips, that we don't make the distinction.

Since threads can run concurrently on more than one processor core, they are also increasingly used simply to take advantage of the performance of multicore processors. In other cases, they can serve to take advantage of other forms of hardware concurrency, for example, by having a thread continue to run while another waits for a disk drive.

Although threads usage is very widespread and growing, the basic rules for programming with threads, and particularly for accessing shared variables, have been confusing. Even "experts" have often advocated contradictory approaches. As a result of recent efforts to resolve these issues in languages such as Java, C++, and C, the rules are finally becoming clear. This is an attempt to explain the basics, both to experienced — but possibly confused — threads programmers, and to programmers who are only minimally familiar with threads.

We focus on those pieces that are more or less consistent across these languages, and only briefly mention some of the differences below. (We do not directly discuss C#, but we expect it eventually to follow a direction similar to Java.)

Our examples simply present the code executed by each thread. In a real application, those threads would be created with language- or platform-specific library calls.

In spite of the fact that threads are often used to take advantage of real hardware parallelism, we will mostly describe thread execution as interleaving the steps of individual threads, roughly corresponding to uniprocessor execution. To us, this is an easier model to explain. And programs built to be correct under this model will execute correctly on a genuinely parallel implementation. (Unfortunately, programs that are incorrect under this model may fail in much more mysterious ways on multicore implementations.)

Real parallel execution introduces additional issues and concepts for the programming language implementor (e.g. hardware memory consistency and memory fences). However, the programming model discussed here allows all of these to be ignored by the application programmer using threads. We strongly recommend that they in fact be ignored at this level, except in relatively rare cases with unusually stringent performance constraints, or possibly when the user has to accommodate deficiencies of existing implementations. The techniques required for such cases are only hinted at near the end of this paper.

Sequential consistency: A simple model

The execution of a multithreaded program can be viewed as follows:

Each step in the execution consists of choosing one of the threads to execute next, and then performing the next step in its execution. This process is repeated until the program as a whole terminates.

Thus effectively the execution of the whole program can be viewed as taking all the steps executed by each thread, and interleaving them in some way. Whenever an object (i.e. variable, field, or array element) is accessed, the last value stored by this interleaved sequence is retrieved as the value of that object.

Consider a program that first creates a second thread, and then executes in the two threads, where x and y are global variables of a suitable integer type, initialized to zero:

Thread 1	Thread 2
`x = 1; r1 = y;`	`y = 1; r2 = x;`

This can be executed by interleaving the steps from the two threads in many different ways. Here are three of the possible executions that together illustrate all possible final values of r1 and r2:

Execution 1	Execution 2	Execution 3
`x = 1; r1 = y; y = 1; r2 = x; // result: r1 == 0 and r2 == 1`	`y = 1; r2 = x; x = 1; r1 = y; // result: r1 = 1 and r2 = 0`	`x = 1; y = 1; r1 = y; r2 = x; // result: r1 = 1 and r2 = 1`

Although many other interleavings are also possible, it is not possible that both r1 and r2 are 0 at the end of an execution; any execution must start with the first statement of one of the two threads, and the variable assigned there will later be read as one.

(This presentation has been oversimplified in some ways. When we write down the interleavings, we expect each access of a shared variable to return the last value assigned to it in the interleaving. But clearly local variables don't play by the same rules, since each thread or function has a separate copy. Below we indicate that by renaming different instances of local variables in the interleaving.)

For our purposes, there is never a requirement that the interleavings be "fair" to all the threads. There is absolutely no guarantee on how quickly the individual threads progress. If two threads are executing a billion steps each, the implementation is allowed to execute all billion instructions of one of the threads before it executes any instructions of the other. Or it may execute all but the last instruction of one thread first, and delay that one instruction to the end. Or it may alternate steps from the two threads.

This kind of execution is normally referred to as a sequentially consistent execution. The actual execution model used by modern multithreaded programming languages deviates from sequential consistency in one simple, but important way, which we discuss next.

The problem with pure sequential consistency

Unfortunately, it is too expensive to guarantee sequentially consistent executions in all cases. For example, reconsider the program above. Most processors do not actually wait for the initial assignments to complete before executing the second statements in each thread. The initial assignments just cause the value 1 to be saved in a store buffer, where it waits to be copied to memory, but is not yet visible to another processor. If the two threads were to execute in lock step on different processors, it is quite likely that both r1 and r2 will be zero, because the reads of both x and y occur before the writes become visible to the other thread.

Similarly, compilers routinely transform code in ways that violate sequential consistency. For example, if our initial example occurred as part of a larger program, the compiler might decide to move one or both of the r2 = x and r1 = y "load" operations to the beginning of their respective threads, to give them more time to complete before the values in r1 and r2 are needed. Causing the load to happen early is essentially equivalent to the hardware delaying the store; it can again cause both loads to read zero. Since the two operations in each thread touch independent variables, there is no reason for the compiler to believe that they are not interchangeable. This kind of compiler transformation can produce significant performance improvements, and we do not want to disallow it.

In essence, both hardware and compilers may perform very similar optimizations that reorder memory references as perceived by other threads.

There has been a significant amount of research on reducing the cost of sequential consistency, both in hardware, and through more complete compiler analysis. But most experts feel that the hardware costs are still too high, and the compiler optimizations require too much information about the entire program to be generally feasible. And, as we will see below, insisting on pure sequential consistency generally does not provide us with any real advantage.

Data races and a more restrictive model

Most modern multithreaded programming languages avoid this tension between sequential consistency and performance by observing that:

This problem arises because certain program transformations, like interchanging the order in which unrelated variables are accessed in our example, never change the meaning of single-threaded programs, but do change the meaning of multithreaded programs (e.g. by allowing both r1 and r2 to be 0 in the example).
These transformations are detectable only by code that allows two threads to access the same shared data simultaneously, in certain conflicting ways, e.g. by having one thread write the data while another reads it. If the program forced the accesses to shared variables to occur in a certain order, we wouldn't be able to tell whether accesses to independent variables had been rearranged, for the same reasons that a single-threaded program cannot tell.
Unrestricted simultaneous access of ordinary shared variables makes it much harder to reason about programs, and generally needs to be avoided in any case. We get nothing useful in return for insisting on complete sequential consistency. We devote a separate section to this issue below.

Thus programming languages generally provide convenient mechanisms for limiting simultaneous access to variables by different threads, and guarantee sequential consistency only if no uncontrolled concurrent accesses, like the ones in our example program, are present. We can make this more precise as follows:

We say that two ordinary memory operations conflict if they access the same memory location (for example, variable or array element), and at least one of them writes to the location.

We say that a program allows a data race on a particular set of inputs if there is a sequentially consistent execution, that is an interleaving of operations of the individual threads, in which two conflicting operations can be executed "simultaneously". For our purposes, two such operations can be executed "simultaneously", if they occur next to each other in the interleaving, and correspond to different threads. If two ordinary memory operations corresponding to different threads occur adjacently in the interleaving, we know that they could equally well have occurred in the opposite order; if some operations enforced the order, they would have to appear in between in the interleaving. Thus adjacency in this model really means that they could have happened simultaneously in a model with true concurrency.

We then guarantee sequential consistency only when the program avoids data races.

This guarantee is at the core of the threads programming model for both Java and the next C++ standard.

Note that the example from the introduction does have data races, at least if the variables x and y are ordinary integer variables.

Most modern programming languages provide an easy way to specify that certain synchronization variables are used to communicate between threads, and are intended to be accessed concurrently. Concurrent accesses to variables declared in this way are not considered data races. Putting it another way, sequential consistency is guaranteed so long as the only conflicting concurrent accesses are to synchronization variables.

Thus we can guarantee that our initial example program behaves as expected if we make both x and y synchronization variables. In Java, this would usually be done by declaring them as volatile int.

Declaring a variable as a synchronization variable, in addition to ensuring that the variable is accessed indivisibly, prevents both the compiler and the hardware from reordering memory accesses in ways that are visible to the program. This is necessary to prevent the r1 = r2 = 0 outcome for our introductory example. Another, more common, example is the following, where only the flag x_init (initially false) is a synchronization variable:

Thread 1	Thread 2
`x = 42; x_init = true;`	`while(!x_init) {} r1 = x;`

The idea here is that thread 1 initializes x. In a real program, this may be more complicated than simply assigning 42 to it. It then indicates that it has finished by setting the x_init flag. Other threads, like thread 2 here, will wait for the x_init flag to be set, and then proceed knowing that x is initialized.

This program is data-race-free, in spite of the fact that x is an ordinary variable. Thread 2 is guaranteed not to progress to the second statement until the first thread has completed and set x_init. Thus there cannot be an interleaving of the steps in which the actions x = 42 and r1 = x are adjacent.

This means that we are guaranteed a sequentially consistent execution, which means that we are guaranteed that r1 = 42. In order to ensure that, the implementation has to arrange that the thread 1 assignments to x and x_init become visible to other threads in order, and that the r1 = x operation in thread 2 cannot start until we have seen x_init set. On many machine architectures, both of these require the compiler to obey extra constraints and to generate special code to prevent potential hardware optimizations, such as thread 1 making the new value of x_init available before that of x because it happened to be faster to access x_init's memory.

Locks

Most commonly, data races are not actually prevented by turning ordinary variables into synchronization variables, but by preventing concurrent access to ordinary variables. Normally this is done by the introduction of locks (sometimes called mutexes), which are themselves special synchronization variables provided by the language, or by a support library. For example, if we want to maintain a shared counter variable c that can be incremented by multiple threads, and read after the threads finish, we might introduce a corresponding lock l, and write something like:

void increment_c()
{
    l.lock();
    ++c;
    l.unlock();
}

The implementation of lock() and unlock() then ensures that at most one thread at a time is between the two calls, and thus at most one thread at a time can access c. We can think of this as restricting the execution interleavings so that, for a given lock l, l.lock() and l.unlock() calls alternate, and the initial such call is l.lock(). (Typically, but not always, there is an added requirement that a lock be released by the thread that acquired it.) Thus there cannot be a data race on c, even if increment_c() is called concurrently by several threads. Any two accesses to c in the interleaving must be separated by at least an unlock() call in the first thread, and a lock() call in the second thread.

To illustrate this, consider a program with two threads, each of which executes a call to increment_c(). The following is an acceptable interleaving:

l.lock(); ++c; l.unlock(); l.lock(); ++c; l.unlock();

This interleaving does not contain a race. The only adjacent operations corresponding to different threads are the middle two steps. These both operate on the lock l which is a synchronization variable, and hence cannot participate in data races.

If we tried to reschedule the operations to produce a data race, we might try an interleaving like:

l.lock(); l.lock(); ++c; ++c; l.unlock(); l.unlock();

But this is clearly prohibited by our rule that lock acquisitions and releases of l must alternate in the interleaving. In any acceptable interleaving, the second l.lock() call cannot appear in the interleaving until after the first unlock() call. Thus the two instances of ++c, which constitute the only potential data race, must always be separated by these two calls.

This illustrates another, usually equivalent characterization of a data race as a requirement that any two conflicting accesses to the same ordinary shared location by different threads must be separated by a synchronization between the two that enforces an ordering among them. Typically, this is done by releasing a lock in one thread and acquiring it another. But it may also be accomplished by setting an integer synchronization variable (e.g. volatile in Java) in one thread, and then waiting for the variable to be set (possibly with a simple loop) in the second thread. Language standards often define data races in this way.

Why we haven't lost anything important

At first glance, the restriction of sequential consistency to data-race-free programs seems like an unnecessary complication for multithreaded programming, which is certainly already a sufficiently difficult problem. However, there are other important reasons to discourage data races. In fact, this restriction affects almost exclusively programs that we genuinely want to discourage.

Consider a program like our initial example, with x and y interpreted as ordinary integer variables. Assume that we are expecting it to produce one of the three possible expected outcomes we listed earlier. This implicitly relies on the atomicity of certain operations. In our explanation above, we assumed that assignments like x = 1 happened "all at once". In this particular case, that's probably reasonable. But what if the assignment had been x = 1000000 instead? If x were a 32-bit integer on a 16-bit processor, this might have been translated to two hardware "store" instructions, one each for the upper and lower halves of the constant 1000000. Thus the other thread would not only be able to see x values of 0 and 1000000; it might also see "intermediate" values with half the bits in the binary representation updated. In this case, it might also see 983040 or 16960.

This issue appears in many other forms in real programs. In a C or C++ program, a single assignment w = z can copy an entire struct object. On all current hardware, this would usually be implemented as several assignments of the components, even on a 64-bit machine. But again a program that simultaneously reads w in another thread will depend on the "granularity" of the updates to w, i.e. the size of the pieces that are atomically (indivisibly) updated by hardware instructions.

If we avoid data races, and instead enclose operations we would like to be executed indivisibly by lock()/unlock() pairs, making sure that accesses to a given shared variable are always protected by the same lock, we avoid all of these issues. We've made the granularity of indivisible actions explicit.

In those rare cases in which it suffices for program correctness to guarantee indivisibility of individual variable accesses, we can turn the variable into a synchronization variable instead, putting the compiler on notice that it must perform the individual accesses indivisibly. If the variable is too large to be read or written indivisibly by the underlying hardware, this will either be impossible to express (e.g. Java provides no way to qualify large objects with volatile), or be emulated with lock operations (e.g. the next C++ standard).

Real programs generally rely on even larger indivisible operations. If a program maintains a shared ordinary integer counter c, as in the example from the last section, and were to try to increment it by simply writing ++c, the resulting counts would usually not be accurate, even if we assume that c can be read or written with a single indivisibly-executed instruction. The problem is that ++c is usually executed as

tmp = c;
++tmp;
c = tmp;

where tmp is a machine register, which can be thought of as a local variable.

On some machines, compilers generate this code. On other machines, they might generate a single instruction, but the hardware does not execute it indivisibly. In either case, the result is the same.

If we interleave two such counter increment operations performed by two different threads, we can get

tmp1 = c; // reads n tmp2 = c; // reads n ++tmp1; ++tmp2; c = tmp1; // writes n + 1 c = tmp2; // writes n + 1

effectively losing one of the updates.

Again, this can be avoided by acquiring a lock around the increment operation, as we did in our earlier example. In this case, the alternate solution of turning c into a synchronization variable may not be enough. Even if it is a synchronization variable, the increment may consist of two atomic accesses to c as above, instead of a single one to both read and write c, thus allowing the same interleaving as above.

Synchronization variables make it safe to access the variable concurrently from different threads, and require the implementation to behave as though the basic steps were just interleaved. But the size of those basic steps is a different issue. In the case of an increment operation, different programming languages resolve it differently.

In more complicated, and more typical cases, operations on more than a single memory location must be performed indivisibly. This is true, for example, if a thread adds elements to a container data structure, while another examines or removes them. In such cases, by far the easiest approach is to protect the data structure with a lock.

In all of these cases, simply guaranteeing sequential consistency for the original program is not sufficient; the programmer has to remove data races, sometimes in nontrivial ways, in order to ensure correct execution. In the process, the programmer must explicitly delineate the code sections that must be executed indivisibly - an essential piece of writing a parallel program.

Programs with data races almost never produce consistently correct results across various hardware and compiler platforms. (The literature sometimes talks about "benign data races". Most of them are not benign in this sense.) Thus it makes little sense to sacrifice performance to accommodate such programs that are almost certainly incorrect.

Even in the exceedingly rare cases in which such programs are in some reasonable sense correct, the correctness arguments tend to be extremely subtle. Avoiding the data race by identifying the affected variables as synchronization variables identifies the subtlety to both the human reader and the compiler.

To be fair, there is one case in which sequential consistency for programs with data races can be an advantage: When we are debugging a program with unintentional data races, we clearly want it to behave as predictably as possible. But even here sequential consistency is probably the second best behavior; ideally we would like the debugging environment to detect and stop at a data race.

Avoiding data races at higher levels of abstraction

We've defined data races in terms of memory locations that hold basic scalar values, like integers or individual characters. But most programs operate at higher levels of abstraction, invoking library routines to operate on data whose representation is hidden from the programmer. For example, we might use a library that implements sets, and our program may consist of calls that insert and remove elements from sets, and check whether a set contains a particular element. How can data races be prevented at that level?

In the general case, it is the library's responsibility to specify what combinations of concurrent calls will introduce data races. However, there is a common set of conventions that tends to be followed by most libraries not specifically designed for thread communication, e.g. by libraries that implement some kind of container abstraction, such as our set example:

Library functions perform writes that may introduce a data race only as part of an operation that logically causes the data structure to be updated. The library visibly writes data only when the call directly asked it to. If the implementation needs to write to a data structure as part of a logical read operation, such as when updating a cache that is not visible to the client, it needs to internally protect that data against simultaneous access. For example, it might hold an internal lock while accessing the cache.
As a result, concurrent read-only accesses to a higher-level data structure (e.g. concurrent membership tests in a set) are safe, just as concurrently reading the value of an integer variable from two threads is safe.
If client code may concurrently access a container in a way that might cause the container to be updated simultaneously with another access, then the client must protect accesses to the data structure with a lock. For example, if one thread inserts elements into a set s while another tests whether s contains certain elements, then the same lock should be held during both calls. This again is the same rule that we would apply if s were just an integer.

Thus we ensure that we can apply largely the same locking rules at this level as we apply at the level of basic memory locations. And, just as with basic scalar types, we can normally use the same data type both for data private to a thread, and for shared data, both with a minimum amount of locking. If a lock is required, we've pushed decisions about locking granularity to the authors of the client code, the only place they can reasonably be made.

This rule is not universally applicable: Programmers often expect, and many libraries ensure, that we can concurrently update distinct elements of an array-like data structure without introducing a race. Many parallel algorithms rely on this. But there are many other containers, e.g. those implemented as trees, for which an update often conflicts with any other update, and the whole tree really does behave as a single memory location.

Similarly, some other containers may be explicitly designed for inter-thread communication, and could not be meaningfully used if the client had to acquire locks around concurrent accesses. In such cases, the interface must explicitly specify that concurrent accesses are safe, even if they appear to conflict. For example a blocking queue, for which dequeuing an element blocks until the queue is nonempty, must provide for safe concurrent access. Otherwise a thread dequeuing an element from q would have to acquire a lock l. If such a thread blocked because q was empty, it would do so while holding l. This would prevent another thread from an adding an element to the queue, since it would be unsafe to enqueue an element without also holding l.

A few Java specifics

Threads in Java are commonly started by creating a subclass of java.lang.Thread with a suitable run() method, and then invoking the start() on an object of the subclass.

We previously defined data races in terms of simultaneous accesses to "memory locations". In Java, memory locations are just object fields or array elements, as expected.

Since Java is intended to support untrusted code running as part of a trusted application, it must limit the damage done by a data race in the untrusted code. Hence it cannot allow arbitrary behavior for data races. As a result, the Java language specification contains a complicated set of rules defining the behavior of objects shared between threads, including the behavior in the presence of data races. The consequences of these rules can be surprising even to experts. However, they do guarantee sequential consistency for data-race-free programs, and that is a far easier model to program to.

The Java definition of data races uses the alternate definition style we mentioned above. Conflicting operations must be prevented from occurring concurrently by either performing them from the same thread, or by introducing synchronization variables that enforce an ordering between threads. One memory operation is said to happen before another, if they are ordered using these mechanisms, and thus cannot occur adjacently in the interleaving. This is essentially equivalent to our definition.

In almost all cases, Java programs should be written simply to avoid data races, and rely on sequential consistency. There are really only three circumstances in which the additional guarantees for data races matter:

For compiler writers, who must preserve them.
For authors of particularly security sensitive code who need to limit the damage that could be caused by untrusted "sandboxed" code.
For very sophisticated authors of extremely performance sensitive code, for which the additional cost of using synchronization variables is too high. Some such code is embedded in the java.util.concurrent library, but we expect very few other programmers to write it.

Java provides locks in an unusual way: Every Java object can act as a lock, i.e. it logically has an associated lock. Rather than providing explicit lock() and unlock() operations, Java provides synchronized blocks (and methods) to acquire and release a lock, holding it while a specified block of code is executed:

synchronized (object_used_as_lock) {
code to be executed with lock held
}

Although recent Java versions also provide explicit locking operations (in java.util.concurrent.locks), synchronized blocks have the substantial advantage that the lock is released along all paths out of the block, including when an exception is thrown, thus removing a common source of errors.

As we mentioned above, synchronized variables, or more accurately object fields, are usually declared using the volatile keyword. Since this is not a separate type, it has some, possibly surprising, consequences:

Array elements cannot be used for synchronization, since there is no way to declare an array element volatile.
As we hinted earlier, volatile affects only individual memory accesses. If i is declared as volatile int i, then ++i consists of two separate indivisible memory accesses, i.e. the increment as a whole is not indivisible.

The java.util.concurrent.atomic package provides some "atomic" types that can be used circumvent both of these.

Java.util.concurrent provides a number of other facilities to support multithreaded programs, including a substantial library of components used to synchronize or communicate between threads.

A few C++0x specifics

Following established convention, we will use the term C++0x to refer to the next C++ standard, in spite of the fact that we do not expect it to be formally adopted as an ISO standard until 2010 or 2011. A Committee Draft of the standard is currently available, and we expect a number of vendors to begin support for parts of it sooner than 2011. All our comments here apply specifically to the 2008 committee draft.

Unlike the current (2003) C++ standard, C++0x contains explicit support for threads in the language. Threads are created by creating an instance of the std::thread class, passing the constructor a function or callable object to execute.

Since C++ is not designed to provide protection against untrusted code, it guarantees nothing in the event of a data race. Any program allowing a data race produces "undefined behavior".

In the absence of data races, and provided that certain low-level library facilities are not used (see below) it again provides sequential consistency. Also like Java, this is not immediately apparent from the official language description. In this case the language description is made more complicated not by the treatment of data races, but by those low-level library facilities.

A lock can be acquired by constructing a mutex, typically of type std::mutex, and then acquiring it by passing it to the constructor of a std::lock_guard object. The lock_guard destructor will release the lock, ensuring that, as in the Java case, it is released even if an exception is thrown. Thus typical code to acquire a lock looks something like:

#include <mutex> std::mutex mtx; // The lock; shared by multiple threads ... {std::lock_guard _(mtx); ++c; }

In C++0x, an integer synchronization variable i might be declared as

atomic<int> i;

Synchronization variables have a different type from ordinary variables, and can thus provide different implementations of member functions. If i is declared as above, the implementation ensures sequentially consistent behavior for concurrent accesses as before, but it also ensures that ++i will atomically increment i, as a single indivisible operation.

(Volatile also exists in C++. For historical reasons, it means something else there.)

It currently appears that the C++0x model will be followed in some other environments as well. Current indications are that the next C standard will follow an approach similar to C++0x. OpenMP also appears to be moving towards a similar solution.

Memory locations and C / C++ bit-fields

Our definition of conflicting accesses, and indirectly that of data races, referred to simultaneous accesses to "memory locations". We saw that in Java, a memory location is just a scalar variable, object field, or array element, i.e. the smallest unit that can be separately updated by the program. Consider the following example, where x is a structure containing two small integer (e.g. char in C) fields a and b:

Thread 1	Thread 2
`x.a = 1;`	`x.b = 1;`

Here the two assignments to the a and b fields usually do not conflict, and hence there is no possible data race.

C++0x (and presumably future versions of C) have to make one exception to this rule. They allow structures to contain bit-fields that often do not consist of an integral number of bytes. Mainstream hardware does not allow such fields to be efficiently updated without overwriting adjacent fields. If a and b in the above example were small bit-fields, thread 1 would typically be compiled to

tmp = x; tmp.a = 1; x = tmp;

i.e. the whole structure would be read and written back. This does conflict with the thread 2 assignment, especially since it would be translated similarly. Thus C++0x treats all (non-zero-length) bit-fields in a contiguous sequence of bit-fields as part of a single memory location; an assignment to one conflicts with an assignment to any of the others. Such contiguous bit-fields should not be updated concurrently by different threads. Usually that means they should be protected by a single lock.

A few pthread specifics

Unfortunately, no C++0x implementations are currently available. Anyone who currently desires to use threads with C or C++ will use a single-threaded language, together with a library allowing thread creation and providing locking operations, etc.

The two most common threads library interfaces are Microsoft's Windows threads and Posix threads. We concentrate on the latter, since it is described by a decade-old international standard.

Threads are created by calling pthread_create() with the function to be executed by the new thread.

This whole approach based on an add-on threads library is not entirely satisfactory. Without a real definition of the programming language in the context of threads, it is unclear what compiler transformations are legal, and hence what the programmer is allowed to assume.

However, the intent of the Posix standard was apparently close to the "sequential consistency for data-race-free programs" we have described here. In particular, it requires that "Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it." Thus any application with a data-race has undefined behavior, as in C++0x.

There are some uncertainties and ambiguities in this definition of data race:

It is unclear whether Posix' notion of simultaneous access is the same as ours. As we will see in the next section, compilers that claim to be Posix conforming sometimes use a subtly different interpretation.
We will nonetheless continue to assume our definition of data races. The cases in which real compilers noticeably diverge from this rarely introduce real problems. And we have been unable to devise useful programming rules that take these cases into consideration. We believe that in the slightly longer term, fixing the offending compilers to enforce C++0x-like rules is both the only acceptable option, and the current trend.
Posix intentionally leaves the meaning of "memory location" undefined. In particular, it clearly allows the two threads in our struct example above to interfere with each other. Fortunately, few implementations take full advantage of this freedom. Current C and C++ implementations often do generate code that reads and rewrites adjacent small integer fields when bit-fields are updated. For example, it is common to read and rewrite the a field in a
struct {char a; int b:5; int c:11; char d;}
structure, when either the b or c fields are updated. As a result, sequences of bit-fields should be separated into a separate substructure, if they might be accessed concurrently with adjacent non-bit-fields. For example, the above might be rewritten as
struct {char a; struct {int b:5; int c:11;} bc; char d;}
This will usually both avoid the problem and increase the size of the structure.
In a few other cases, small fields, particularly char fields, may be treated essentially like bit-fields. An update may overwrite adjacent fields and introduce a data race. To our knowledge, the only example of this are compilers that generate code compatible with the first generation of Alpha processors. In this instance it can, and generally should, be avoided with compiler options that assume a more recent target machine.

We believe the intent of the Posix standard is to guarantee sequential consistency for applications that do not allow data races, at least in the absence of error returns from functions in the threads library.

Since pthreads were designed as an add-on library for C, they use function call syntax for lock acquisition and release. A typical code sequence would declare a lock as

#include <pthread.h>
pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

and then use it as

pthread_mutex_lock(&mtx); ++c; pthread_mutex_unlock(&mtx);

Unfortunately, neither Posix, nor the underlying languages, support mechanisms for turning, for example, an integer variable into a synchronization variable. Current platforms often do support approximations that are neither terribly well-defined nor consistent with our "sequential consistency for data-race-free programs" view.

On some platforms, C or C++ volatile declarations have characteristics that sometimes makes them usable for thread communication. But unlike Java, C's volatile was not originally intended for this, and it has become clear that it is not feasible to extend it to perform this function reliably across platforms.

Gcc and Microsoft compilers support atomic operations, whose names start with "__sync_" and "Interlocked" respectively, to atomically update shared variables. A number of other compilers provide similar or identical support. Neither makes the corresponding atomic read or load operations explicit, making it impossible to provide a portable guarantee of sequential consistency. Some of these primitives also violate our guarantees in other ways. Combining these primitives with (possibly volatile) load followed by a memory fence such as __sync_synchronize() should usually have the desired effect. (Note that if an ordinary, or even volatile store is used to assign to such a shared synchronization variable, it may need to be both preceded and followed by a fence in order to ensure something like sequential consistency and correctness of our introductory example.

We strongly recommend using C++0x style atomic variables as soon as they become available. The libatomic_ops package may provide another stopgap option.

Some more examples

Our definition of data race is rather stringent: There must be an actual way to execute the original, un-transformed program such that conflicting operations occur in parallel. This imposes a burden on compilers not to "break" programs by introducing harmful data races.

Consider, for example, the following program, where x and y are ordinary, non-synchronization, variables that are initialized to zero.

Thread 1	Thread 2
`if (x != 0) y = 1;`	`y = 2;`

This example does not allow a data race. There is no sequentially consistent execution of this program in which Thread 1 assigns to y, since x never becomes nonzero.

The same observation applies to

Thread 1	Thread 2
`if (x != 0) y = 1;`	`if (y != 0) x = 1;`

This program is also data-race-free.

(It appears unclear and controversial whether either of these examples are viewed by Posix as containing a data race. We avoid the ambiguity by insisting that data races must occur in sequentially consistent executions in order to be considered.)

However, if we transform the first example in this section to the following apparently similar program, the result does allow a data race:

Thread 1	Thread 2
`y = ((x != 0)? 1 : y)`	`y = 2;`

Although the two programs for thread 1 are equivalent without threads, they are not equivalent with threads. In the new version, y is assigned to unconditionally. In particular, it may be written at the same time as thread 2 executes. As usual, this data race is indicative of a likely problem. Since thread 1's code would usually not be executed indivisibly, this program may be executed as, for example,

tmp = ((x != 0)? 1 : y); // tmp = 0 y = 2; y = tmp; // y = 0

This is probably an unexpected outcome, not possible with the initial example in this section. Hence a compiler is not allowed to transform the code as we have suggested here, in spite of the fact that hardware conditional move instructions often make the last version faster, or it might allow a containing loop to be vectorized. If the faster version was intended, it will either have to be written explicitly, or the compiler must be able to determine that the variables involved are not shared.

Some more common compiler optimizations also potentially introduce data races. Consider a loop (written using C syntax) that counts the number of negative elements in a list, where the variable count is global or a static class member:

void count_neg(T *q) {
  for (T*p = q; p != 0; p = p -> next;) {
    if (p -> data < 0) ++count;
  }
}

This code clearly introduces a data race if a second thread can access count while it is being updated by this loop.

The compiler might be tempted to keep the counter temporarily in a local (register) variable, and transform the code to:

void count_neg(T *q) {
  int local_count = count;

  for (p = q; p != 0; p = p -> next;) {
    if (p -> data < 0) ++local_count;
  }
  count = local_count;
}

But this may introduce a data race. To see this, consider a program that sets pos_list to a list containing the data fields 1 and 2, and declares count as a global variable initialized to zero, and then executes:

Thread 1	Thread 2
`count_neg(pos_list);`	`count++;`

The original version of this program does not contain a data race, since pos_list contains no negative elements, and thus the count variable is neither read nor written by thread 1.

(Some people feel that this definition of data race is unnecessarily demanding, in that it is phrased in terms of specific executions, and hence takes into consideration the specific arguments passed to functions. But, especially for real programs with pointer arguments, this is the only definition that makes sense. Otherwise any function that writes through a pointer would introduce a data race with essentially everything.)

The transformed program unconditionally writes to count and thus introduces a data race.

(Although this transformation is clearly illegal in Java and C++0x, its status in Posix is unclear, and it is commonly performed by many current (2008) C compilers.)

Note that, at least in the absence of exceptions, it is allowable to transform the code to

void count_neg(T *q) {
  int local_count = 0;

  for (p = q; p != 0; p = p -> next;) {
    if (p -> data < 0) ++local_count;
  }
  if (local_count != 0) count += local_count;
}

A note on implementation of synchronization operations (optional)

Most hardware unfortunately does not make a distinction between data and synchronization operations, and provides semantics for concurrent memory accesses that are quite different from what we have proposed here, and that vary greatly from one hardware platform to another. Especially for those readers who have some familiarity with hardware memory models, we briefly suggest how ordinary memory operations and synchronization accesses might be compiled to a hypothetical machine (with some resemblance to recent ARM processors, but Itanium assembler syntax).

We emphasize that this description is very imprecise, insufficient for compiler writers, and not part of our description of the programming model. It is included here only in the hope that we might help reconcile the presentation in the rest of this paper with the reader's preconceptions of the subject.

Modern processors generally rely on cache memories or just caches to provide fast access to frequently accessed memory locations, by keeping local copies of those locations in much faster memory, typically associated with each processor. Instead of accessing main memory directly, a processor ensures that a copy of the right section of memory is in its cache, and then accesses the cached copy. Sometimes, for example when it runs out of room in the cache, some of the cache contents are written back to main memory.

Although these caches are often blamed for complex memory ordering models at the hardware level, this blame is somewhat misplaced. The hardware uses a cache coherency protocol, such as the MESI protocol to keep the contents of all the caches consistent. Normally this works by ensuring that a location in a cache can only be updated when no other processor has a copy of that location in its cache. Thus at this level, all processors share a consistent view of memory, and all memory operations are often performed almost as though operations from the different threads are interleaved. (For a much more thorough description of memory systems and caching, see for example Ulrich Drepper, "What Every Programmer Should Know About Memory".)

However, processors typically perform other kinds of optimizations that have the effect of reordering memory accesses, so that ordinary memory accesses no longer suffice for implementing synchronization operations. For example, processors generally don't need to wait for writes to complete before proceding with later computation, and so buffer incomplete writes and move on. When a subsequent load (from a different address) completes while the write is still buffered, then the reordering can become visible. This can result in the wrong outcome for something like our introductory example, if x and y in the example are synchronization variables. (For other such optimizations, see Adve and Gharachorloo, Shared Memory Consistency Models: A Tutorial, IEEE Computer, 1995.)

In order to address this issue, processors typically also provide memory fence instructions. These serve as markers to the processor to inhibit certain optimizations for specific instructions. For our hypothetical hardware, we will consider a generic fence that ensures that a memory operation appearing before the fence becomes visible to all processors before any that appears later.

A simple implementation of our generic fence would be for a processor to block at a fence until all prior instructions have completed and are visible to the entire memory system. No instruction beyond the fence is allowed to proceed until this happens. In reality, current processors provide a wide range of fences that may ensure an ordering only among a subset of operations, or even require reasoning about operations of multiple processors to determine their semantics - here, we consider only the simplest fence as described above.

To make this more precise will assume that our illustrative hardware architecture has the following characteristics:

It has load and store instructions, with suitable data sizes. We will write ld ri = [x] for the instruction that loads the contents of location x into register ri. We will write st [x] = ri that stores the contents of register ri into location x.
Ordinary loads and stores require that the accessed data is suitably aligned in memory, e.g. 4-byte values are stores on 4-byte boundaries. In return, they are executed indivisibly, i.e. they appear to other threads as having been executed completely, or not at all. (Architectures that do not require such alignment usually cannot guarantee this kind of atomicity, since a single load or store instruction may require multiple accesses to physical memory.)
Load and store instructions can be arbitrarily reordered by the hardware, in the sense that they may become visible to other processors or threads in arbitrary order.
When the results of a store instruction becomes visible to another processor, it becomes visible to all of them. (This is typically true for memory systems similar to what we described above, and is guaranteed by, for example, the X86 architecture specifications. It could be invalidated if, for example, the other thread on a core supporting two threads could see values in the write buffer before the rest of the world.)
The processor supports a fence instruction, which waits for completion of prior memory operations, as described above.
The processor supports an xchg ri, [x] instruction that atomically exchanges the values in register ri and location x. Although this exchange occurs indivisibly, it is not ordered with respect to other operations.

This description is far too sloppy to serve as the basis of a real implementation, but it hopefully conveys the right intuition.

Note that typically the fence instruction, and possibly the xchg instruction are significantly more expensive than ordinary loads and stores. The intuition that it has to wait for prior pending memory operations is often not that far off.

A typical implementation on such a machine would generate code as follows, while also making sure that the compiler itself performs no code reordering that would violate sequential consistency. These are general translations, and the compiler may be able to do better in specific cases:

Simple loads and stores

Ordinary memory accesses are compiled to ordinary load and store instructions. This is usually essential across real architectures, since anything else would involve substantial slowdowns relative to single-threaded code.

As we pointed out with some of the examples above, it is critical that the compiler not introduce stores to memory locations that might not have been modified in the original source, even if those stores store a value just read from the same location. Introducing such a store generally introduces a data race that can lead to incorrect results.

A more subtle point is that a compiler may, and often does, introduce loads that introduce data races, provided the result isn't used. For example, if the value of a variable is used repeatedly in a while-loop, it is usually fine to load it only once into a register ahead of time, even if the loop body sometimes does not execute at all. This does sometimes introduce a data race. And hence it is not acceptable to transform C++ programs in such a way, if the result is another C++ program. However a compiler usually generates machine code that does give well-defined meanings to programs with data races. Hence such a transformation, on the way to machine code, usually preserves meaning, and is acceptable. (An exception to this rule might be a currently hypothetical, but very desirable, machine that detects data races.)

Subject to the restriction on newly introduced stores, reordering ordinary loads and stores in ways that would be correct in a single-threaded program does not violate sequential consistency for data-race-free programs, so long as these operations are not moved across synchronization operations. Some movements past synchronization operations are also fine, but more complicated rules apply there. For languages like Java that make semantic guarantees for programs with races, the actual transformation rules are more complicated.

Synchronization loads

Synchronization loads are compiled as an ordinary load followed by a fence. To see why this is necessary, consider code that first waits for a shared synchronization variable to be set, and then accesses some shared data, as in our prior lazy initialization example:

declare x_init as synchronization variable;
while (!x_init);
access x which was previously initialized by another thread;

If the final read of x_init could somehow be performed after the first access to x, this code could erroneously see an uninitialized value for x. The fence instruction inserted after a synchronization load, in this case that of x_init, prevents this reordering and ensures that x is really only accessed after x_init is set.

Synchronization stores

Synchronization stores are implemented as store instructions that are both preceded and followed by a fence instruction.

To see why the preceding fence is needed, consider the thread initializing x in the preceding example. It was written as

x = initial value;
x_init = true;

If the two assignments in this thread could become visible out of order, the reading thread could again see x_init set before x is properly initialized. The fence preceding the assignment to the synchronization variable x_init prevents this reordering.

The fence following the x_init is not necessary in this example, but is required in the general case to prevent reordering of a synchronization store with a subsequent synchronization load. To see this, reexamine our introductory example, interpreting both of the variables x and y as synchronization variables:

Thread 1	Thread 2
`x = 1; r1 = y;`	`y = 1; r2 = x;`

Neither a fence preceding stores, nor a fence following loads, prevents the assignments in each thread from being reordered, which could result in a non-sequentially-consistent r1 = r2 = 0 outcome. To prevent this kind of reordering, we need a fence either following a synchronization store, or preceding a synchronization load. Since loads tend to be more frequent, we will usually choose the former convention.

Locks

A simple spin-lock could be represented as a variable l holding either 0 (unlocked) or 1 (locked). A lock operation could be implemented by setting a register, say r1 to 1, and then repeatedly executing xchg r1, [l] until r1 becomes 0, indicating that our thread managed to change l to its locked state. This would need to be followed by a fence instruction to ensure that operations intended to execute with mutual exclusion don't become visible until the lock is actually acquired. (Real implementations would typically avoid the repeated xchg operations while waiting for a lock, since they create large amounts of unnecessary communication between processor caches.)

The above spin-lock could be released by using an ordinary store instruction to reset l to 0. This time we need a leading fence to prevent operations from "escaping" the critical section by becoming visible after the unlock() operation.

Note that lock()/unlock() operations only require one fence each, not two like a synchronization store operation. This relies on an argument that the difference is not observable in the absence of data races, which in turn makes some assumptions on how locks can be accessed, as mentioned under "non-blocking lock acquisition" in the next section.

Many real machine architectures allow the hardware to perform less reordering of memory operations than we have assumed here. For example, on conventional "X86" machines, no explicit fences are required if a synchronization store is implemented with an xchg instruction, as it probably should be. In addition, synchronization loads can be implemented with ordinary load instructions, and a simple lock can be released with an ordinary store instruction.

Omissions and extensions

Here we have focussed only on issues pertaining to simple uses of shared variables across multiple threads. These are the most fundamental issues that any author of multithreaded programs should be familiar with. We've omitted some other facilities that are often provided by threads implementations. Though somewhat less frequently used, they may still be essential to your application:

Other lock types: Most environments provide "reentrant" locks that may be held multiple times, but only by a single thread. This is, for example, the behavior for the "locks" acquired by Java synchronized blocks. It is also common to provide reader-writer locks, that can be held by more than one "reader" thread, but only by a single "writer".
Condition Variables, etc.: Condition variables are the most common mechanism for having a thread wait until a particular condition is satisfied, e.g. until a shared queue is nonempty. These provide a wait() call that releases and reacquires a lock, suspending the thread for a while in between, at most until it is awoken ("notified" or signalled") by another thread. Condition variables can easily introduce deadlock into programs, if a thread is not woken up reliably. Hence they should be used cautiously. But if our only concern is that our program not produce incorrect (as opposed to no) answers, a wait(cv, lock) call behaves just like a unlock(lock); lock(lock) sequence. Thus the new issues are very much orthogonal to this paper.
Non-blocking lock acquisition: Most languages or thread libraries provide a trylock() primitive that either acquires a lock, like lock(), or returns an error indication. Unlike lock(), it never waits. This is easily accommodated in our model, provided we allow trylock() to potentially return an error indication even if the lock is available, or at least reason about our code as though this could happen. For subtle reasons this assumption is required to disallow programs that could otherwise see through the implementation's carefully provided illusion of sequential consistency. Similar observations apply for lock acquisition calls that can time out. In practice, this disallows only code that abuses trylock(), and should have used other primitives. (The alternative is a lock implementation that is significantly more expensive on many architectures. Experience suggests such implementations will not be used, whether or not standards appear to require them.)
"Escapes" from sequential consistency: Many languages allow some accesses to synchronization variables that explicitly violate sequential consistency, even for programs that contain no data races. On some platforms, using these may significantly improve performance, at the cost of a far more complicated programming model, which is not discussed here. For example, java.util.concurrent.atomic's lazySet() (Java 6 +) and weakCompareAndSet() allow implementations to reorder operations in ways inconsistent with sequential consistency. C++0x atomic objects support operations with explicit memory ordering constraint arguments (e.g. memory_order_relaxed), most of which are again inconsistent with sequential consistency.

History and acknowledgments

The notion of sequential consistency was introduced by Lamport, How to Make a Multiprocessor Computer that Correctly Executes Multiprocess Programs.

The sequential-consistency-for-data-race-free-programs approach has been the basis of the Ada programming model from the beginning. It was explored in detail, as a model to be efficiently supported by machine architectures, by Sarita Adve, starting about 20 years ago. (See for example Adve and Hill, Weak Ordering - A New Definition, ISCA 1990.) As we pointed out above, the Posix threads model at least approximates it, but many users of Posix threads seemed to be unaware of this. We believe the Windows native threads model was also intended to approximate it, but haven't found much discussion on the subject until very recently.

Also in the last few years, first the Java and then the C++0x memory models were explicitly based on this approach. Aside from the authors of those respective papers, major contributors included Lawrence Crowl, Doug Lea, Paul McKenney, Clark Nelson, Bratin Saha, and Herb Sutter.

Paul McKenney coauthored a "frequently asked questions" document whose content overlaps with this paper. Sarita Adve provided some of the text in the implementation section. We borrowed at least one of Herb Sutter's examples.

Many readers provided useful comments on earlier drafts. Rob Schreiber provided particularly extensive ones.