A memory barrier is an instruction given to the compiler or the CPU to ensure that all memory operations issued before the barrier complete before it, and all memory operations issued after the barrier happen after it.
Modern compilers and CPUs may reorder instructions for the sake of optimization, as long as the reordering does not change the observable behavior of a single-threaded program. In multi-threaded programs, however, this reordering can cause race conditions and other unexpected behavior. Memory barriers are used to prevent exactly that.
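For instance, here is a minimal sketch of a pair of C++11 fences (the names producer, consumer, data, and ready are illustrative, not from any real API): the release fence keeps the write to data from sinking below the flag store, and the acquire fence keeps reads from hoisting above the flag load.

#include <atomic>

int data = 0;
std::atomic<bool> ready{false};

void producer()
{
    data = 42;                                            // (1) plain write
    std::atomic_thread_fence(std::memory_order_release);  // barrier: (1) cannot move below this line
    ready.store(true, std::memory_order_relaxed);         // (2) publish the flag
}

void consumer()
{
    while (!ready.load(std::memory_order_relaxed))
        ;                                                 // spin until the flag is published
    std::atomic_thread_fence(std::memory_order_acquire);  // barrier: reads below cannot move above this line
    // here, data is guaranteed to be 42
}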
Let’s first understand compiler reordering and CPU out-of-order execution.
But doesn’t the compiler translate code line by line? And if so, why do we need a memory barrier?
Actually, the compiler does not always preserve the source order. For the sake of optimization, it sometimes reorders instructions that appear to be independent of each other.
Take a look at this code for example:
#include <atomic>

volatile int threadToBeExecuted;
int commonVariable;
int computeCommonVariable();

void spinCall(int currentThread)
{
    // Spin until it is this thread's turn.
    while (currentThread != threadToBeExecuted);
    // Do this thread's work, then pass the turn to the next thread.
    commonVariable = computeCommonVariable();
    threadToBeExecuted = currentThread + 1;
}
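For context, here is one way such a function might be driven by two threads (this driver is my own sketch, not part of the original listing; spinCall(0) goes first because globals are zero-initialized):

#include <thread>

int main()
{
    std::thread t0(spinCall, 0);
    std::thread t1(spinCall, 1);  // spins until thread 0 sets threadToBeExecuted to 1
    t0.join();
    t1.join();
}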
Now, this looks like a simple spinlock implementation, but watch what happens if we compile it with optimizations enabled:
g++ -S -O2 spin_lock.cpp -o spin_lock_optimized.s
The assembly code generated looks something like this:
.L2:
        movl    threadToBeExecuted(%rip), %eax
        cmpl    %ebx, %eax
        jne     .L2
        call    _Z21computeCommonVariablev@PLT
        addl    $1, %ebx
        movl    %ebx, threadToBeExecuted(%rip)   # threadToBeExecuted is incremented first...
        popq    %rbx
        .cfi_def_cfa_offset 8
        movl    %eax, commonVariable(%rip)       # ...and commonVariable is only written afterwards
        ret
        .cfi_endproc
In this output you can see that some instructions have been moved around. At first glance this looks like a harmless optimization, but the instructions that seemed independent of each other were not actually independent: commonVariable is now updated after threadToBeExecuted is incremented. This is a problem, because as soon as threadToBeExecuted is incremented, the second thread can exit its spin loop and access commonVariable before the first thread has finished writing it.
The assembly should actually have looked like this:
.L2:
        movl    threadToBeExecuted(%rip), %eax
        cmpl    %eax, -4(%rbp)
        setne   %al
        testb   %al, %al
        jne     .L2
        call    _Z21computeCommonVariablev@PLT
        movl    %eax, commonVariable(%rip)       # commonVariable is written first...
        movl    -4(%rbp), %eax
        addl    $1, %eax
        movl    %eax, threadToBeExecuted(%rip)   # ...and only then is threadToBeExecuted incremented
        nop
        leave
        .cfi_def_cfa 7, 8
        ret
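As a preview, one way to prevent this reordering is a memory barrier. Here is a minimal sketch using C++11 release/acquire atomics (my own variation on the earlier listing, not the original code); these orderings constrain both the compiler and the CPU:

#include <atomic>

std::atomic<int> threadToBeExecuted{0};
int commonVariable;
int computeCommonVariable();

void spinCall(int currentThread)
{
    // Acquire load: operations after the loop cannot be hoisted above it.
    while (currentThread != threadToBeExecuted.load(std::memory_order_acquire))
        ;
    commonVariable = computeCommonVariable();
    // Release store: the write to commonVariable above cannot sink below this line.
    threadToBeExecuted.store(currentThread + 1, std::memory_order_release);
}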
Similar to these compiler optimizations, the CPU itself performs optimizations that can cause unexpected behavior in a multi-threaded environment.
For example: