Edd Barrett, Thu 14 October 2021.

Introduction

This article started as a series of rough notes in a GitHub Gist, only really intended for my own personal reference. But after some badgering from Laurence Tratt, I polished it up, and here it is in blog post form.

The problem

Consider the scenario where we want to benchmark some code:

bench_input = 42;
start_time = time();
bench_output = run_bench(bench_input);
result = time() - start_time;

Looks good, right?

There is a problem: an optimising compiler will move code around to improve runtime performance. In general, a compiler is free to reorder computations that are independent of one another.

In the example, since there is no dependency between time() and the code for the benchmark itself, the compiler may try to re-order the code to this:

bench_output = run_bench(42);
start_time = time();
result = time() - start_time;

Or this [0]:

start_time = time();
result = time() - start_time;
bench_output = run_bench(42);

Or it might be able to statically determine bench_output without actually doing any computations at runtime:

start_time = time();
bench_output = 80;
result = time() - start_time;

Worse, if we don't use bench_output as an input to some side-effecting operation (e.g. printing it to stdout), then the compiler could decide that there's no point in computing it. That's bad, because in that case the compiler is free to remove the entire benchmark altogether!

start_time = time();
result = time() - start_time;

There are other variations of what is possible, but the point is, if code is reordered or removed, it's likely to lead to incorrect benchmarking timings.

To get the desired outcome, there are therefore two problems we need to solve:

Stop the benchmark computations being optimised out.
Ensure the benchmark computations stay anchored between the two calls to time().

There have been many historical incarnations of inline assembler blocks which are designed to address these problems. In this article we will look at Google's implementations for benchmarking C++ code.

DoNotOptimize()

DoNotOptimize() is a function which both forces a value to exist (prevents a value from being removed), and which enforces ordering with certain other surrounding operations.

Here is how we can use it for our benchmarking example:

bench_input = 42;
start_time = time();
DoNotOptimize(bench_input)
bench_output = run_bench(bench_input);
DoNotOptimize(bench_output)
result = time() - start_time;

This ensures that bench_input and bench_output exist (are not optimised out), and that the benchmarking code is "clamped" between the two calls to time().

Why does this work? To understand, we have to dig deeper into the implementation of DoNotOptimize(). Here is its definition (for code being compiled with clang):

inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
    asm volatile("" : "+r,m"(value) : : "memory");
}

What does this do? Well first, let's remind ourself of GNU extended inline asm syntax:

asm asm-qualifiers ( AssemblerTemplate
                 : OutputOperands
                 [ : InputOperands
                 [ : Clobbers ] ])

First the asm block is marked volatile. In the context of inline assembler, volatile ensures that the compiler may not "optimise" the contents of the asm block or remove it altogether: it must be included verbatim, even if it looks optimisable. For DoNotOptimize(), it prevents the obviously empty asm block from being deleted (the AssemblerTemplate parameter of the inline asm contains no assembler instructions).

CRITICAL POINT #1: The asm block may not be removed.

Second, DoNotOptimize() "clobbers memory" (the Clobbers asm parameter is memory). This (quoting the GNU inline asm docs):

tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands.

The set of memory addresses potentially read and written also includes global state, both internal and external to the current compilation unit, so -- in the eyes of the compiler -- the asm block may have global side-effects (even though in reality, the asm block reads and writes precisely nothing).

(Note however that memory clobbering alone may not fool the compiler into thinking that the value argument to DoNotOptimize() is potentially mutated: it may determine, via escape analysis, that value is only ever local the current frame, and that global side-effects couldn't possibly mutate value. Read on.)

CRITICAL POINT #2: The compiler must assume that the asm block has global side-effects.

Third, the output constraints ("+r,m"(value) of the asm block) state that C++ variable value, which must be allocated in either a memory or a register (r,m) is both read from and written to (+). So to the compiler (and in addition to global side-effects), the block also does value = non_deterministically_mutate(value).

So the output constraints give us two additional useful properties:

CRITICAL POINT #3: The argument to DoNotOptimize() is forced to exist.

CRITICAL POINT #4: The compiler must assume that the argument to DoNotOptimize() may be mutated by the assembler block.

The last thing we need to talk about is the time() function from our example. time() is assumed to be some external code (e.g. in libc) with an unspecified implementation. Because time() is opaque, the compiler must also assume that this call may have global side-effects[1]. This leads us to the final piece of the puzzle:

CRITICAL POINT #5: Two operations which may have global side-effects must not be reordered, because if the first operation has side-effects, the second operation must observe them.

In our example, DoNotOptimize(bench_input) is expected to observe any potential side effects of the preceding call to time(), so they cannot be re-ordered by the compiler.

Putting it all together

Once we've digested our critical points, we see why DoNotOptimize() achieves the desired effect for our example.

bench_input = 42;

// May have global side-effects.
start_time = time();

// Also may have global side-effects.
// Needs to observe any side-effects of `time()`, so can't be re-ordered before it.
// Forces `bench_input` to be materialized, despite it being a constant.
DoNotOptimize(bench_input)
// Here the compiler must assume that `bench_input` has now been mutated.

// Is expected to observe the potentially mutated value of `bench_input`, therefore
// cannot be reordered before `DoNotOptimize()`.
bench_output = run_bench(bench_input);

// May have global side-effects.
// Depends on `bench_output` so cannot be reordered above `run_bench()`.
DoNotOptimize(bench_output)

// Also may have global side-effects.
// Needs to observe any potential side effects of `DoNotOptimize(bench_output)`, so
// cannot be reordered before it.
result = time() - start_time;

Would a variant of `DoNotOptimize()` that doesn't clobber memory be useful for anything?

Consider this implementation, which I'm calling EnsureMaterialise:

inline BENCHMARK_ALWAYS_INLINE void EnsureMaterialise(Tp& value) {
    asm volatile("" : "+r,m"(value) : :); // Doesn't clobber memory.
}

Is it useful for anything?

By removing the memory clobber, the compiler no longer considers the asm block to have potentially global side-effects, meaning that we lose the ability to order the asm block with other operations that may have global side-effects. We couldn't use EnsureMaterialise() for the benchmarking example, as code could float around too much.

However, EnsureMaterialise(value) does still:

Ensure that value exists and is not optimised out.
Make the compiler think that the value is potentially mutated.

So this might be useful for some niche use-cases:

Perhaps you want to defeat constant propagation, but don't care about reorderings?
Perhaps you want to hot-patch the value at runtime (say in a JIT) and need to ensure that value has some storage space that you can overwrite.

On a related note, Google does provide a ClobberMemory() function, which only clobbers memory.

In Summary

An optimising compiler can freely reorder computations which it deems to be independent. More often than not this is the desired outcome as it can improve performance, but in some situations this is undesirable. Careful use of inline assembler can be used to tame the optimiser in such scenarios.

Footnotes

[0] EDIT 2021-10-16: These first two reorderings could occur if either: a) the compiler knows nothing about the implementation of time() (and must conservatively assume that it may read/write global state), but can prove that run_bench() reads/writes only local state and that no pointers can escape run_bench()'s frame; or b) the compiler knows that time() cannot read/write global state.
[1] In theory it might be possible to use Link Time Optimisation (LTO) to see the internals of functions in shared libraries. To my knowledge this isn't possible at the time of writing, but if it does become possible, then the compiler could determine the exact (potentially not global) side-effects of such functions and we might need additional calls to DoNotOptimize() or ClobberMemory().

Discussions

Thanks

A big thanks to:

Peter Cordes for helping me understand this stuff.
Laurence Tratt and Davin McCall for comments.

TheUnixZoo

Preventing an optimising compiler from removing or reordering your code

Introduction

The problem

DoNotOptimize()

Putting it all together

Would a variant of `DoNotOptimize()` that doesn't clobber memory be useful for anything?

In Summary

Further reading/viewing:

Footnotes

Discussions

Thanks

Introduction

The problem

DoNotOptimize()

Putting it all together

Would a variant of DoNotOptimize() that doesn't clobber memory be useful for anything?

In Summary

Further reading/viewing:

Footnotes

Discussions

Thanks

Would a variant of `DoNotOptimize()` that doesn't clobber memory be useful for anything?