Edd Barrett, Thu 14 October 2021.
tags: Compilers Optimisation
Introduction
This article started as a series of rough notes in a GitHub Gist, only really
intended for my own personal reference. But after some badgering from Laurence
Tratt, I polished it up, and here it is in blog post form.
The problem
Consider the scenario where we want to benchmark some code:
bench_input = 42;
start_time = time();
bench_output = run_bench(bench_input);
result = time() - start_time;
Looks good, right?
There is a problem: an optimising compiler will move code around to improve
runtime performance. In general, a compiler is free to reorder computations
that are independent of one another.
In the example, since there is no dependency between time() and the code for the benchmark itself, the compiler may try to re-order the code to this:
bench_output = run_bench(42);
start_time = time();
result = time() - start_time;
Or this [0]:
start_time = time();
result = time() - start_time;
bench_output = run_bench(42);
Or it might be able to statically determine bench_output without actually doing any computations at runtime:
start_time = time();
bench_output = 80;
result = time() - start_time;
Worse, if we don't use bench_output as an input to some side-effecting operation (e.g. printing it to stdout), then the compiler could decide that there's no point in computing it. That's bad, because in that case the compiler is free to remove the entire benchmark altogether!
start_time = time();
result = time() - start_time;
There are other possible variations, but the point is this: if code is reordered or removed, it is likely to lead to incorrect benchmark timings.
To get the desired outcome, there are therefore two problems we need to solve:
- Stop the benchmark computations being optimised out.
- Ensure the benchmark computations stay anchored between the two calls to time().
There have been many historical incarnations of inline assembler blocks which
are designed to address these problems. In this article we will look at Google's
implementations for benchmarking C++ code.
DoNotOptimize()
DoNotOptimize() is a function that both forces a value to exist (prevents it from being optimised out) and enforces ordering with certain other surrounding operations.
Here is how we can use it for our benchmarking example:
bench_input = 42;
start_time = time();
DoNotOptimize(bench_input);
bench_output = run_bench(bench_input);
DoNotOptimize(bench_output);
result = time() - start_time;
This ensures that bench_input and bench_output exist (are not optimised out), and that the benchmarking code is "clamped" between the two calls to time().
Why does this work? To understand, we have to dig deeper into the implementation of DoNotOptimize(). Here is its definition (for code being compiled with Clang):
template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void DoNotOptimize(Tp& value) {
  asm volatile("" : "+r,m"(value) : : "memory");
}
What does this do? First, let's remind ourselves of the GNU extended inline asm syntax:
asm asm-qualifiers ( AssemblerTemplate
: OutputOperands
[ : InputOperands
[ : Clobbers ] ])
First, the asm block is marked volatile. In the context of inline assembler, volatile ensures that the compiler may not "optimise" the contents of the asm block or remove it altogether: it must be included verbatim, even if it looks optimisable. For DoNotOptimize(), this prevents the obviously empty asm block from being deleted, even though its AssemblerTemplate parameter contains no assembler instructions at all.
CRITICAL POINT #1: The asm block may not be removed.
Second, DoNotOptimize() "clobbers memory" (the Clobbers asm parameter is memory). Quoting the GNU inline asm docs, this "tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands". The set of memory addresses potentially read and written also includes global state, both internal and external to the current compilation unit, so -- in the eyes of the compiler -- the asm block may have global side-effects (even though in reality, the asm block reads and writes precisely nothing).
(Note however that memory clobbering alone may not fool the compiler into thinking that the value argument to DoNotOptimize() is potentially mutated: it may determine, via escape analysis, that value is only ever local to the current frame, and that global side-effects couldn't possibly mutate value. Read on.)
CRITICAL POINT #2: The compiler must assume that the asm block has global side-effects.
Third, the output constraint of the asm block ("+r,m"(value)) states that the C++ variable value, which must be allocated in either a register or memory (r,m), is both read from and written to (+). So to the compiler (and in addition to the global side-effects), the block also does value = non_deterministically_mutate(value).
So the output constraints give us two additional useful properties:
CRITICAL POINT #3: The argument to DoNotOptimize() is forced to exist.
CRITICAL POINT #4: The compiler must assume that the argument to DoNotOptimize() may be mutated by the assembler block.
The last thing we need to talk about is the time() function from our example. time() is assumed to be some external code (e.g. in libc) with an unspecified implementation. Because time() is opaque, the compiler must also assume that this call may have global side-effects[1]. This leads us to the final piece of the puzzle:
CRITICAL POINT #5: Two operations which may have global side-effects must not be reordered, because if the first operation has side-effects, the second operation must observe them.
In our example, DoNotOptimize(bench_input) is expected to observe any potential side-effects of the preceding call to time(), so they cannot be re-ordered by the compiler.
Putting it all together
Once we've digested our critical points, we can see why DoNotOptimize() achieves the desired effect for our example.
bench_input = 42;
// May have global side-effects.
start_time = time();
// Also may have global side-effects.
// Needs to observe any side-effects of `time()`, so can't be re-ordered before it.
// Forces `bench_input` to be materialized, despite it being a constant.
DoNotOptimize(bench_input);
// Here the compiler must assume that `bench_input` has now been mutated.
// Is expected to observe the potentially mutated value of `bench_input`, therefore
// cannot be reordered before `DoNotOptimize()`.
bench_output = run_bench(bench_input);
// May have global side-effects.
// Depends on `bench_output` so cannot be reordered above `run_bench()`.
DoNotOptimize(bench_output);
// Also may have global side-effects.
// Needs to observe any potential side effects of `DoNotOptimize(bench_output)`, so
// cannot be reordered before it.
result = time() - start_time;
Would a variant of DoNotOptimize() that doesn't clobber memory be useful for anything?
Consider this implementation, which I'm calling EnsureMaterialise:
template <class Tp>
inline BENCHMARK_ALWAYS_INLINE void EnsureMaterialise(Tp& value) {
  asm volatile("" : "+r,m"(value) : :); // Doesn't clobber memory.
}
Is it useful for anything?
By removing the memory clobber, we stop the compiler from considering the asm block to have potentially global side-effects, meaning that we lose the ability to order the asm block with respect to other operations that may have global side-effects. We couldn't use EnsureMaterialise() for the benchmarking example, as code could float around too much.
However, EnsureMaterialise(value) does still:
- Ensure that value exists and is not optimised out.
- Make the compiler think that value is potentially mutated.
So this might be useful for some niche use-cases:
- Perhaps you want to defeat constant propagation, but don't care about reorderings?
- Perhaps you want to hot-patch the value at runtime (say, in a JIT) and need to ensure that value has some storage space that you can overwrite.
On a related note, Google also provides a ClobberMemory() function, which only clobbers memory.
In Summary
An optimising compiler can freely reorder computations that it deems to be independent. More often than not this is the desired outcome, as it can improve performance, but in some situations it is undesirable. Careful use of inline assembler can tame the optimiser in such scenarios.
Footnotes
- [0] EDIT 2021-10-16: These first two reorderings could occur if either: a) the compiler knows nothing about the implementation of time() (and must conservatively assume that it may read/write global state), but can prove that run_bench() reads/writes only local state and that no pointers can escape run_bench()'s frame; or b) the compiler knows that time() cannot read/write global state.
- [1] In theory it might be possible to use Link Time Optimisation (LTO) to see the internals of functions in shared libraries. To my knowledge this isn't possible at the time of writing, but if it does become possible, then the compiler could determine the exact (potentially not global) side-effects of such functions, and we might need additional calls to DoNotOptimize() or ClobberMemory().
Thanks
A big thanks to:
- Peter Cordes for helping me understand this stuff.
- Laurence Tratt and Davin McCall for comments.