Edd Barrett, Tue 15 April 2025.
tags: codegen llvm floats NaN X86_64
Introduction
Recently I was studying LLVM's X86_64 code generation for fcmp
instructions
and encountered a neat trick LLVM uses to emit shorter code for some
comparisons. This post explains this trick, but also serves as a brief primer
into unordered floating point values and NaN, since they play a key role.
The fcmp
instruction and some floating point preliminaries
LLVM is a compiler construction toolkit that uses an Intermediate Language (IR)
which is compiled down to native executable code for a target CPU. The IR
has an instruction called fcmp
that compares floating point numbers. An
fcmp
takes two floating point operands and a "predicate" that describes what
relation you want to test. It then produces a Boolean i1
value indicating
whether the relation was true or not.
Here's an example we will use throughout this post:
%r = fcmp olt float %x, %y
Here %x
and %y
are the float-typed SSA
variables to
compare and olt
is the predicate.
If we look up olt
in the fcmp
section of the LLVM language reference
then we get
this description:
ordered and less than
So the "o" here corresponds with "ordered", and "lt" corresponds with
"less than"[0].
They also give us an informal semantics of olt
:
yields true if both operands are not a QNAN and op1 is less than op2
If you've not worked with floats before, then these quotes probably raise more
questions than answers. To fully understand we have to talk about IEE 754
floating point numbers a bit (if you already know what is meant by "unordered"
and "QNAN", you can skip to the next section).
IEEE 754 (the floating point standard used by most mainstream CPUs) reserves a
chunk of the encoding space for so-called NaN (Not a Number) values. Some
floating point operations generate one of these NaNs if, given the input
operands, the result is "undefined as a number". To give a few
examples[1]:
- dividing zero by zero generates NaN
- modulo with zero (
x % 0
) generates NaN
- many float operations will propagate a NaN to their result if the there is a
NaN present in the input operands
For this discussion, what's most important is that NaN is an "unordered" value
which, as we shall soon see, has special properties when used in
comparisons[2]. Conversely, all non-NaN floating point values
are "ordered".
But the quote above talks about "QNAN", not "NaN". So what's that? Section
4.8.3.4 of the Intel manual explains[3]:
The IA-32 architecture defines two classes of NaNs: quiet NaNs (QNaNs) and
signaling NaNs (SNaNs). A QNaN is a NaN with the most significant fraction
bit set; an SNaN is a NaN with the most significant fraction bit clear. QNaNs
are allowed to propagate through most arithmetic operations without signaling
an exception. SNaNs generally signal a floating-point invalid-operation
exception whenever they appear as operands in arithmetic operations.
This shines light on the semantics of fcmp
, which is explicitly concerned
with QNaN values as operands. Note also that the above passage talks about
SNaN's potential to raise an exception when used as an operand. There's some
nuance here, but for the purposes of this post, it's sufficient to know that as
long as we don't provide fcmp
with a SNaN operand then it won't raise an FP
invalid-operation exception at runtime[4].
With any luck, you can now tie the olt
predicate's description ("ordered and
less than") to its semantics ("yields true if both operands are not a QNaN and
op1 is less than op2").
Here are a few example fcmp
instructions and their outcomes:
fcmp olt float 1.0, 2.0
would compute 1i1
(i.e. true)
fcmp olt float 2.0, 1.0
would compute 0i1
(i.e. false)
fcmp olt float QNaN, 1.0
would compute 0i1
, because one operand is
unordered[5].
With those prerequisites out of the way, we can now we can start thinking about
how to generate machine code.
A naive X86_64 codegen for fcmp olt
Recapping, we want to make X86_64 executable code for this LLVM IR instruction:
%r = fcmp olt float %x, %y
Searching the Intel manual for instructions that compare floating point values,
you will likely stumble on the following two instructions[6]:
comiss
- Compare Scalar Ordered Single Precision Floating-Point Values and
Set EFLAGS
ucomiss
- Unordered Compare Scalar Single Precision Floating-Point Values
and Set EFLAGS
We are in the right ball park, but we have to be careful because there's a trap
for the unwary. ucomiss
is described as an "unordered" comparison, and
comiss
is not, which implies the latter is an ordered comparison, just like
our olt
predicate, right? So we should use comiss
?
Well, no. Intel's meaning of "unordered" and (implicitly) "ordered" here is
different to the meaning used in the LLVM fcmp
docs. Section 10.4.2.1 of the
Intel manual explains:
The COMISS (compare scalar single precision floating-point values and set
EFLAGS) and UCOMISS (unordered compare scalar single precision floating-point
values and set EFLAGS) instructions compare the low values of two packed single
precision floating-point operands and set the ZF, PF, and CF flags in the
EFLAGS register to show the result (greater than, less than, equal, or
unordered). These two instructions differ as follows: the COMISS instruction
signals a floating-point invalid-operation (#I) exception when a source operand
is either a QNaN or an SNaN; the UCOMISS instruction only signals an
invalid-operation exception when a source operand is an SNaN.
10.4.2.2 Intel® SSE Shuffle and Unpack Instructions
In other words, comiss
raises a floating point exception if provided with any
kind of NaN operand (and the exception isn't masked), whereas ucomiss
allows
through QNaN but not SNaN. As we learned earlier, passing LLVM's fcmp
a QNaN
isn't grounds to raise an exception, so ucomiss
is what we should use here.
In fact, ucomiss
is the one you want for all LLVM fcmp
predicates.
ucomiss
alone isn't enough though. As mentioned in the quote above, ucomiss
sets some flags in the EFLAGS
register that we will next have to inspect. In
Section 3.3 of the manual they give you pseudo-code equivalent to the following
truth table:
Result / Flag |
ZF |
PF |
CF |
UNORDERED |
1 |
1 |
1 |
> |
0 |
0 |
0 |
< |
0 |
0 |
1 |
= |
1 |
0 |
0 |
And crucially, they also tell you:
The unordered result is returned if either source operand is a NaN (QNaN or
SNaN).
To determine if the relation we are testing held, it's just a matter of
interpreting the flag assignment in a way faithful to the predicate's
semantics.
For %r = fcmp olt float %x, %y
, the result, %r
, should only be set to 1
(true) if both operands were ordered (not QNaNs) and %x < %y
. An obvious
way to get the right result is therefore to check that the following conditions
are met:
PF=0
-- checks that the result was ordered (i.e. both operands were ordered) and,
CF=0
-- assuming the result is ordered, checks that %x < %y
.
We can check these conditions with the X86_64 setCC
family of instructions,
so a correct assembler routine is:
; Assume:
; - %x is in the xmm0 register.
; - %y is in the xmm1 register.
; - The result, %r, will go in the `al` register.
ucomiss xmm0, xmm1
setnp al ; Is the result ordered (PF=0)? Store result in al.
setb bl ; Assuming the result is ordered, is xmm0 < xmm1 (CF=0)? Store result in bl.
and al, bl ; Are both of the above true? Store result in al.
Note that there is no setCC
instruction that can check both PF=0
and
CF=0
at the same time, so we have to check those flags separately and use an
and
instruction afterwards.
A better implementation
We could stop now, but I noticed a neat trick that LLVM does to get the desired
semantics with one (not two) setCC
instructions.
Because mathematically x < y
is equivalent to y > x
, we can rewrite %r =
fcmp olt float %x, %y
into %r = fcmp ogt float %y, %x
. Glancing at the truth
table above, for an ogt
predicate it's sufficient to check that CF=0
and
ZF=0
. And what do you know? X86_64 has an instruction that can do exactly
that.
The following is therefore equivalent to, but shorter than, the code we
generated in the last section:
ucomiss xmm1, xmm0 ; operands swapped
seta al ; Is the result ordered and xmm1 < xmm0 (CF=1 and ZF=0)? Store result in al.
Neat huh? And what's more, it's a trick that can be applied similarly to the
ole
, ugt
and uge
predicates. Here's the LLVM code where it decides
whether to invert the predicate and swap the
operands.
Try it out yourself
Put this in fcmp.ll
:
target triple = "amd64-unknown-openbsd7.7"
define i1 @f(float %x, float %y) {
%r = fcmp olt float %x, %y
ret i1 %r
}
And do:
$ clang -O0 -c fcmp.ll
$ objdump -M intel -d fcmp.o
fcmp.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <f>:
0: 0f 2e c8 ucomiss xmm1,xmm0
3: 0f 97 c0 seta al
6: c3 ret
[ Show in compiler explorer ]
Acknowledgements
Thanks to Davin McCall and Jake Hughes for reviewing this post and providing
comments.
Thanks also to @arsenm and @dzaima from LLVM discord for making me aware of
LLVM's assumptions about SNaN and of the constrained float operations
(discussed below).
Further reading
An interesting quirk of IEEE 754 floating point numbers is that QNaN and SNaN
encodings are not canonical and therefore their encoding space can be used for
other purposes.
Footnotes
- [0]: Most of the LLVM predicates starting with "o"
mean "ordered and", whereas those starting with an "u" mean "unordered
or". Then there are a few outliers:
true
, false
, ord
and uno
which
mean "always true", "always false", "ordered" and "unordered".
See here for more.
- [1]: This list isn't exhaustive. See
here for more.
- [2]: "unordered" is in reference to NaN values
being incomparable to the other values in the partial order that floats form.
This is also the reason why in Rust, the
float
type only implements the
PartialOrd
trait
and not the Ord
trait.
- [3]: You may also notice that LLVM tends to use
capitalisation like "NAN", whereas the Intel manual tends to use "NaN". This
difference is not meaningful in any way. From now on I will use the little "a"
variant unless quoting. Similarly, I'll use "QNaN" and "SNaN".
- [4]: If we do pass
fcmp
at least one SNaN operand
then it might raise an exception. Why this is isn't really the point of
this post, but nonetheless I've attempted to explain in the
appendix below.
- [5]: Note that
QNaN
isn't a valid literal in the LLVM
IR grammar and is used here only for demonstration purposes. To construct an
equivalent instruction, you'd have to find another way to make a QNaN SSA
value and pass it as an operand. See the appendix for one such
way.
- [6]: There's also
comisd
and ucomisd
for working with
double
s. The s
and d
in the instruction names refer to s
ingle
precision or d
ouble precision.
Appendix / Discussion
The following LLVM IR program shows:
-
How to unmask all floating point exceptions using a libm
utility function
(usually floating point exceptions are disabled by default).
-
How to make QNaN and SNaN values in LLVM IR.
-
That we can make a fcmp
raise the FP invalid-operation exception when
passed a SNaN, but not a QNaN.
declare i32 @feenableexcept(i32)
declare i32 @printf(ptr, ...)
@fmt = internal constant [4 x i8] c"ok\0a\00"
define i1 @cmp(float %x, float %y) noinline optnone {
%res = fcmp ult float %x, %y
ret i1 %res
}
define i32 @main() {
; turn on all fp exceptions.
%_ = call i32 @feenableexcept(i32 -1);
; make a qnan
;
; would have liked to do this, but llvm won't parse a qnan literal.
;
; %snan = fadd float 0x7fc00000, 0.0
;
; we bitcast the bit-pattern instead.
%qbits = add i32 2143289344, 0 ; 0x7fc00000
%qnan = bitcast i32 %qbits to float
; make an snan with the same trick
%sbits = add i32 2141192192, 0 ; 0x7fa00000
%snan = bitcast i32 %sbits to float
; compare values
call i1 @cmp(float 1.0, float %qnan)
call i32 @printf(ptr @fmt) ; we get here and print "ok"
call i1 @cmp(float 1.0, float %snan)
call i32 @printf(ptr @fmt) ; but we crash before we get here
ret i32 0
}
If we then compile and run this, we see that the second fcmp
crashes the
program (the exception is delivered to userspace as SIGFPE
):
$ clang -O2 nan.ll -lm
$ ./a.out
ok
zsh: floating point exception (core dumped) ./a.out
[ Show in compiler explorer ] (but note that CE doesn't show stdout if the
program crashes)
You may have noticed that I hauled the fcmp
into its own function annotated
noinline noopt
. Why did I do that?
Here and
here
the LLVM docs say:
The default LLVM floating-point environment assumes that traps are disabled
and status flags are not observable. Therefore, floating-point math
operations do not have side effects and may be speculated freely.
and:
Floating-point math operations are allowed to treat all NaNs as if they were
quiet NaNs. For example, “pow(1.0, SNaN)” may be simplified to 1.0.
In other words, LLVM will optimise fcmp
based on the assumption that SNaNs
can't happen. Indeed, if in the above program you remove noinline optnone
from the @cmp
function, then the program no longer raises and exception!
The reason for this is that the optimiser will inline @cmp
, assume the fcmp
operands are not SNaN, apply constant propagation and arrive at a @main
like
so:
...
define i32 @main() local_unnamed_addr {
%_ = tail call i32 @feenableexcept(i32 -1)
%1 = tail call i32 @printf(ptr nonnull @fmt)
%2 = tail call i32 @printf(ptr nonnull @fmt)
ret i32 0
}
...
Since the float comparisons have been eliminated, there can be no floating
point exceptions.
If you don't want LLVM to make these assumptions, you can use use LLVM's
"constrained" floating point
intrinsics, for example
llvm.experimental.constrained.fcmp
.
I'll leave this as an exercise for the reader though.