Generating X86_64 code for LLVM's `fcmp` instruction

Edd Barrett, Tue 15 April 2025.

tags: codegen llvm floats NaN X86_64


Introduction

Recently I was studying LLVM's X86_64 code generation for fcmp instructions and encountered a neat trick LLVM uses to emit shorter code for some comparisons. This post explains this trick, but also serves as a brief primer into unordered floating point values and NaN, since they play a key role.

The fcmp instruction and some floating point preliminaries

LLVM is a compiler construction toolkit that uses an Intermediate Language (IR) which is compiled down to native executable code for a target CPU. The IR has an instruction called fcmp that compares floating point numbers. An fcmp takes two floating point operands and a "predicate" that describes what relation you want to test. It then produces a Boolean i1 value indicating whether the relation was true or not.

Here's an example we will use throughout this post:

%r = fcmp olt float %x, %y

Here %x and %y are the float-typed SSA variables to compare and olt is the predicate.

If we look up olt in the fcmp section of the LLVM language reference then we get this description:

ordered and less than

So the "o" here corresponds with "ordered", and "lt" corresponds with "less than"[0].

They also give us an informal semantics of olt:

yields true if both operands are not a QNAN and op1 is less than op2

If you've not worked with floats before, then these quotes probably raise more questions than answers. To fully understand we have to talk about IEE 754 floating point numbers a bit (if you already know what is meant by "unordered" and "QNAN", you can skip to the next section).

IEEE 754 (the floating point standard used by most mainstream CPUs) reserves a chunk of the encoding space for so-called NaN (Not a Number) values. Some floating point operations generate one of these NaNs if, given the input operands, the result is "undefined as a number". To give a few examples[1]:

  • dividing zero by zero generates NaN
  • modulo with zero (x % 0) generates NaN
  • many float operations will propagate a NaN to their result if the there is a NaN present in the input operands

For this discussion, what's most important is that NaN is an "unordered" value which, as we shall soon see, has special properties when used in comparisons[2]. Conversely, all non-NaN floating point values are "ordered".

But the quote above talks about "QNAN", not "NaN". So what's that? Section 4.8.3.4 of the Intel manual explains[3]:

The IA-32 architecture defines two classes of NaNs: quiet NaNs (QNaNs) and signaling NaNs (SNaNs). A QNaN is a NaN with the most significant fraction bit set; an SNaN is a NaN with the most significant fraction bit clear. QNaNs are allowed to propagate through most arithmetic operations without signaling an exception. SNaNs generally signal a floating-point invalid-operation exception whenever they appear as operands in arithmetic operations.

This shines light on the semantics of fcmp, which is explicitly concerned with QNaN values as operands. Note also that the above passage talks about SNaN's potential to raise an exception when used as an operand. There's some nuance here, but for the purposes of this post, it's sufficient to know that as long as we don't provide fcmp with a SNaN operand then it won't raise an FP invalid-operation exception at runtime[4].

With any luck, you can now tie the olt predicate's description ("ordered and less than") to its semantics ("yields true if both operands are not a QNaN and op1 is less than op2").

Here are a few example fcmp instructions and their outcomes:

  • fcmp olt float 1.0, 2.0 would compute 1i1 (i.e. true)
  • fcmp olt float 2.0, 1.0 would compute 0i1 (i.e. false)
  • fcmp olt float QNaN, 1.0 would compute 0i1, because one operand is unordered[5].

With those prerequisites out of the way, we can now we can start thinking about how to generate machine code.

A naive X86_64 codegen for fcmp olt

Recapping, we want to make X86_64 executable code for this LLVM IR instruction:

%r = fcmp olt float %x, %y

Searching the Intel manual for instructions that compare floating point values, you will likely stumble on the following two instructions[6]:

  • comiss - Compare Scalar Ordered Single Precision Floating-Point Values and Set EFLAGS
  • ucomiss - Unordered Compare Scalar Single Precision Floating-Point Values and Set EFLAGS

We are in the right ball park, but we have to be careful because there's a trap for the unwary. ucomiss is described as an "unordered" comparison, and comiss is not, which implies the latter is an ordered comparison, just like our olt predicate, right? So we should use comiss?

Well, no. Intel's meaning of "unordered" and (implicitly) "ordered" here is different to the meaning used in the LLVM fcmp docs. Section 10.4.2.1 of the Intel manual explains:

The COMISS (compare scalar single precision floating-point values and set EFLAGS) and UCOMISS (unordered compare scalar single precision floating-point values and set EFLAGS) instructions compare the low values of two packed single precision floating-point operands and set the ZF, PF, and CF flags in the EFLAGS register to show the result (greater than, less than, equal, or unordered). These two instructions differ as follows: the COMISS instruction signals a floating-point invalid-operation (#I) exception when a source operand is either a QNaN or an SNaN; the UCOMISS instruction only signals an invalid-operation exception when a source operand is an SNaN. 10.4.2.2 Intel® SSE Shuffle and Unpack Instructions

In other words, comiss raises a floating point exception if provided with any kind of NaN operand (and the exception isn't masked), whereas ucomiss allows through QNaN but not SNaN. As we learned earlier, passing LLVM's fcmp a QNaN isn't grounds to raise an exception, so ucomiss is what we should use here. In fact, ucomiss is the one you want for all LLVM fcmp predicates.

ucomiss alone isn't enough though. As mentioned in the quote above, ucomiss sets some flags in the EFLAGS register that we will next have to inspect. In Section 3.3 of the manual they give you pseudo-code equivalent to the following truth table:

Result / Flag ZF PF CF
UNORDERED 1 1 1
> 0 0 0
< 0 0 1
= 1 0 0

And crucially, they also tell you:

The unordered result is returned if either source operand is a NaN (QNaN or SNaN).

To determine if the relation we are testing held, it's just a matter of interpreting the flag assignment in a way faithful to the predicate's semantics.

For %r = fcmp olt float %x, %y, the result, %r, should only be set to 1 (true) if both operands were ordered (not QNaNs) and %x < %y. An obvious way to get the right result is therefore to check that the following conditions are met:

  • PF=0 -- checks that the result was ordered (i.e. both operands were ordered) and,
  • CF=0 -- assuming the result is ordered, checks that %x < %y.

We can check these conditions with the X86_64 setCC family of instructions, so a correct assembler routine is:

; Assume:
;   - %x is in the xmm0 register.
;   - %y is in the xmm1 register.
;   - The result, %r, will go in the `al` register.
ucomiss xmm0, xmm1
setnp al                ; Is the result ordered (PF=0)? Store result in al.
setb bl                 ; Assuming the result is ordered, is xmm0 < xmm1 (CF=0)? Store result in bl.
and al, bl              ; Are both of the above true? Store result in al.

Note that there is no setCC instruction that can check both PF=0 and CF=0 at the same time, so we have to check those flags separately and use an and instruction afterwards.

A better implementation

We could stop now, but I noticed a neat trick that LLVM does to get the desired semantics with one (not two) setCC instructions.

Because mathematically x < y is equivalent to y > x, we can rewrite %r = fcmp olt float %x, %y into %r = fcmp ogt float %y, %x. Glancing at the truth table above, for an ogt predicate it's sufficient to check that CF=0 and ZF=0. And what do you know? X86_64 has an instruction that can do exactly that.

The following is therefore equivalent to, but shorter than, the code we generated in the last section:

ucomiss xmm1, xmm0      ; operands swapped
seta al                 ; Is the result ordered and xmm1 < xmm0 (CF=1 and ZF=0)? Store result in al.

Neat huh? And what's more, it's a trick that can be applied similarly to the ole, ugt and uge predicates. Here's the LLVM code where it decides whether to invert the predicate and swap the operands.

Try it out yourself

Put this in fcmp.ll:

target triple = "amd64-unknown-openbsd7.7"

define i1 @f(float %x, float %y) {
  %r = fcmp olt float %x, %y
  ret i1 %r
}

And do:

$ clang -O0 -c fcmp.ll
$ objdump -M intel -d fcmp.o

fcmp.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <f>:
   0:   0f 2e c8                ucomiss xmm1,xmm0
   3:   0f 97 c0                seta   al
   6:   c3                      ret

[ Show in compiler explorer ]

Acknowledgements

Thanks to Davin McCall and Jake Hughes for reviewing this post and providing comments.

Thanks also to @arsenm and @dzaima from LLVM discord for making me aware of LLVM's assumptions about SNaN and of the constrained float operations (discussed below).

Further reading

An interesting quirk of IEEE 754 floating point numbers is that QNaN and SNaN encodings are not canonical and therefore their encoding space can be used for other purposes.

Footnotes

  • [0]: Most of the LLVM predicates starting with "o" mean "ordered and", whereas those starting with an "u" mean "unordered or". Then there are a few outliers: true, false, ord and uno which mean "always true", "always false", "ordered" and "unordered". See here for more.
  • [1]: This list isn't exhaustive. See here for more.
  • [2]: "unordered" is in reference to NaN values being incomparable to the other values in the partial order that floats form. This is also the reason why in Rust, the float type only implements the PartialOrd trait and not the Ord trait.
  • [3]: You may also notice that LLVM tends to use capitalisation like "NAN", whereas the Intel manual tends to use "NaN". This difference is not meaningful in any way. From now on I will use the little "a" variant unless quoting. Similarly, I'll use "QNaN" and "SNaN".
  • [4]: If we do pass fcmp at least one SNaN operand then it might raise an exception. Why this is isn't really the point of this post, but nonetheless I've attempted to explain in the appendix below.
  • [5]: Note that QNaN isn't a valid literal in the LLVM IR grammar and is used here only for demonstration purposes. To construct an equivalent instruction, you'd have to find another way to make a QNaN SSA value and pass it as an operand. See the appendix for one such way.
  • [6]: There's also comisd and ucomisd for working with doubles. The s and d in the instruction names refer to single precision or double precision.

Appendix / Discussion

The following LLVM IR program shows:

  • How to unmask all floating point exceptions using a libm utility function (usually floating point exceptions are disabled by default).

  • How to make QNaN and SNaN values in LLVM IR.

  • That we can make a fcmp raise the FP invalid-operation exception when passed a SNaN, but not a QNaN.

declare i32 @feenableexcept(i32)
declare i32 @printf(ptr, ...)

@fmt = internal constant [4 x i8] c"ok\0a\00"

define i1 @cmp(float %x, float %y) noinline optnone {
  %res = fcmp ult float %x, %y
  ret i1 %res
}

define i32 @main() {
  ; turn on all fp exceptions.
  %_ = call i32 @feenableexcept(i32 -1);

  ; make a qnan
  ;
  ; would have liked to do this, but llvm won't parse a qnan literal.
  ;
  ;   %snan = fadd float 0x7fc00000, 0.0
  ;
  ; we bitcast the bit-pattern instead.
  %qbits = add i32 2143289344, 0        ; 0x7fc00000
  %qnan = bitcast i32 %qbits to float

  ; make an snan with the same trick
  %sbits = add i32 2141192192, 0        ; 0x7fa00000
  %snan = bitcast i32 %sbits to float

  ; compare values
  call i1 @cmp(float 1.0, float %qnan)
  call i32 @printf(ptr @fmt)            ; we get here and print "ok"

  call i1 @cmp(float 1.0, float %snan)
  call i32 @printf(ptr @fmt)            ; but we crash before we get here

  ret i32 0
}

If we then compile and run this, we see that the second fcmp crashes the program (the exception is delivered to userspace as SIGFPE):

$ clang -O2 nan.ll -lm
$ ./a.out
ok
zsh: floating point exception (core dumped)  ./a.out

[ Show in compiler explorer ] (but note that CE doesn't show stdout if the program crashes)

You may have noticed that I hauled the fcmp into its own function annotated noinline noopt. Why did I do that?

Here and here the LLVM docs say:

The default LLVM floating-point environment assumes that traps are disabled and status flags are not observable. Therefore, floating-point math operations do not have side effects and may be speculated freely.

and:

Floating-point math operations are allowed to treat all NaNs as if they were quiet NaNs. For example, “pow(1.0, SNaN)” may be simplified to 1.0.

In other words, LLVM will optimise fcmp based on the assumption that SNaNs can't happen. Indeed, if in the above program you remove noinline optnone from the @cmp function, then the program no longer raises and exception!

The reason for this is that the optimiser will inline @cmp, assume the fcmp operands are not SNaN, apply constant propagation and arrive at a @main like so:

...
define i32 @main() local_unnamed_addr {
  %_ = tail call i32 @feenableexcept(i32 -1)
  %1 = tail call i32 @printf(ptr nonnull @fmt)
  %2 = tail call i32 @printf(ptr nonnull @fmt)
  ret i32 0
}
...

Since the float comparisons have been eliminated, there can be no floating point exceptions.

If you don't want LLVM to make these assumptions, you can use use LLVM's "constrained" floating point intrinsics, for example llvm.experimental.constrained.fcmp. I'll leave this as an exercise for the reader though.