tags: codegen jit profiling perf
Introduction
If you naively try to profile a program that generates code at runtime with Linux Perf, the entries in the profile for the JITted code probably won't look as you'd expect.
The reason for this is that Perf expects to be able to map virtual code addresses to symbols from ahead-of-time-compiled on-disk executables. For normal programs, that's not a tall order, since all of the executable code originates from either the "main binary executable file" or from shared object dependencies. For JITs that simply "magic up" executable code at runtime however, Perf can't (by default) do a good job for those code regions.
In this post I'll demonstrate how to instrument a toy JIT so that you can profile JITted code effectively using Perf.
A toy JIT
Let's demonstrate by profiling a program that generates and executes code at runtime. I'll do this in Rust because it's what I know best (imports elided for brevity):
fn jit_loop(reps: i32) -> ExecutableBuffer {
    let mut dasm = Assembler::new().unwrap();
    // JITted function that counts from 0 up to `reps`, before returning the counter.
    dynasm!(dasm
        ; mov rax, 0
        ; ->loop_start:
        ; cmp rax, reps
        ; je ->done
        ; add rax, 1
        ; jmp ->loop_start
        ; ->done:
        ; ret
    );
    dasm.finalize().unwrap()
}
fn main() {
    if env::args().len() != 3 {
        eprintln!("args: <iters1> <iters2>");
        process::exit(1);
    }
    let mut args = env::args().skip(1);
    let bound1 = args.next().unwrap().parse::<i32>().unwrap();
    let bound2 = args.next().unwrap().parse::<i32>().unwrap();
    let jitted_code1 = jit_loop(bound1);
    let jitted_func1: extern "C" fn() -> i32 =
        unsafe { mem::transmute(jitted_code1.ptr(AssemblyOffset(0))) };
    let jitted_code2 = jit_loop(bound2);
    let jitted_func2: extern "C" fn() -> i32 =
        unsafe { mem::transmute(jitted_code2.ptr(AssemblyOffset(0))) };
    eprintln!("Running jitted_func1():");
    println!("returned {}", jitted_func1());
    eprintln!("Running jitted_func2():");
    println!("returned {}", jitted_func2());
}
I'm using the dynasm crate to generate some simple x86_64 code at runtime. Each call to jit_loop() generates a function containing a loop that counts from zero up to a dynamic upper bound read from the command line, before returning the final value of the counter1. We generate two such functions, jitted_func1 and jitted_func2, before executing each of them and printing their results.
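To make the generated code's behaviour concrete, here's a plain Rust equivalent of what each JITted function computes (my own illustration; it's not part of the example code):

```rust
/// Rust equivalent of the JITted function: count from 0 up to
/// `reps` and return the final counter value.
fn count_to(reps: i32) -> i32 {
    let mut counter = 0;    // mov rax, 0
    while counter != reps { // cmp rax, reps; je ->done
        counter += 1;       // add rax, 1
    }                       // jmp ->loop_start
    counter                 // ret (the counter is returned in rax)
}
```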
If we build this code2 and profile it using Perf, i.e.:
$ cargo build
...
$ perf record -g --call-graph=dwarf ./target/debug/jit 1000000000 2000000000
Running jitted_func1():
returned 1000000000
Running jitted_func2():
returned 2000000000
[ perf record: Woken up 198 times to write data ]
[ perf record: Captured and wrote 49.473 MB perf.data (6143 samples) ]
$ perf report -g
Then this is what the TUI will look like:
Samples: 6K of event 'cycles:P', Event count (approx.): 3026589925
Children Self Command Shared Object Symbol
+ 66.57% 66.57% jit [JIT] tid 781072 [.] 0x00007fa2c3f9b017
+ 33.30% 33.30% jit [JIT] tid 781072 [.] 0x00007fa2c3f9c017
0.07% 0.07% jit [JIT] tid 781072 [.] 0x00007fa2c3f9c007
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start_final (inlined)
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_sysdep_start
0.02% 0.00% jit ld-linux-x86-64.so.2 [.] dl_main
...
The first two entries are our JITted code, and as we would expect from the loop bounds I provided (1 billion and 2 billion), Perf reports roughly twice as many samples3 in the JITted code function that runs for twice as long.
But it's not all plain sailing. Notice that the JITted code entries don't have symbol names, just virtual addresses? This can make understanding the profile difficult, especially for more complicated profiles with lots of JITted code entries. Further, if you hit enter on a JITted code entry and then choose "Annotate", you will be greeted with an error, e.g.:
Couldn't annotate 00007fa2c3f9b017
Internal error: Invalid -1 error code
I'm also not sure what the third JITted code entry with 0.07% overhead is all about, but we can fix all of these problems by adding Perf support to the JIT.
Adding Perf support to our JIT
There are currently two ways to add Perf support to a JIT: with "JIT map" files, or with "JIT dump" files. The latter is the more modern and recommended way, so that's what we'll use.
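For context, the older "JIT map" scheme is worth a quick look because it's so simple: the JIT appends lines of the form <start-addr-hex> <size-hex> <symbol-name> to a file named /tmp/perf-<pid>.map, which Perf consults when symbolising samples. A minimal sketch of the line format (the helper name is my own):

```rust
/// Format one entry of a legacy /tmp/perf-<pid>.map file.
/// Each line is: start address (hex), code size (hex), symbol name.
fn perf_map_line(code_start: usize, code_size: usize, name: &str) -> String {
    format!("{:x} {:x} {}\n", code_start, code_size, name)
}
```

Map files get you symbol names in perf report, but unlike JIT dump files they don't capture the machine code itself, so annotating the JITted instructions isn't possible.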
To that end, our JIT runtime has to:
- Create a JIT dump file called jit-<pid>.dump.
- mmap(2) the dump file into the address space PROT_EXEC to "mark" the JIT dump file.
- Write CODE_LOAD records into the JIT dump when JITted code is born.
Then once our JITted process has finished, there's an offline perf inject step required to get the JITted code into the profile4.
The mmap(2) marker is the way that perf inject finds which dump files to look at. The profile is scanned for mmap(2) events that map files named like jit-<pid>.dump into the address space. When it sees one that matches the PID of the process being profiled, it writes the JITted code described within into on-disk shared objects and ties the Perf profile to them5.
Implementation
With the high-level summary out of the way, let's implement this for our toy JIT. I'm going to elide liberally here, as quite a lot of the code isn't very interesting.
First, let's add a data structure that represents our JIT dump file. This will be responsible for creating a file in the format detailed in this schema:
struct JitDump {
    /// The jitdump file we will write records to.
    jitdump: std::fs::File,
    /// The current process's PID.
    pid: u32,
}
The new() method will create the file and write a JIT dump header:
impl JitDump {
    /// Create a jitdump file for the current PID.
    fn new() -> Result<Self, Box<dyn Error>> {
        let pid = unsafe { getpid() } as u32;
        let filename = format!("jit-{}.dump", pid);
        let mut f = OpenOptions::new()
            .create_new(true)
            .read(true)
            .write(true)
            .open(&filename)?;
        // Write the file header.
        //
        // magic
        f.write_u32::<NativeEndian>(0x4A695444)?;
        // version
        f.write_u32::<NativeEndian>(1)?;
        // total size of file header
        let fh_size_pos = f.stream_position()?;
        f.write_u32::<NativeEndian>(u32::MAX)?; // placeholder.
        // ...elided...
        // timestamp
        f.write_u64::<NativeEndian>(now_timestamp())?;
        // ...elided...
        // Patch the file header size field now we know how big it is.
        let end_fh_pos = f.stream_position()?;
        f.seek(SeekFrom::Start(fh_size_pos))?;
        f.write_u32::<NativeEndian>(u32::try_from(end_fh_pos).unwrap())?;
        f.seek(SeekFrom::Start(end_fh_pos))?;
        // For `perf inject` to work, we have to map the jitdump file into memory PROT_EXEC.
        let opts = MmapOptions::new();
        let _ = unsafe { opts.map_exec(&f).unwrap() };
        Ok(Self { jitdump: f, pid })
    }
}
A few notes of interest here:
- One of the fields is the total size of the file header, which we don't actually know until later, so we have to write a placeholder value and then patch it.
- One of the fields is a timestamp (in nanoseconds). For this you can choose whatever clock source you like, as long as you select the same clock in your perf record invocation. The docs recommend using CLOCK_MONOTONIC, so that's what I've used. Here's how I obtain a timestamp:
/// Create a timestamp of "now" in nanoseconds.
fn now_timestamp() -> u64 {
    let mut ts = std::mem::MaybeUninit::<timespec>::uninit();
    if unsafe { clock_gettime(CLOCK_MONOTONIC, ts.as_mut_ptr()) } != 0 {
        panic!("failed to read clock");
    }
    let ts = unsafe { ts.assume_init() };
    (ts.tv_sec as u64) * 1_000_000_000 + (ts.tv_nsec as u64)
}
Then the code to write the CODE_LOAD record for newly JITted code looks like this:
const JIT_CODE_LOAD: u32 = 0;

impl JitDump {
    /// Add a CODE_LOAD record into the jitdump file.
    fn emit_code_load_record(
        &mut self,
        code: &ExecutableBuffer,
        code_idx: u64,
    ) -> Result<(), Box<dyn Error>> {
        let pos_before = self.jitdump.stream_position()?;
        // Write record header
        //
        // id (the kind of JIT event)
        self.jitdump.write_u32::<NativeEndian>(JIT_CODE_LOAD)?;
        // ...elided...
        // vma (virtual address of jitted code start)
        #[cfg(target_pointer_width = "64")]
        let code_start = code.ptr(AssemblyOffset(0)) as u64;
        self.jitdump.write_u64::<NativeEndian>(code_start)?;
        // code_addr (same as vma for us)
        self.jitdump.write_u64::<NativeEndian>(code_start)?;
        // code_size
        self.jitdump
            .write_u64::<NativeEndian>(u64::try_from(code.len()).unwrap())?;
        // code_index
        self.jitdump.write_u64::<NativeEndian>(code_idx)?;
        // char[n]: function name (including the null terminator)
        self.jitdump
            .write_all(format!("jitted_func{}", code_idx).as_bytes())?;
        self.jitdump.write_u8(0)?;
        // native code
        self.jitdump.write_all(code)?;
        // ...elided...
        Ok(())
    }
}
Essentially here we give each chunk of JITted code a "code index" to uniquely identify it, a symbol name that we'd like to see in the Perf profile, the code's start address and length, and then the JITted executable code itself.
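For reference, here's my summary of the CODE_LOAD record layout from the jitdump specification: every record begins with a common header, and for CODE_LOAD a fixed-size payload follows, then the null-terminated function name and the native code bytes. Again this is illustrative only, not how the example code is structured:

```rust
/// Common header that prefixes every jitdump record.
#[repr(C)]
struct RecordHeader {
    id: u32,         // record type, e.g. 0 for CODE_LOAD
    total_size: u32, // total record size, header included
    timestamp: u64,  // record creation timestamp, in nanoseconds
}

/// Fixed-size part of a CODE_LOAD payload. The null-terminated
/// function name and the native code bytes follow immediately after.
#[repr(C)]
struct CodeLoadFixed {
    pid: u32,
    tid: u32,
    vma: u64,        // virtual address of the code start
    code_addr: u64,  // address the code was written to (same as vma here)
    code_size: u64,  // size of the native code in bytes
    code_index: u64, // unique index for this chunk of code
}
```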
Next we have to think about instantiating our JitDump data structure somewhere. For better or worse, Perf requires the JITted code for the entire process (i.e. PID) to be written to the same JIT dump file. Were our JIT multi-threaded, access to this JitDump instance would therefore need to be shared and synchronised so that CODE_LOAD records wouldn't get interleaved when two threads emit a CODE_LOAD record simultaneously.
Our toy JIT isn't multi-threaded, but for demonstration purposes we'll use a thread-safe implementation anyway:
/// jit dump file for the current process.
static JIT_DUMP: LazyLock<Mutex<JitDump>> = LazyLock::new(|| Mutex::new(JitDump::new().unwrap()));
Then we can make our life easier by writing a little helper function to write freshly JITted code into the JIT dump:
/// Register JITted code with perf.
fn register_jitted_code(code: &ExecutableBuffer, code_idx: u64) {
    JIT_DUMP
        .lock()
        .unwrap()
        .emit_code_load_record(code, code_idx)
        .unwrap();
}
Then all that remains is to call register_jitted_code() in the appropriate places (in our case, in main()). You want to make sure you do this before the JITted code is executed.
    ...
    let jitted_code1 = jit_loop(bound1);
    register_jitted_code(&jitted_code1, 1); // <---
    let jitted_func1: extern "C" fn() -> i32 =
        unsafe { mem::transmute(jitted_code1.ptr(AssemblyOffset(0))) };
    let jitted_code2 = jit_loop(bound2);
    register_jitted_code(&jitted_code2, 2); // <---
    let jitted_func2: extern "C" fn() -> i32 =
        unsafe { mem::transmute(jitted_code2.ptr(AssemblyOffset(0))) };
    ...
That's it. Then we can build and profile our JITted code like so:
$ perf record -k CLOCK_MONOTONIC -g --call-graph=dwarf \
./target/debug/jit 1000000000 2000000000
...
Note -k selects the clock source to use (in this case CLOCK_MONOTONIC). This must match the clock source used for the timestamps in your JIT dump files.
After this, we should see our jit-<pid>.dump file on the filesystem:
$ ls *.dump
jit-851825.dump
Next we do the perf inject step:
$ perf inject --jit --input perf.data --output perf.jit.data
Now we should have one on-disk shared object per chunk of JITted code:
$ ls *.so
jitted-851825-1.so jitted-851825-2.so
And finally, we can look at the profile report:
$ perf report -g -i perf.jit.data
Samples: 6K of event 'cycles:P', Event count (approx.): 3026404125
Children Self Command Shared Object Symbol
+ 66.61% 66.61% jit jitted-851825-2.so [.] jitted_func2
+ 33.34% 33.34% jit jitted-851825-1.so [.] jitted_func1
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_start_final (inlined)
0.03% 0.00% jit ld-linux-x86-64.so.2 [.] _dl_sysdep_start
0.02% 0.00% jit [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
0.02% 0.00% jit [kernel.kallsyms] [k] do_syscall_64
0.02% 0.00% jit ld-linux-x86-64.so.2 [.] dl_main
...
The JITted code entries now have symbol names and that mysterious third JITted code entry is gone!
Further, the annotate functionality now works too:
Samples: 6K of event 'cycles:P', 4000 Hz, Event count (approx.): 3026404125
jitted_func2 /home/vext01/tmp/tmp/jit2/jitted-851825-2.so [Percent: local period]
Percent │
│
│
│ Disassembly of section .text:
│
│ 0000000000000080 <jitted_func2>:
│ mov $0x0,%rax
0.02 │ 7:┌─→cmp $0x77359400,%rax
│ │↓ je 1c
│ │ add $0x1,%rax
99.98 │ └──jmp 7
│1c: ← ret
Summary
In this post I've shown that, with a fairly small amount of instrumentation, we can make JITted code work properly with Linux Perf.
I've pushed the full example code to GitHub here. I've also included a justfile that will build, record, inject and generate the report. Just run just in the top-level directory.
Further reading
(All git links are permalinks to the newest commit at the time of writing)
- Linux Perf patchset adding JIT dump support to Perf
- JITDUMP specification
- JVM JIT dump implementation
- V8 JIT dump implementation
- LLVM execution engine JIT dump implementation
- yk JIT dump implementation
- I'm using the SysV ABI, so the value is returned in the rax register. ↩
- I'm being lazy and using a debug build here so that we have DWARF symbols in the binary. For production profiling you'd want a release mode build with symbols, like this. ↩
- Remember that in its default guise, Perf is a "sampling" (or "statistical") profiler, so the overhead percentages you see here are just a proxy for wallclock time. ↩
- Apparently the Perf developers plan to automate the manual injection stage, but it appears to not have been implemented yet. For now you can automate injection using a wrapper script like in v8, or yk. ↩
- I was initially under the impression that the marker had to be mapped for the whole time that we'd be executing JITted code, but it turns out this is not the case. You can unmap the JIT dump file immediately if you want. That's what we do here, as when the mapping handle is dropped by Rust, the Drop implementation unmaps it. ↩