This post is the fourth of a series examining GHC’s support for DWARF debug information and the tooling that this support enables:
- Part 1 introduces DWARF debugging information and explains how its generation can be enabled in GHC.
- Part 2 looks at a DWARF-enabled program in `gdb` and examines some of the limitations of this style of debug information.
- Part 3 looks at the backtrace support of GHC’s runtime system and how it can be used from Haskell.
- Part 4 examines how the Linux `perf` utility can be used on GHC-compiled programs.
- Part 5 concludes the series by describing future work, related projects, and ways in which you can help.
Profiling with perf
The final application of debug information that we will examine is performance analysis, specifically profiling with the Linux `perf` tool. `perf` is a statistical profiling tool relying on, among other things, the underlying machine’s performance monitoring hardware. It can be used to profile not only time but also details of the microarchitecture such as cache misses, front-end stalls, and the like. With a bit of cleverness, one can even use `perf` to profile the allocations of a Haskell program.

Moreover, all of this profiling functionality comes with a negligible impact on the throughput of the profiled program. Of course, for all of these benefits one trades off precision and ease of interpretation of the resulting profile.
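For instance, `perf stat` can count some of these micro-architectural events over a whole run. This is a sketch: the event names below are perf’s generic aliases, and the exact set of supported counters depends on your CPU and kernel.

```shell
# Count hardware events over a full run of the program.
# "cycles", "instructions", etc. are generic perf event aliases;
# availability varies by CPU and kernel version.
perf stat -e cycles,instructions,cache-misses,branch-misses ./vector-tests-O0
```

`perf stat` prints a summary of the counters (and derived ratios such as instructions per cycle) when the program exits.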
| Attribute | `perf` | cost-center profiler |
|---|---|---|
| Runtime impact | negligible | anywhere from moderate to high |
| Relatability to source program | often hard to relate back to source program | profile structure directly reflects cost centers declared in source program |
| Provides call graphs | yes, with limited depth | yes |
| Provides call graphs on Haskell programs | not currently | yes |
| Profiled variables | time, allocations, micro-architectural counters, system calls, user- and kernel-space probe points | time, allocations |
| Determinism | profile is statistical and will likely vary from run to run | deterministic |
To see this trade-off in action, let’s return to our `vector-tests-O0` example. We can acquire a simple time profile by running the executable under `perf record`:

```
$ perf record vector-tests-O0
```

The resulting profile can be examined via `perf report`. This will show you a TUI interface along the lines of:
```
Samples: 145K of event 'cycles:ppp', Event count (approx.): 98781024840
Overhead  Command          Shared Object    Symbol
10.19%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_quotInteger_info
 5.02%  vector-tests-O0  vector-tests-O0  [.] evacuate
 3.58%  vector-tests-O0  vector-tests-O0  [.] stg_upd_frame_info+0xffffffffffc00000
 3.28%  vector-tests-O0  vector-tests-O0  [.] QuickCheckzm2zi13zi2zmac90a2a0d9e0dd2c227d795a9d4d9de22a119c3781b679f3b245300e1b658c43_TestziQuickCheckziGen_zdwzdsgamma_info
 2.96%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_timesInteger_info
 2.93%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_minusInteger_info
 2.67%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_testBitInteger_info
 2.34%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_eqIntegerzh_info
 2.30%  vector-tests-O0  vector-tests-O0  [.] randomzm1zi1zmc60864d5616c60090371cdf8e600240f388e8a9bd87aa769d8045bda89826ee2_SystemziRandom_zdwrandomIvalInteger_info
 2.18%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_plusInteger_info
 2.16%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_divInteger_info
 1.71%  vector-tests-O0  vector-tests-O0  [.] QuickCheckzm2zi13zi2zmac90a2a0d9e0dd2c227d795a9d4d9de22a119c3781b679f3b245300e1b658c43_TestziQuickCheckziGen_zdwilog2_info
 1.63%  vector-tests-O0  vector-tests-O0  [.] stg_ap_p_info+0xffffffffffc00010
 1.52%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_remInteger_info
 1.36%  vector-tests-O0  vector-tests-O0  [.] hs_popcnt64
 1.34%  vector-tests-O0  vector-tests-O0  [.] stg_PAP_apply
 1.23%  vector-tests-O0  vector-tests-O0  [.] stg_ap_0_fast
 1.22%  vector-tests-O0  vector-tests-O0  [.] stg_gc_noregs
 1.20%  vector-tests-O0  vector-tests-O0  [.] stg_newByteArrayzh
 1.15%  vector-tests-O0  vector-tests-O0  [.] stg_ap_pp_info+0xffffffffffc00000
 1.11%  vector-tests-O0  vector-tests-O0  [.] _randomzm1zi1zmc60864d5616c60090371cdf8e600240f388e8a9bd87aa769d8045bda89826ee2_SystemziRandom_b1_siHV_entry
 0.98%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_geIntegerzh_info
 0.98%  vector-tests-O0  vector-tests-O0  [.] stg_BLACKHOLE_info+0xffffffffffc00069
 0.92%  vector-tests-O0  vector-tests-O0  [.] scavenge_block
...
```
This profile shows costs from both the runtime system and compiled Haskell code. If one selects an entry from this list with the arrow keys and presses Enter, `perf` will show the annotated assembly of that function, along with the associated Haskell source.
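The same information is also available non-interactively, which can be handy for sharing or scripting. A sketch, using flags as found in recent perf versions:

```shell
# Emit the flat profile as plain text rather than the TUI.
perf report --stdio

# Annotate a single symbol (here the RTS's "evacuate") from the command line.
perf annotate --stdio evacuate
```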
In principle, `perf record` can also give us call-graph information. We will examine this in depth in the next section.
Paths to call-stack profiling
In addition to the flat profiles we saw earlier, `perf` can also produce call-graph profiles, which sample not only the current location of execution but also the callers of the enclosing function. Such call-graph profiles are produced with the `perf record -g` command.
Unlike flat profiles, call-graph capture tends to be quite language- and hardware-dependent. Consequently, `perf record` currently provides three mechanisms for collecting call stacks:

- via the frame pointer register: GHC doesn’t track the frame pointer like many imperative languages do, so this is unusable on Haskell programs.
- via the last branch record (LBR): This takes advantage of the LBR, a hardware feature of Intel CPUs, and is only really usable on modern (e.g. Skylake and later) hardware. However, even on Skylake it provides only a very limited stack depth (32 branches, of which many will be uninteresting).
- via DWARF unwinding: This uses DWARF unwinding information to decode the state of the stack, which is captured at sampling time. This capture can result in non-trivial runtime overhead.
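Each mechanism can be requested explicitly via `perf record`’s `--call-graph` flag. A sketch (the `dwarf` mode additionally accepts an optional stack-dump size):

```shell
# Select the call-graph capture mechanism explicitly.
perf record --call-graph fp    ./vector-tests-O0  # frame pointer (unusable for GHC-compiled code)
perf record --call-graph lbr   ./vector-tests-O0  # last branch record (recent Intel CPUs only)
perf record --call-graph dwarf ./vector-tests-O0  # DWARF unwinding (higher overhead)
```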
Recent Intel hardware also includes another, more expensive hardware mechanism, the Branch Trace Store, which could be useful for deeper call-graph acquisition, but `perf record` currently does not support using it for call-graph capture.
| Attribute | frame pointer | LBR | DWARF |
|---|---|---|---|
| Maximum depth | unlimited | 32 on Skylake | unlimited |
| Runtime overhead | negligible | negligible | small |
| Compatible with GHC execution model | no | yes? | not currently |
Unfortunately, call-stack profiling support for Haskell is complicated by the fact that GHC currently uses machine registers in a rather non-standard way: rather than using the machine’s stack pointer register (e.g. `%rsp` on x86-64) to store the Haskell stack pointer, GHC instead reserves another register (`%rbp` on x86-64) for this purpose. This choice simplifies interaction with foreign code at the expense of one usable register.

While DWARF’s representation of unwinding information is sufficiently flexible to encode GHC’s non-standard stack register, the unwinding logic in the Linux kernel used by `perf record` sadly is not. For this reason, the unwind information produced by `ghc -g` is of little use when profiling with `perf record -g`.
Other than DWARF unwinding, the only viable call-graph option is the last branch record (LBR). Unfortunately, the rather restrictive 32-branch depth limit and the fact that GHC does not use the traditional `call` instruction mean that in practice the call graphs produced by this method tend not to be very useful.
In sum, I currently do not have a prescription for call-graph profiling of Haskell programs with `perf`. The next and final post will conclude this discussion of GHC’s debug information support by looking at future directions (including solutions to this call-graph problem) and other related projects.