This post is the fourth of a series examining GHC’s support for DWARF debug information and the tooling that this support enables:
- Part 1 introduces DWARF debugging information and explains how its generation can be enabled in GHC.
- Part 2 looks at a DWARF-enabled program in `gdb` and examines some of the limitations of this style of debug information.
- Part 3 looks at the backtrace support of GHC’s runtime system and how it can be used from Haskell.
- Part 4 examines how the Linux `perf` utility can be used on GHC-compiled programs.
- Part 5 concludes the series by describing future work, related projects, and ways in which you can help.
The final application of debug information that we will examine is performance analysis, specifically profiling with the Linux `perf` utility. `perf` is a statistical profiling tool that relies on, among other things, the underlying machine’s performance monitoring hardware. It can be used to profile not only time but also microarchitectural details such as cache misses, front-end stalls, and the like. With a bit of cleverness, one can even use `perf` to profile the allocations of a Haskell program.
Moreover, all of this profiling functionality comes with a negligible impact on the throughput of the profiled program. Of course, for all of these benefits one trades off precision and ease of interpretation of the resulting profile. The table below compares `perf` with GHC’s cost-centre profiler:
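For instance, a handful of such counters can be tallied over a whole run directly from the command line. The event names below are `perf`’s generic aliases; which events are actually available depends on your CPU and kernel:

```shell
# Count microarchitectural events over a complete run of the program
# (generic event aliases; availability depends on CPU and kernel):
$ perf stat -e cycles,instructions,cache-misses,branch-misses vector-tests-O0
```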
| | `perf` | cost-centre profiling |
|---|---|---|
| Runtime impact | negligible | anywhere from moderate to high |
| Relatability to source program | often hard to relate back to source program | profile structure directly reflects cost centres declared in source program |
| Provides call graphs | yes, with limited depth | yes |
| Provides call graphs on Haskell programs | not currently | yes |
| Profiled variables | time, allocations, micro-architectural counters, system calls, user- and kernel-space probe points | time, allocations |
| Determinism | profile is statistical and will likely vary from run to run | deterministic |
To see this trade-off in action, let’s return to our `vector-tests-O0` example. We can acquire a simple time profile by running the executable under `perf record`:

```
$ perf record vector-tests-O0
```
The resulting profile can be examined via `perf report`. This will show a TUI interface along the lines of:
```
Samples: 145K of event 'cycles:ppp', Event count (approx.): 98781024840
Overhead  Command          Shared Object    Symbol
  10.19%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_quotInteger_info
   5.02%  vector-tests-O0  vector-tests-O0  [.] evacuate
   3.58%  vector-tests-O0  vector-tests-O0  [.] stg_upd_frame_info+0xffffffffffc00000
   3.28%  vector-tests-O0  vector-tests-O0  [.] QuickCheckzm2zi13zi2zmac90a2a0d9e0dd2c227d795a9d4d9de22a119c3781b679f3b245300e1b658c43_TestziQuickCheckziGen_zdwzdsgamma_info
   2.96%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_timesInteger_info
   2.93%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_minusInteger_info
   2.67%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_testBitInteger_info
   2.34%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_eqIntegerzh_info
   2.30%  vector-tests-O0  vector-tests-O0  [.] randomzm1zi1zmc60864d5616c60090371cdf8e600240f388e8a9bd87aa769d8045bda89826ee2_SystemziRandom_zdwrandomIvalInteger_info
   2.18%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_plusInteger_info
   2.16%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_divInteger_info
   1.71%  vector-tests-O0  vector-tests-O0  [.] QuickCheckzm2zi13zi2zmac90a2a0d9e0dd2c227d795a9d4d9de22a119c3781b679f3b245300e1b658c43_TestziQuickCheckziGen_zdwilog2_info
   1.63%  vector-tests-O0  vector-tests-O0  [.] stg_ap_p_info+0xffffffffffc00010
   1.52%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_remInteger_info
   1.36%  vector-tests-O0  vector-tests-O0  [.] hs_popcnt64
   1.34%  vector-tests-O0  vector-tests-O0  [.] stg_PAP_apply
   1.23%  vector-tests-O0  vector-tests-O0  [.] stg_ap_0_fast
   1.22%  vector-tests-O0  vector-tests-O0  [.] stg_gc_noregs
   1.20%  vector-tests-O0  vector-tests-O0  [.] stg_newByteArrayzh
   1.15%  vector-tests-O0  vector-tests-O0  [.] stg_ap_pp_info+0xffffffffffc00000
   1.11%  vector-tests-O0  vector-tests-O0  [.] _randomzm1zi1zmc60864d5616c60090371cdf8e600240f388e8a9bd87aa769d8045bda89826ee2_SystemziRandom_b1_siHV_entry
   0.98%  vector-tests-O0  vector-tests-O0  [.] integerzmwiredzmin_GHCziIntegerziType_geIntegerzh_info
   0.98%  vector-tests-O0  vector-tests-O0  [.] stg_BLACKHOLE_info+0xffffffffffc00069
   0.92%  vector-tests-O0  vector-tests-O0  [.] scavenge_block
   ...
```
This profile shows costs from both the runtime system and compiled Haskell code. If one selects an entry from this list with the arrow keys and presses enter, `perf` will show the annotated assembler of the function, along with the associated Haskell source. `perf record` can also give us call-graph information; we will examine this in depth in the next section.
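The same views are also available non-interactively. Assuming the profile above has been recorded to the default `perf.data`, one might run (the symbol name here is just an example taken from the profile above):

```shell
# Print the flat profile to stdout instead of opening the TUI
$ perf report --stdio

# Show annotated assembly (and source, where debug info allows)
# for a single symbol from the profile, e.g. the GC's evacuate:
$ perf annotate --stdio evacuate
```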
Paths to call-stack profiling
In addition to the flat profiles we saw earlier, `perf` can also produce call-graph profiles, which sample not only the current location of execution but also the callers of the enclosing function. Such call-graph profiles are produced with the `perf record -g` command.
Unlike flat profiles, call-graph capture tends to be quite language- and hardware-dependent. Consequently, `perf record` currently provides three mechanisms for collecting call-stacks:
- via the frame pointer register: GHC does not maintain a frame pointer as many imperative languages do, so this is unusable on Haskell programs.
- via the last branch record (LBR): This takes advantage of the LBR, a hardware feature of Intel CPUs, and is only really usable on modern (e.g. Skylake and later) hardware. Even then it provides only a very limited stack depth (32 branches on Skylake, many of which will be uninteresting).
- via DWARF unwinding: This uses DWARF unwinding information to decode the state of the stack, which is captured at sampling time. This capture can result in non-trivial runtime overhead.
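The mechanism can be selected explicitly with `perf record`’s `--call-graph` flag (`-g` alone uses the default, typically frame pointers); whether each mode actually works depends on your hardware and kernel:

```shell
$ perf record --call-graph fp    vector-tests-O0   # frame-pointer walking
$ perf record --call-graph lbr   vector-tests-O0   # Intel last branch record
$ perf record --call-graph dwarf vector-tests-O0   # DWARF-based unwinding
$ perf report -g                                   # view the call graphs
```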
Recent Intel hardware also includes another, more expensive hardware mechanism, the Branch Trace Store, which could be useful for deeper call-graph acquisition, but `perf record` does not currently support using it for call-graph capture.
| | frame pointer | LBR | DWARF unwinding |
|---|---|---|---|
| maximum depth | unlimited | 32 on Skylake | unlimited |
| compatible with GHC execution model | no | yes? | not currently |
Unfortunately, call-stack profiling support for Haskell is complicated by the fact that GHC currently uses machine registers in a rather non-standard way: rather than using the machine’s stack pointer register (`%rsp` on x86-64) to store the Haskell stack pointer, GHC reserves another register (`%rbp` on x86-64) for this purpose. This choice simplifies interaction with foreign code at the expense of one usable register. While DWARF’s representation of unwinding information is flexible enough to encode GHC’s non-standard stack register, the unwinding logic in the Linux kernel used by `perf record` sadly is not. For this reason, the unwind information produced by `ghc -g` is of little use when profiling with `perf record -g`.
Other than DWARF unwinding, the only viable call-graph option is the last branch record (LBR). Unfortunately, the rather restrictive 32-branch depth limit and the fact that GHC does not use the traditional `call` instruction mean that in practice the call-graphs produced by this method tend not to be very useful.
In sum, I currently do not have a prescription for call-graph profiling of Haskell programs with `perf`. The next and final post will conclude this discussion of GHC’s debug information support by looking at future directions (including possible solutions to this call-graph problem) and other related projects.