Towards system profiler support for GHC

Andreas Klebinger – Thursday, 22 July 2021

all community debugging ghc hasura profiling rts

Recently we have been working with Hasura towards making call stack profiling possible using perf and similar tools. For that purpose Andreas Klebinger implemented a prototype branch of GHC which brings us closer to such support.

In this blog post we discuss how we could adapt GHC’s register assignment to make perf usable on Haskell code, what benefits we could gain from this functionality, as well as the trade-offs associated with the change.

We also request feedback from the Haskell community on these trade-offs. Please let us know if you think this would be a good direction for GHC!

The status quo

The best bet for profiling Haskell programs currently is GHC’s profiling mode.

When it works, this mode works really well. However, it has some pretty big caveats:

It requires recompilation of the executable and all its libraries.
It comes with some overhead, as it changes the representation of heap objects, resulting in a significant cost in both residency and runtime when profiling.
Inserting cost centres into your program can interfere with optimizations. This means things like list fusion or inlining behaviour can differ vastly between a profiled and release build.

For some specific use cases one can work around the last point by using a profiling mode called ticky-ticky. Using ticky-ticky can be tricky. It’s not considered a stable feature of GHC and as such can change between versions. While it doesn’t interfere with core optimizations, it comes at a steep runtime cost and, being intended as a tool for language implementors, isn’t very user friendly. This rarely makes ticky-ticky the best option for regular users.

While the above profiling modes can be very helpful they do not cover all situations. What is missing is a low overhead way to characterize performance, especially runtime. Ideally something that works both on Haskell and other languages.

Statistical profilers

System profilers like perf on Linux facilitate such low overhead performance profiling/monitoring. Similar tools exist for all platforms, but for this blog post we will focus on perf as most of these points also apply to other platforms and programs.

perf not only allows users capture performance data about their applications, it’s also for the most part language agnostic. Allowing users to profile applications which are built with multiple languages. Moreover, it allows us to do all that without recompilation, or even without restarting of applications as it can be attached and detached from processes at any point.

It does this by using a kernel interface to take snapshots of an application’s state at a certain rate. These snapshots include things like the current instruction, the values currently in registers and the C call stack.

By taking a large number of samples, we can infer which functions we spend most time in. This already works on compiled Haskell code even today. However, while knowing that most time is spent in, e.g., Data.Map.insert can sometimes be helpful, it’s not always enough. For complex applications what we really want to know is why are we inserting into a map so much, or why are we spending so much time decoding JSON, for example.

While perf can’t answer the “why” question directly, if we can get access to the call stack we can infer much of the control flow, and with domain knowledge of an application this often helps answer the “why” question. This makes call stack information incredibly valuable. This information can also be expressed in powerful visualizations like flame graphs which can make it even more obvious what the performance bottlenecks are in a specific workload.

The bad news is that getting the call stack for Haskell code via perf currently doesn’t work at all.

Why `perf` doesn’t work for Haskell

perf uses a kernel interface (namely perf_events) to capture machine state (registers, C stack) and make the captured data available to tools (e.g. via perf record). These tools can then analyze the data for monitoring or profiling purposes.

However the interface to capture the machine state makes assumptions which simply don’t apply to Haskell code. The details could fill more than one blog post on their own but in rough terms:

Haskell’s execution model currently relies on keeping a (Haskell) stack, separate from the C stack.
- For a deep dive on this topic see the paper Implementing Lazy Functional Languages on Stock Hardware: The Spineless Tagless G-Machine, which covers the essentials of the execution model, while Faster laziness using dynamic pointer tagging contains some important changes made since.
When compiling to machine code GHC uses both the C stack, and the Haskell stack.
- The C stack is used for foreign calls, the runtime system, and spilling of temporaries during the execution of a Haskell function.
- The Haskell stack is used for control flow between pieces of Haskell code (e.g. one Haskell function calling another).
Since both stacks are referenced often from machine code, references to both stacks are kept in registers, currently (on x86-64) $rsp for the C stack and $rbp for the Haskell stack.
The perf interface allows for capturing only one stack, and it must be pointed to by the $rsp register.

While we could in theory change the kernel interface, it seems unlikely that such a change would be accepted, and it would take even longer for it to filter down to users of Haskell.

This means as-is we can’t capture the Haskell stack using the perf kernel interface. Unless we change some things about how GHC works.

Ben Gamari wrote about a few approaches on how one might get this to work in his DWARF blog post series. Currently we are looking into the “Using native stack pointer register” approach.

Capturing the Haskell stack

What do we need to do in order to make this work?

If we want to reuse tools based on perf we have to capture call stacks, and do it in a way which perf understands.

We could introduce an alternative method to capture the Haskell stack. For example, the Haskell runtime could periodically capture and record the current execution stacks (#10915).
We could change the register mapping such that $rsp points to the Haskell stack (#8272).

The advantage of the latter is that it should be easier to integrate into already existing tooling. E.g. if perf record can properly record the Haskell stack one would expect perf report and similar to just work.

Working with Hasura we have implemented a branch of GHC for which $rsp points at the Haskell stack during execution, in order to investigate the feasibility of this approach.

While we can run Haskell programs compiled this way there seem to be some issues of perf not fully understanding the captured stacks. We can see that perf has captured the Haskell stacks, but unlike gdb it doesn’t seem to recognize them as call stacks at this point. Before we spend time fixing this we would like feedback from the community to determine whether this change is worthwhile.

How does a call stack look?

Here is a call graph produced by first compiling GHC itself with our branch, then running it inside gdb and interrupting mid compilation to get a callstack:

#0  0x00007ffff4a57a6e in ghc_GHCziCoreziOptziSimplifyziEnv_zdwsimplBinder_info () at compiler/GHC/Core/Opt/Simplify/Env.hs:772
#1  0x00007ffff4a2d258 in rpjH_info () at compiler/GHC/Core/Opt/Simplify.hs:1575
#2  0x00007ffff4a2df70 in rpjI_info () at compiler/GHC/Core/Opt/Simplify.hs:1524
#3  0x00007ffff4a432b8 in rpjN_info () at compiler/GHC/Core/Opt/Simplify.hs:1002
#4  0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#5  0x00007ffff4a4c498 in rpjP_info () at compiler/GHC/Core/Opt/Simplify.hs:998
#6  0x00007ffff4a215a8 in sCVS_info () at compiler/GHC/Core/Opt/Simplify.hs:2944
#7  0x00007ffff4a22e78 in rpju_info () at compiler/GHC/Core/Opt/Simplify.hs:2944
#8  0x00007ffff4a24ac0 in rpjx_info () at compiler/GHC/Core/Opt/Simplify.hs:2799
#9  0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#10 0x00007ffff4a4c498 in rpjP_info () at compiler/GHC/Core/Opt/Simplify.hs:998
#11 0x00007ffff4a215a8 in sCVS_info () at compiler/GHC/Core/Opt/Simplify.hs:2944
#12 0x00007ffff4a22e78 in rpju_info () at compiler/GHC/Core/Opt/Simplify.hs:2944
#13 0x00007ffff4a24ac0 in rpjx_info () at compiler/GHC/Core/Opt/Simplify.hs:2799
#14 0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#15 0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#16 0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#17 0x00007ffff4a2c510 in rpjF_info () at compiler/GHC/Core/Opt/Simplify.hs:1627
#18 0x00007ffff4a25588 in ghc_GHCziCoreziOptziSimplify_simplExpr1_info () at compiler/GHC/Core/Opt/Simplify.hs:1179
#19 0x00007ffff4a2e008 in rpjI_info () at compiler/GHC/Core/Opt/Simplify.hs:1547
#20 0x00007ffff4a28490 in rpjE_info () at compiler/GHC/Core/Opt/Simplify.hs:354
#21 0x00007ffff4a4fa70 in rpjR_info () at compiler/GHC/Core/Opt/Simplify.hs:235
#22 0x00007ffff4a4fb08 in rpjR_info () at compiler/GHC/Core/Opt/Simplify.hs:229
#23 0x00007ffff4a4f3c8 in ghc_GHCziCoreziOptziSimplify_simplTopBindszuzdssimplzubinds_info () at compiler/GHC/Core/Opt/Simplify.hs:229
#24 0x00007ffff4a4ff00 in ghc_GHCziCoreziOptziSimplify_zdwsimplTopBinds_info () at compiler/GHC/Core/Opt/Simplify.hs:218
#25 0x00007ffff49e2348 in ss7d_info () at compiler/GHC/Core/Opt/Pipeline.hs:788
#26 0x00007ffff4a5fa30 in ghc_GHCziCoreziOptziSimplifyziMonad_initSmpl1_info () at compiler/GHC/Core/Opt/Simplify/Monad.hs:95

It’s worth noting that function names get mangled by GHC. ghc_GHCziCoreziOptziSimplifyziEnv_zdwsimplBinder_info is a function in the ghc library from the module GHC.Core.Opt.Simplify.Env named $wsimplBinder. See the ghc wiki for a more detailed explaination.

All the aBcD_info entries are functions introduced by GHC during optimizations via things like float-out. While their name carries no meaning, thankfully debug info still retains the source line where they originated from. If we think it’s helpful this means we can investigate these closer, but for now we will just ignore them.

The “cleaned up” call stack is this:

#0  0x00007ffff4a57a6e in ghc_GHCziCoreziOptziSimplifyziEnv_zdwsimplBinder_info () at compiler/GHC/Core/Opt/Simplify/Env.hs:772
#4  0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#9  0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#14 0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#15 0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#16 0x00007ffff4a32c98 in ghc_GHCziCoreziOptziSimplify_simplExpr2_info () at compiler/GHC/Core/Opt/Simplify.hs:1752
#18 0x00007ffff4a25588 in ghc_GHCziCoreziOptziSimplify_simplExpr1_info () at compiler/GHC/Core/Opt/Simplify.hs:1179
#23 0x00007ffff4a4f3c8 in ghc_GHCziCoreziOptziSimplify_simplTopBindszuzdssimplzubinds_info () at compiler/GHC/Core/Opt/Simplify.hs:229
#24 0x00007ffff4a4ff00 in ghc_GHCziCoreziOptziSimplify_zdwsimplTopBinds_info () at compiler/GHC/Core/Opt/Simplify.hs:218
#26 0x00007ffff4a5fa30 in ghc_GHCziCoreziOptziSimplifyziMonad_initSmpl1_info () at compiler/GHC/Core/Opt/Simplify/Monad.hs:95

Looking at this call stack as someone familiar with GHC’s source code I can immediately see that we are in the simplifier (Simplify.hs), currently in the process of walking over all the top level bindings (simplTopBinds) and simplifying their rhss (simplExpr). Likely recursing into some nested expression.

For GHC none of that is really surprising (most of the time is spent simplifying) but it shows how meaningful stack traces can be despite the limitations mentioned above.

What we can’t see from a single stack trace is how much time we spend in these functions compared to the rest of the compiler. That requires sampling the stack often, ideally via perf, which is our goal but doesn’t work quite yet.

Limitations of call-stack sampling in Haskell

Applied to Haskell there are some limitations we have to consider when looking at call stacks. None of these issues are show stoppers, but it’s good to go into this without expecting perfect C-like call stacks from perf support. The limitations we have to deal with are:

Stack chunks

First of Haskell stacks are chunked. A Haskell stack might not just be one block of memory. Instead a Haskell call stack can be split over multiple chunks of memory. This means even when capturing the Haskell stack we can capture at most one chunk if there are multiple when using perf.

What does this mean in practice? Instead of capturing the full call stack the callstack would look as if it started “in the middle”. This means we might not always be able to assign the cost of a function correctly if we fail to capture all of its callers.

However, we don’t expect this to be a big issue in practice.

Chunks are usually large enough to capture all or at least a good deal of control flow.
The chunk size can be controlled at program startup, so users can select defaults that allow all of their control flow to be captured.
Even partial call stacks can often be quite revealing, and speed up diagnosing of problems significantly.

For these reasons we don’t expect partial stack capture to be a cause for concern.

Tail calls

Haskell makes good use of tail calls, especially compared to C where these are rather rare. This means a function can “disappear” from the call stack. So while f might call g which in turn calls h it’s entirely possible for the call stack to look as if f had called h directly if we capture the call stack during hs execution. This is something to account for when looking at profiling results, but most of the time there are enough non-tail calls for us to still make sense of execution.

Lazy evaluation

Lazy evaluation presents a unique set of challenges when reasoning about performance. In particular, the syntactic order of calls (e.g. from HasCallStack) can differ significantly from the operational order of execution. While this can make call stacks harder to understand, in our experience the essential control flow of an application is usually still apparent.

Downsides of changing register assignment

Changing GHC’s register assignment to accommodate stack sampling is not without some costs. Below we try to characterise these costs based on experiences from our prototype patch.

Runtime performance

This is by far the biggest trade off. Because of differences in the x86-64 instruction encoding, using $rsp (the machine stack register) for the Haskell stack results in larger executables. Larger executables in this case also means more cache misses and slower execution.

For the nofib benchmark suite the changes to key performance metrics were (geometric mean, less is better):

Metric	Intel i7-6700K	Intel i7-4790K
Code size	+3.52%	+3.52%
Elapsed (wall) time	+0.45%	+1.01%

We performed measurements on an Intel i7-6700K (limited to 3.8 GHz to avoid thermal throttling) and an Intel i7-4790K (4.0 GHz, boosts up to 4.4 GHz).

Based on our experience with changes to the NCG in the past, and these two machines in particular, the difference between these measurements is explained well by the fact that the higher clock rate means cache misses have a more severe impact on runtime.

Lower clocked machines are likely to see a even smaller cost for this change, but for simplicity we think it’s fair to say these changes increase runtime on average by 0.5%-1% through their effect on code size.

The question to ask is then: If we already had perf support today, would we sacrifice that support for 0.5-to-1% gain in runtime performance? While that support comes at a slight cost in terms of possible performance, better profiling capabilities mean better optimized Haskell code. As a result we might end up with the average Haskell executable being faster than before after all is said and done!

Implementation complexity

LLVM support is likely to require a more complex implementation than our work so far. LLVM very much likes to control the C stack, so GHC simply using the C-stack register for the Haskell stack is unlikely to just work. At the very least we expect this to require changes to LLVM’s upstream in order to support the new calling convention, as well as changes to GHC’s own LLVM code generation. However, the amount of effort to make this work is not quite clear yet as our focus was to implement this based on the NCG first.

Theoretically it would also be possible to introduce a new “way”, so that GHC would support both the existing and new register assignments. This would perhaps delay the need to implement this for the LLVM backend. But it comes with its own problems. Most of all the fact that we would not be able to profile applications with perf unless they are compiled in the right way to begin with.

Request for feedback

We would like to ask you, the community using Haskell and GHC, if you think pursuing this further is worthwhile.

Is losing a bit of performance worth the benefits we gain? If we already had these benefits would we be willing to give them up for one more sliver of runtime performance?

So if you have opinions on such a change, one way or another, please let us know. We will definitely look at feedback on the ticket as well as on the ghc-devs and haskell-cafe mailing lists. We will also try to keep an eye on the respective Twitter/Reddit posts once they are up.