Well-Typed Blog

Improvements to the ghc-debug terminal interface

matthew, zubin, hannes — Wed, 24 Apr 2024 00:00:00 GMT

ghc-debug is a debugging tool for performing precise heap analysis of Haskell programs (check out our previous post introducing it). While working on Eras Profiling, we took the opportunity to make some much-needed improvements and quality-of-life fixes to both the ghc-debug library and the ghc-debug-brick terminal user interface.

To summarise,

ghc-debug now works seamlessly with profiled executables.
The ghc-debug-brick UI has been redesigned around a composable, filter-based workflow.
Cost centers and other profiling metadata can now be inspected using both the library interface and the TUI.
More analysis modes have been integrated into the terminal interface such as the 2-level profile.

This post explores the changes and the new possibilities for inspecting the heap of Haskell processes that they enable. These changes are available by using the 0.6.0.0 version of ghc-debug-stub and ghc-debug-brick.

Recap: using `ghc-debug`

There are typically two processes involved when using ghc-debug on a live program. The first is the debuggee process, which is the process whose heap you want to inspect. The debuggee process is linked against the ghc-debug-stub package. The ghc-debug-stub package provides a wrapper function

withGhcDebug :: IO a -> IO a

that you wrap around your main function to enable the use of ghc-debug. This wrapper opens a Unix socket and answers queries about the debuggee process’ heap, including transmitting various metadata about the debuggee, like the GHC version it was compiled with, and the actual bits that make up various objects on the heap.
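As a rough illustration of the debuggee side, the sketch below wraps a program's main action in withGhcDebug. It assumes that ghc-debug-stub is listed as a dependency and that the wrapper is imported from GHC.Debug.Stub; the retained list and the idle loop are hypothetical stand-ins for a real application.

```haskell
import Control.Concurrent (threadDelay)
import Control.Monad (forever)
import GHC.Debug.Stub (withGhcDebug) -- provided by the ghc-debug-stub package

-- Hypothetical workload: data kept alive so there is something to inspect.
retained :: [Int]
retained = [1 .. 1000]

main :: IO ()
main = withGhcDebug $ do
    -- Force the list so that it is fully materialised on the heap.
    print (sum retained)
    -- Keep the process alive so that a debugger has time to connect.
    forever (threadDelay 1000000)
```

Only the call to withGhcDebug is specific to ghc-debug; the rest of the program is left unchanged.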

The second is the debugger process, which queries the debuggee via the socket mechanism and decodes the responses to reconstruct a view of the debuggee’s Haskell heap. The most common debugger which people use is ghc-debug-brick, which provides a TUI for interacting with the debuggee process.

It is an important principle of ghc-debug that the debugger and debuggee don’t need to be compiled with the same version of GHC as each other. In other words, a debugger compiled once can work with many different debuggees. With our most recent changes, debuggers now also work seamlessly with profiled executables.

TUI improvements

Exploring Cost Center Stacks in the TUI

For debugging profiled executables, we added support for decoding profiling information in the ghc-debug library. Once decoding support was added, it was easy to display the associated cost center stack information for each closure in the TUI, allowing you to interactively explore the chain of cost centers with source locations that lead to a particular closure being allocated. This gives you the same information as calling the GHC.Stack.whoCreated function on a closure, but for every closure on the heap! Additionally, ghc-debug-brick allows you to search for closures that have been allocated under a specific cost center.

As we already discussed in the eras profiling blog post, object addresses are coloured according to the era they were allocated in.

If other profiling modes like retainer profiling or biographical profiling are enabled, then the extra word tracked by those modes is used to mark used closures with a green line.

A filter-based workflow

Typical ghc-debug-brick workflows would involve connecting to the debuggee process or a snapshot and then running queries like searches to track down the objects that you are interested in. This took the form of various search commands available in the UI.

However, sometimes you would like to combine multiple search commands in order to narrow down more precisely the exact objects you are interested in. Previously you would have to do this by either writing custom queries with the ghc-debug Haskell API or modifying the ghc-debug-brick code itself to support your custom queries.

Filters provide a composable workflow in order to perform more advanced queries. You can select a filter to apply from a list of possible filters, like the constructor name, closure size, era etc. and add it to the current filter stack to make custom search queries. Each filter can also be inverted.

We were motivated to add this feature after implementing support for eras profiling as it was often useful to combine existing queries with a filter by era. With these filters it’s easy to express your own domain specific queries, for example:

Find the Foo constructors which were allocated in a certain era.
Find all ARR_WORDS closures which are bigger than 1000 bytes.
Show me everything retained in this era, apart from ARR_WORDS and GRE constructors.
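To make the idea of composing and inverting filters concrete, here is a small self-contained model. It is only a sketch: the Closure record, the Filter type and every function name are hypothetical and are not the types used by ghc-debug or ghc-debug-brick, and it assumes that the filters in a stack are combined conjunctively.

```haskell
-- Hypothetical, simplified view of a heap closure (not ghc-debug's real type).
data Closure = Closure
    { closureName :: String -- constructor name or closure type
    , closureSize :: Int    -- size in bytes
    , closureEra  :: Int    -- era the closure was allocated in
    } deriving (Show, Eq)

-- A filter is modelled as a predicate on closures.
type Filter = Closure -> Bool

byName :: String -> Filter
byName n c = closureName c == n

largerThan :: Int -> Filter
largerThan n c = closureSize c > n

inEras :: Int -> Int -> Filter
inEras lo hi c = closureEra c >= lo && closureEra c <= hi

-- Inverting a filter negates its predicate.
invert :: Filter -> Filter
invert f = not . f

-- Assumption: a filter stack keeps the closures accepted by every filter.
applyStack :: [Filter] -> [Closure] -> [Closure]
applyStack fs = filter (\c -> all ($ c) fs)

heap :: [Closure]
heap =
    [ Closure "ARR_WORDS" 4096 9
    , Closure "ARR_WORDS" 64 13
    , Closure "Foo" 24 13
    , Closure "GRE" 48 13
    ]

main :: IO ()
main = do
    -- All ARR_WORDS closures bigger than 1000 bytes.
    print (applyStack [byName "ARR_WORDS", largerThan 1000] heap)
    -- Everything in era 13, apart from ARR_WORDS and GRE constructors.
    print (applyStack [inEras 13 13, invert (byName "ARR_WORDS"), invert (byName "GRE")] heap)
```

Each of the example queries above corresponds to one list of filters in this model, which is what makes the workflow composable.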

Here is a complete list of filters which are currently available:

Name	Input	Example	Action
Address	Closure Address	0x421c3d93c0	Find the closure with the specific address
Info Table	Info table address	0x1664ad70	Find all closures with the specific info table
Constructor Name	Constructor name	Bin	Find all closures with the given constructor name
Closure Name	Name of closure	sat_sHuJ_info	Find all closures with the specific closure name
Era	Era or era range (start-end)	13 or 9-12	Find all closures allocated in the given era range
Cost centre ID	A cost centre ID	107600	Find all closures allocated (directly or indirectly) under this cost centre ID
Closure Size	Int	1000	Find all closures larger than a certain size
Closure Type	A closure type description	ARR_WORDS	Find all ARR_WORDS closures

All these queries are retainer queries which will not only show you the closures in question but also the retainer stack which explains why they are retained.

Improvements to profiling commands

ghc-debug-brick has long provided a profile command which performs a heap traversal and provides a summary like a single sample from a -hT profile. The result of this query is now displayed interactively in the terminal interface. For each entry, the left column in the header shows the type of closure in question, the total number of this closure type which are allocated, the number of bytes on the heap taken up by this closure, the maximum size of each of these closures and the average size of each allocated closure. The right column shows the same statistics, but taken over all closures in the current heap sample.
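The per-band statistics described above (count, total size, maximum size and average size) can be sketched as a fold over the closures seen during a traversal. This is an illustrative model only: the Stats record and the association-list census are hypothetical and do not reflect how ghc-debug implements its profile.

```haskell
import Data.List (foldl')

-- Hypothetical per-band statistics mirroring the columns described above.
data Stats = Stats
    { count     :: Int -- number of closures in the band
    , totalSize :: Int -- bytes taken up by the band
    , maxSize   :: Int -- largest single closure in the band
    } deriving (Show, Eq)

averageSize :: Stats -> Double
averageSize s = fromIntegral (totalSize s) / fromIntegral (count s)

-- Fold one (closure type, size in bytes) observation into the census.
observe :: [(String, Stats)] -> (String, Int) -> [(String, Stats)]
observe acc (ty, sz) = case lookup ty acc of
    Nothing -> (ty, Stats 1 sz sz) : acc
    Just s  ->
        (ty, Stats (count s + 1) (totalSize s + sz) (max (maxSize s) sz))
            : filter ((/= ty) . fst) acc

census :: [(String, Int)] -> [(String, Stats)]
census = foldl' observe []

-- Made-up sample of closures, purely for illustration.
sample :: [(String, Int)]
sample = [("THUNK", 16), ("ARR_WORDS", 1024), ("THUNK", 32)]

main :: IO ()
main = mapM_ print (census sample)
```

Summing the per-band statistics over all bands would give the whole-heap figures shown in the right column.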

Each entry can be expanded: five sample points from each band are saved so that you can inspect some of the closures which contributed to the size of the band. For example, here we expand the THUNK closure and can see a sample of 5 thunks which contribute to the 210,000 thunks which are live on this heap.

Support for the 2-level closure type profile has also been added to the TUI. The 2-level profile is more fine-grained than the 1-level profile as the profile key also contains the pointer arguments for the closure rather than just the closure itself. The key :[(,), :] means the list cons constructor, where the head argument is a 2-tuple, and the tail argument is another list cons.

For example, in the 2-level profile, lists of different types will appear as different bands. In the profile above you can see 4 different bands resulting from lists, of 4 different types. Thunks also normally appear separately as they are also segmented based on their different arguments. The sample feature also works for the 2-level profile so it’s straightforward to understand what exactly each band corresponds to in your program.
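The difference between the two profile keys can be illustrated with a toy closure graph. The Node type and both key functions below are hypothetical; they only mimic the key format shown above and are not ghc-debug's implementation.

```haskell
import Data.List (intercalate)

-- Hypothetical closure graph node: a name plus its pointer arguments.
data Node = Node
    { nodeName :: String
    , nodePtrs :: [Node]
    }

-- 1-level key: just the closure itself.
oneLevelKey :: Node -> String
oneLevelKey = nodeName

-- 2-level key: the closure together with the names of its pointer arguments.
twoLevelKey :: Node -> String
twoLevelKey (Node n ps) = n ++ "[" ++ intercalate ", " (map nodeName ps) ++ "]"

-- A list cons cell whose head is a 2-tuple and whose tail is another cons.
example :: Node
example =
    Node ":"
        [ Node "(,)" [Node "I#" [], Node "I#" []]
        , Node ":" [Node "(,)" [], Node "[]" []]
        ]

main :: IO ()
main = do
    putStrLn (oneLevelKey example)
    putStrLn (twoLevelKey example)
```

Under the 1-level key every cons cell lands in the same band, whereas the 2-level key separates cons cells by what they point to.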

Other UI improvements

In addition to the new features discussed above, some other recent enhancements include:

Improved the performance of the main view when displaying a large number of rows. This noticeably reduces input lag while scrolling.
The search limit was hard-coded to 100 objects, which meant that only the first few results of a search would be visible in the UI. This limit is now configurable in the UI.
Additional analyses are now available in the TUI, such as finding duplicate ARR_WORDS closures, which is useful for identifying cases where programs end up storing many copies of the same bytestring.

Conclusion

We hope that the improvements to ghc-debug and ghc-debug-brick will aid the workflows of anyone looking to perform detailed inspections of the heap of their Haskell processes.

This work has been performed in collaboration with Mercury. Mercury have a long-term commitment to the scalability and robustness of the Haskell ecosystem and are supporting the development of memory profiling tools to aid with these goals.

Well-Typed are always interested in projects and looking for funding to improve GHC and other Haskell tools. Please contact info@well-typed.com if we might be able to work with you!

Choreographing a dance with the GHC specializer (Part 1)

finley — Mon, 15 Apr 2024 00:00:00 GMT

Specialization is an optimization technique used by GHC to eliminate the performance overhead of ad-hoc polymorphism and enable other powerful optimizations. However, specialization is not free, since it requires more work by GHC during compilation and leads to larger executables. In fact, excessive specialization can result in significant increases in compilation cost and executable size with minimal runtime performance benefits. For this reason, GHC pessimistically avoids excessive specialization by default and may leave relatively low-cost performance improvements undiscovered in doing so.

Optimistic Haskell programmers hoping to take advantage of these missed opportunities are thus faced with the difficult task of discovering and enacting an optimal set of specializations for their program while balancing any performance improvements with the increased compilation costs and executable sizes. Until now, this dance was a clunky one involving desperately wading through GHC Core dumps only to come up with a precarious, inefficient, unmotivated set of pragmas and/or GHC flags that seem to improve performance.

In this two-part series of posts, I describe the recent work we have done to improve this situation and make optimal specialization of Haskell programs more of a science and less of a dark art. In this first post, I will

give a comprehensive introduction to GHC’s specialization optimization,
explore the various facilities that GHC provides for observing and controlling it, and
present a simple framework for thinking about the trade-offs of specialization.

In the next post of the series, I will

present the new tools and techniques we have developed to diagnose performance issues resulting from ad-hoc polymorphism,
demonstrate how these new tools can be used to systematically identify useful specializations, and
make sense of their impact in terms of the framework described in this post.

The intended audience of this post includes intermediate Haskell developers who want to know more about specialization and ad-hoc polymorphism in GHC, and advanced Haskell developers who are interested in systematic approaches to specializing their applications in ways that minimize compilation cost and executable sizes while maximizing performance gains.

This work was made possible thanks to Hasura, who have supported many of Well-Typed’s successful initiatives to improve tooling for commercial Haskell users.

I presented a summary of the content in this post on The Haskell Unfolder:

The Haskell Unfolder Episode 23: specialisation

Overloaded functions are common in Haskell, but they come with a cost. Thankfully, the GHC specialiser is extremely good at removing that cost. We can therefore write high-level, polymorphic programs and be confident that GHC will compile them into very efficient, monomorphised code. In this episode, we’ll demystify the seemingly magical things that GHC is doing to achieve this.

Ad-hoc polymorphism

In Haskell, an ad-hoc polymorphic or overloaded function is one whose type contains class constraints. For example, this f is an overloaded function:

f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y

For some type a such that Ord a and Num a instances are provided, f takes two values of type a and evaluates to another a.

Importantly, unlike type arguments, those class constraints are not erased at runtime! Actually, they will be passed to f just like any other value argument, meaning f at runtime is more like:

f :: Ord a -> Num a -> a -> a -> a
f ord_a num_a x y = ...

How does the definition of f change to represent this? And what do these ord_a and num_a values look like? This is how it works:

Instances are compiled to records, typically referred to as dictionaries, whose fields are the definitions provided in the instance.
Class functions (e.g. < in the body of f) become record selectors that are applied to the dictionaries to look up the appropriate definitions.

Thus, f at runtime is more like:

f :: Ord a -> Num a -> a -> a -> a
f ord_a num_a x y =
    if (<) ord_a x y then
        (+) num_a x y
    else
        (-) num_a x y

The previously-infix class operators are now applied in prefix position to select the appropriate definitions out of the dictionaries, which are then applied to the arguments.
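The same translation can be written by hand in ordinary Haskell, which may help make it less abstract. The sketch below is only an approximation: OrdDict, NumDict and fDict are hypothetical names, and GHC's real dictionaries contain every method of the class (and its superclasses), not just the fields used here.

```haskell
-- Hypothetical hand-written "dictionaries" with only the fields f needs.
data OrdDict a = OrdDict { lt :: a -> a -> Bool }
data NumDict a = NumDict { plus :: a -> a -> a, minus :: a -> a -> a }

-- Stand-ins for the global Ord Int and Num Int instance dictionaries.
ordInt :: OrdDict Int
ordInt = OrdDict { lt = (<) }

numInt :: NumDict Int
numInt = NumDict { plus = (+), minus = (-) }

-- f with its class constraints turned into explicit record arguments.
fDict :: OrdDict a -> NumDict a -> a -> a -> a
fDict ordA numA x y =
    if lt ordA x y then
        plus numA x y
    else
        minus numA x y

main :: IO ()
main = do
    print (fDict ordInt numInt 1 2) -- 1 < 2, so this computes 1 + 2
    print (fDict ordInt numInt 5 2) -- 5 >= 2, so this computes 5 - 2
```

Inside fDict nothing is known about lt, plus or minus beyond their types, which is the "opaque dictionary" situation the rest of this post is concerned with.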

We can see this for ourselves by compiling the definition of f in a module F.hs and emitting the intermediate representation (in GHC’s Core language):

ghc F.hs -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-ds

The -O flag enables optimizations, and the -ddump-ds flag tells GHC to dump the Core representation of the program after desugaring, before optimizations. The other flags make the output more readable.

For a comprehensive introduction to GHC Core and the flags GHC accepts for viewing it, check out The Haskell Unfolder Episode 9: GHC Core.

The above command will output the following Core for f:

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

The if has been transformed into a case (Core has no if construct). The $dOrd and $dNum arguments are the Ord a and Num a instance dictionaries, respectively. The < operator is applied in prefix position (as are all operators in Core) to the $dOrd dictionary to get the appropriate implementation of <, which is further applied to x and y. The - and + operators in the branches of the case are similar.

The extra allocations required to pass these implicit dictionary arguments and apply selectors to them do result in a measurable overhead, albeit one that is insignificant for most intents and purposes. As we will see, the real cost of ad-hoc polymorphism comes from the optimizations it prevents rather than the overhead it introduces.

Specialization

In this context, specialization refers to the removal of ad-hoc polymorphism. When we specialize an overloaded expression e :: C a => S a, we create a new binding eT :: S T, where T is some concrete type for which a C T instance exists. Here eT is the specialization of e at (or to) type T.

For example, we can manually create a specialization of f at type Int. The source definition stays exactly the same, only the type changes:

fInt :: Int -> Int -> Int
fInt x y =
    if x < y then
        x + y
    else
        x - y

At the Core level, the dictionaries that were passed as value arguments to f are now used directly in the body of fInt. If we add the definition of fInt to our example module and compile it as we did before, we get the following output:

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

fInt
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

fInt no longer accepts dictionary arguments, and instead references the global Ord Int and Num Int dictionaries directly. In fact, this definition of fInt is exactly what the GHC specializer would create if it decided to specialize f to Int. We can see this for ourselves by manually instructing GHC to do the specialization using a SPECIALIZE pragma. Our whole module is now:

module F where

{-# SPECIALIZE f :: Int -> Int -> Int #-}

f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y

fInt :: Int -> Int -> Int
fInt x y =
    if x < y then
        x + y
    else
        x - y

And the -ddump-ds Core output becomes:

fInt
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

$sf
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

The GHC generated specialization is named $sf (all specializations that GHC generates are prefixed by $s). Note that our specialization (fInt) and the GHC generated specialization ($sf) are exactly equivalent!

Why is this an optimization?

The above transformation really is all that the GHC specializer does to our programs. It may not be immediately clear why this optimization is a meaningful optimization at all. That is because specialization is an enabling optimization: The real benefit comes from the optimizations that it enables later in the pipeline, such as inlining.

Inlining is the replacement of defined (top-level or let-bound) variables with their definitions. Although f and its specialization $sf look similar, the key difference is that f includes calls to “unknown” functions passed as part of the dictionary arguments, while $sf includes calls to “known” functions contained in the $fOrdInt and $fNumInt dictionaries. Since GHC has access to the definitions of those dictionaries and the contained functions, they can be inlined, exposing yet more opportunities for optimization.

We can see this in action by comparing the fully optimized bindings of our example module to those just after desugaring. To do this, compile using the same command as above but add the -ddump-simpl flag, which tells GHC to dump the Core at the end of the Core optimization pipeline (also add -fforce-recomp to force recompilation, since we haven’t changed the code since our last compilation):

ghc F.hs -fforce-recomp -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-ds -ddump-simpl

The dumped output is:

==================== Desugar (after optimization) ====================
Result size of Desugar (after optimization)
  = {terms: 57, types: 37, coercions: 0, joins: 0/0}

fInt
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

$sf
  = \ x y ->
      case < $fOrdInt x y of {
        False -> - $fNumInt x y;
        True -> + $fNumInt x y
      }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

==================== Tidy Core ====================
Result size of Tidy Core
  = {terms: 44, types: 29, coercions: 0, joins: 0/0}

fInt
  = \ x y ->
      case x of { I# x1 ->
      case y of { I# y1 ->
      case <# x1 y1 of {
        __DEFAULT -> I# (-# x1 y1);
        1# -> I# (+# x1 y1)
      }
      }
      }

f = \ @a $dOrd $dNum x y ->
      case < $dOrd x y of {
        False -> - $dNum x y;
        True -> + $dNum x y
      }

------ Local rules for imported ids --------
"USPEC f @Int" forall $dNum $dOrd. f $dOrd $dNum = fInt

The output of the desugaring pass is in the “Desugar (after optimization)” section, while the fully optimized output is in the “Tidy Core” section. The name “Desugar (after optimization)” only means it is the desugared Core output after GHC’s simple optimizer has run. The simple optimizer only does very lightweight, pure transformations to the Core program. We will still refer to the Core output of this stage as “unoptimized”.

During the full optimization pipeline, GHC identified the equivalence between fInt and $sf and decided to remove $sf. The fully optimized binding for fInt is unboxing the Ints (pattern matching on the I# constructor) and using efficient primitive operations (<#, -#, +#), while the fully optimized binding for f is the same as the unoptimized binding. The optimizer simply couldn’t do anything with those opaque dictionaries in the way!

At the bottom of the output is the rewrite rule that the SPECIALIZE pragma created, which will cause any calls of f known to be at type Int to be rewritten as applications of fInt. This is what allows the rest of the program to benefit from the specialization. The rule simply discards the dictionary arguments $dNum :: Num Int and $dOrd :: Ord Int, which is safe because of global typeclass coherence: any dictionaries passed explicitly must have originally come from the same global instances.
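For comparison, a similar rewrite rule can be written by hand with a RULES pragma. This is a sketch of roughly what the SPECIALIZE pragma arranges, not a description of its exact implementation; the rule name "f/Int" and the NOINLINE pragma (used here so that f is not inlined before the rule has a chance to fire) are choices made for this example.

```haskell
f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y
{-# NOINLINE f #-}

-- Hand-written specialization of f at Int.
fInt :: Int -> Int -> Int
fInt x y =
    if x < y then
        x + y
    else
        x - y

-- Rewrite uses of f at type Int into uses of fInt. The dictionary arguments
-- do not appear in the source; GHC fills them in when checking the rule.
{-# RULES "f/Int" f = fInt #-}

main :: IO ()
main = do
    print (f (1 :: Int) 2) -- a candidate for rewriting to fInt 1 2 under -O
    print (f (5 :: Int) 2)
```

Rules are only applied when optimizations are enabled, and the program computes the same results whether or not the rule fires.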

In summary, by replacing the opaque dictionary arguments to f with references to the concrete Ord Int and Num Int dictionaries in fInt, GHC was able to do a lot more optimization later in the pipeline.

Automatic specialization

In our example module, we manually instructed GHC to generate a specialization of f at Int using a SPECIALIZE pragma. In reality, we often rely on GHC to figure out what specializations are necessary and generate them for us automatically. GHC needs to be careful though, since specialization requires the creation and optimization of more bindings, which increases compilation costs and executable sizes.

GHC uses several heuristics to avoid excessive automatic specialization by default. The heuristics are very pessimistic, which means GHC can easily miss valuable specialization opportunities that programmers may wish to manually address. This is precisely the manual effort that our recent work aims to assist, so before we go any further it’s important that we understand exactly when and why GHC decides specialization should (or should not) happen.

When does automatic specialization happen?

GHC will only potentially attempt automatic specialization in exactly one scenario: An overloaded call at a concrete, statically known type is encountered (we’ll refer to such calls as “specializable” calls from now on). This means that automatic specialization will only ever be triggered at call sites, not definition sites. Even in this scenario, there are other factors to consider which the following example will demonstrate.

Let’s add a binding foo to our example module F.hs from above:

foo :: (Integer, Integer) -> Integer
foo (x, y) = f x y

foo makes a specializable call to f at the concrete type Integer, so we might expect automatic specialization to happen. However, the inliner beats the specializer to the punch here, which is evident in the -ddump-simpl output:

$wfoo
  = \ ww ww1 ->
      case integerLt ww ww1 of {
        False -> integerSub ww ww1;
        True -> integerAdd ww ww1
      }

foo = \ ds -> case ds of { (ww, ww1) -> $wfoo ww ww1 }

Instead of specializing, GHC decided to eliminate the call entirely by inlining f, thus exposing other optimization opportunities (such as worker/wrapper) which GHC took advantage of. This is intended, since f is so small and GHC knows that inlining it is very cheap and likely worth the performance outcomes.

Another way we can observe the inlining decision by GHC here is via the -ddump-inlinings flag, which causes GHC to dump the names of any bindings it decides to inline. Compiling our module with

ghc F.hs -O -fforce-recomp -ddump-inlinings

results in output indicating that GHC did decide to inline f:

Inlining done: F.f

To inline or to specialize?

GHC prefers inlining over specialization, when possible, since inlining eliminates calls and doesn’t require creation of new bindings. However, excessive inlining is often even more dangerous than excessive specialization. So, even when a specializable call is deemed too costly to inline, GHC will still attempt to specialize it.

We can artificially create such a scenario in our example by adjusting what GHC calls the “unfolding use threshold”. An “unfolding” is, roughly, the definition of a binding that GHC uses when it decides to inline or specialize calls to that binding. The unfolding use threshold governs the maximum effective size¹ of unfoldings that GHC will inline, and it can be manually adjusted using the -funfolding-use-threshold flag. Let’s set the unfolding use threshold to -1, essentially making GHC think all inlining is very expensive, and check the -ddump-simpl output:

ghc F.hs -O -fforce-recomp -ddump-simpl -funfolding-use-threshold=-1

As we can see, GHC did specialize the call:

...
f_$sf1
  = \ x y ->
      case integerLt x y of {
        False -> integerSub x y;
        True -> integerAdd x y
      }

foo = \ ds -> case ds of { (ww, ww1) -> f_$sf1 ww ww1 }

------ Local rules for imported ids --------
"SPEC f @Integer" forall $dOrd $dNum. f $dOrd $dNum = f_$sf1
...

The name of the specialization (f_$sf1) and the rewrite rule indicate that GHC did successfully automatically specialize the overloaded call to f.

Interestingly, the Core terms for foo and its specialization f_$sf1 are alpha-equivalent to the terms we arrived at when GHC inlined the call and applied worker/wrapper instead², with the specialization playing the same role as the worker.

Cross-module automatic specialization

We have now discussed two prerequisites for automatic specialization of a call:

The call must be specializable (i.e. it must be a call to an overloaded binding at a known type).
Other optimizations, such as inlining, that remove the call or otherwise ruin the specializability of the call must not fire before specialization can occur.

In fact, for specializable calls which occur in the definition module of the overloaded binding (as was the case in our previous example), these are the only prerequisites. When the overloaded binding is imported from another module (as is most often the case), there are additional prerequisites which we’ll discuss now.

Exposed unfoldings and the `INLINABLE` pragma

GHC performs separate compilation (as opposed to whole program compilation), compiling one Haskell module at a time. When GHC compiles a module, it produces not only compiled code in an object file, but also an interface file (with suffix .hi). The interface file contains information about the module that GHC might need to reference when compiling other modules, such as the names and types of the bindings exported by the module. If certain criteria are met, GHC will include a binding’s unfolding in the module’s interface file so that it can be used later for cross-module inlining or specialization. Such unfoldings are referred to as exposed unfoldings.

Now, you might reasonably wonder: If unfoldings are used to do these powerful optimizations, why does GHC only expose unfoldings which meet some criteria? Why not expose all unfoldings? The reason is that during compilation, GHC holds the interfaces of every module in the program in memory. Thus, to keep GHC’s own default performance and memory usage reasonable, module interfaces need to be as small as possible while still producing well-optimized programs. One way that GHC achieves this is by limiting the size of unfoldings that get included in interface files so that only small unfoldings are exposed by default.

There’s another wrinkle here that impacts cross-module specialization: Even if GHC decides to expose an overloaded binding’s unfolding, and a specializable call to that binding occurs in another module, GHC will still never automatically specialize that call unless it has been given explicit permission to create the specialization. Such explicit permission can only be given in one of the following ways:

Mark the overloaded binding with either an INLINABLE or INLINE pragma.
Enable the -fspecialize-aggressively flag while compiling the calling module.

Let’s explore this fact by continuing with our example. Move foo, which makes a specializable call to f, to another module Foo.hs that has -funfolding-use-threshold set to -1 to fool the inliner as before:

{-# OPTIONS_GHC -funfolding-use-threshold=-1 #-}
module Foo where

import F

foo :: (Integer, Integer) -> Integer
foo (x, y) = f x y

Also remove everything from F.hs except f, for good measure:

module F where

f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y

Since f is so small, we might expect GHC to expose its unfolding in the F.hi module interface by default. If we compile with just

ghc F.hs

we get the object file F.o and the interface file F.hi. We can determine whether GHC decided to expose the unfolding of f by viewing the contents of the interface file using GHC’s --show-iface option:

ghc --show-iface F.hi -dsuppress-all

Specific information for each binding in the module is listed towards the bottom of the output. The GHC Core of any exposed unfoldings will be displayed under their respective bindings. In this case, the information for f looks like this:

bcb4b04f3cbb5e6aa2f776d6226a0930
  f :: (Ord a, Num a) => a -> a -> a
  []

It only includes the type, no unfolding! This is because at GHC’s default optimization level of -O0, the -fomit-interface-pragmas and -fignore-interface-pragmas flags are enabled, which prevent unfoldings (among other things) from being included in and read from the module interfaces. Recompile with optimizations enabled and check the module interface again:

ghc -O F.hs
ghc --show-iface F.hi -dsuppress-all

This time, GHC did expose the unfolding:

152dd20f273a86bea689edd6a298afe6
  f :: (Ord a, Num a) => a -> a -> a
  [...,
   Unfolding: Core: <vanilla>
              \ @a
                ($dOrd['Many] :: Ord a)
                ($dNum['Many] :: Num a)
                (x['Many] :: a)
                (y['Many] :: a) ->
              case < @a $dOrd x y of wild {
                False -> - @a $dNum x y True -> + @a $dNum x y }]

Remember, we still haven’t given GHC explicit permission to specialize calls to f across modules, so we should expect the fully optimized Core of Foo.hs to still include the overloaded call to f. Let’s check:

ghc Foo.hs -O -dno-typeable-binds -dsuppress-all -dsuppress-uniques -ddump-simpl

The dumped Core includes:

$wfoo = \ ww ww1 -> f $fOrdInteger $fNumInteger ww ww1

foo = \ ds -> case ds of { (ww, ww1) -> $wfoo ww ww1 }

Indeed, GHC applied the worker/wrapper transformation to foo, but was not able to specialize the call to f, despite it meeting our previously discussed prerequisites for automatic specialization.

There is a warning flag in GHC that can notify us of such a case: -Wall-missed-specializations. Compile Foo.hs again, including this flag:

ghc Foo.hs -O -fforce-recomp -Wall-missed-specializations

This will output the following warning:

Foo.hs: warning: [-Wall-missed-specialisations]
    Could not specialise imported function ‘f’
    Probable fix: add INLINABLE pragma on ‘f’

If we do what the warning says by adding an INLINABLE pragma on f, and dump the core of Foo.hs, we’ll see that automatic specialization succeeds:

$sf
  = \ x y ->
      case integerLt x y of {
        False -> integerSub x y;
        True -> integerAdd x y
      }

foo = \ ds -> case ds of { (ww, ww1) -> $sf ww ww1 }

------ Local rules for imported ids --------
"SPEC/Foo f @Integer" forall $dOrd $dNum. f $dOrd $dNum = $sf

Removing the INLINABLE pragma on f and instead enabling -fspecialize-aggressively has the same result.
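For reference, this is roughly what F.hs looks like with the pragma added. The placement of the pragma directly above the type signature is a stylistic choice for this sketch.

```haskell
module F where

-- INLINABLE makes GHC expose f's unfolding in F.hi and gives it permission to
-- specialize calls to f that occur in other modules.
{-# INLINABLE f #-}
f :: (Ord a, Num a) => a -> a -> a
f x y =
    if x < y then
        x + y
    else
        x - y
```

The pragma changes neither the type nor the behaviour of f, so Foo.hs needs no modification beyond being recompiled.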

The automatic specialization decision graph

We have now covered all the major prerequisites for automatic specialization. To summarize them, here is a decision graph illustrating the various ways that an arbitrary function call can trigger automatic specialization:

Now that we fully understand how, why, and when the GHC specializer works, we can move on to discussing the real problems that result from its behavior. Most of this discussion will be left for the next post in this series, but before concluding, I want to introduce something I call “the specialization spectrum”.

The specialization spectrum

Specialization is a very valuable compiler optimization, but I’ve mentioned many times throughout this post that excessive specialization can be a bad thing. This prompts a question: How do we know if we are appropriately benefitting from specialization? The meaning of “appropriately” here depends on application-specific requirements that dictate the desired size of our executables, how much we care about compilation costs, and how much we care about performance.

For example, if we want to maximize performance at all costs, we should make sure that we are generating and using the set of specializations that maximize the performance metrics we’re interested in, disregarding the increase in compilation costs and executable sizes.

Essentially, our goal is to find our ideal spot in the specialization spectrum.

The Specialization Spectrum

This is our search space, with performance on one axis and code size and compilation cost on the other. The plotted points represent important application-agnostic points in the spectrum. Those points are:

Baseline: Lowest performance and lowest cost. This point represents GHC’s default behavior where its heuristics will result in smaller code size and lower compilation cost but potentially miss specializations that would result in big performance wins.
Ideal: As the application authors, we get to choose the location of this point based on our priorities. Typically, we want this as “high and to the left” as possible.
Max performance: This point represents the optimal set of specializations, which will result in better runtime performance than any other set of specializations.
Max specialization: This point is the result of generating every³ possible specialization by enabling -fexpose-all-unfoldings and -fspecialize-aggressively. Importantly, this is not always equivalent to max performance! If we generate useless specializations that result in little to no performance improvements but do grow the code size, we can end up losing performance due to more code swapping in and out of CPU caches.

The dotted line illustrates an approximate “optimal path” representing the results we might see as we generate all specializations in order of decreasing performance improvement.

This framework makes it clear that this really is just an optimization problem, with all the normal issues of traditional optimization problems at play. Unfortunately, in the absence of good tools for exploring this spectrum, it is particularly easy for programmers to get lost and go down treacherous, suboptimal paths like this:

Such cases are deceptive, making the programmer think they have landed in a good spot when they are actually in a poor-performing local optimum. Fortunately, the tools and techniques we’ll discuss in the next post of this series will greatly simplify optimal search of the specialization spectrum.

Summary

This concludes our introductory exploration of specialization. Here’s what we have learned:

Calls to overloaded functions are compiled by passing dictionary values with a record of functions for each type class constraint.
Specialization removes type class dictionary arguments from an overloaded function and replaces references to them with references to a concrete dictionary instead.
Almost all of the benefit of specialization comes from the optimizations that it enables by replacing the opaque dictionary arguments with concrete dictionaries whose contents can be inlined.
GHC will only automatically specialize calls if a specific set of conditions holds. See the automatic specialization decision graph.
The specialization spectrum is a convenient framework for conceptualizing the impact of specialization on a program’s compilation cost and runtime performance.
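
The first three points can be made concrete with a hand-written sketch of dictionary passing. Everything here is illustrative: the names OrdNumDict, fDict, integerDict and fInteger are hypothetical, and the single record merges what GHC would pass as two separate dictionaries (one per constraint). It is a model of the idea, not GHC's actual Core.

```haskell
module DictSketch where

-- A hand-rolled stand-in for the `Ord a` and `Num a` dictionaries.
data OrdNumDict a = OrdNumDict
  { lessThan :: a -> a -> Bool
  , plus     :: a -> a -> a
  , minus    :: a -> a -> a
  }

-- The overloaded function as written in source.
fOverloaded :: (Ord a, Num a) => a -> a -> a
fOverloaded x y = if x < y then x + y else x - y

-- Roughly how it is compiled: the constraints become an extra argument
-- whose fields are opaque to the optimizer.
fDict :: OrdNumDict a -> a -> a -> a
fDict d x y = if lessThan d x y then plus d x y else minus d x y

-- A concrete dictionary for Integer.
integerDict :: OrdNumDict Integer
integerDict = OrdNumDict (<) (+) (-)

-- The specialised version: no dictionary argument, so the Integer
-- operations are known and can be inlined.
fInteger :: Integer -> Integer -> Integer
fInteger x y = if x < y then x + y else x - y
```

All three functions compute the same results at Integer; the difference lies only in how much the optimizer knows about the operations involved.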

In the next post of this series, we will apply all of what we have learned so far on some example applications, and demonstrate how the new tools we have developed can help us achieve optimal specialization and performance.

Footnotes

The effective size of an unfolding can be thought of as the number of terms in the Core representation of the unfolding, plus or minus some discounts that are applied depending on where GHC is considering inlining the unfolding.↩︎
This hints at a weak confluence of GHC Core and the reductions (i.e. optimizations) that the GHC optimizer applies to it.↩︎
Even with something like this in a cabal.project file:
```
package *
  ghc-options: -fexpose-all-unfoldings -fspecialize-aggressively
```
Some overloaded calls may still not get specialized! This can occur if a chain of calls to overloaded functions includes a call to an overloaded function in a GHC boot library that cannot be reinstalled by Cabal, e.g. base, which does not have its unfolding exposed. The only way to specialize such calls is to build boot libraries from source with -fexpose-all-unfoldings and -fspecialize-aggressively, and include the snippet above in a cabal.project file.

Additionally, some specific scenarios can cause overloaded calls to appear late in the optimization pipeline. To specialize those calls, -flate-specialise (British spelling required) is necessary, which runs another specialization pass at the end of GHC’s Core optimization pipeline.

Further, even after the above, some overloaded calls may still survive without -fpolymorphic-specialisation (British spelling required), which is known to be unsound at the time of writing. Unfortunately, in complex applications, total elimination of overloaded calls is still quite a difficult goal to achieve.↩︎

Haskell development job with Well-Typed

edsko, adam, andres, ben, duncan — Tue, 09 Apr 2024 00:00:00 GMT

tl;dr If you’d like a job with us, send your application as soon as possible.

We are looking for a Haskell expert to join our team at Well-Typed. We are seeking a strong all-round Haskell developer who can help us with various client projects (rather than particular experience in any one specific field). This is a great opportunity for someone who is passionate about Haskell and who is keen to improve and promote Haskell in a professional context.

About Well-Typed

We are a team of top notch Haskell experts. Founded in 2008, we were the first company dedicated to promoting the mainstream commercial use of Haskell. To achieve this aim, we help companies that are using or moving to Haskell by providing a range of services including consulting, development, training, support, and improvement of the Haskell development tools.

We work with a wide range of clients, from tiny startups to well-known multinationals. For some we do proprietary Haskell development and consulting. For others, much of the work involves open-source development and cooperating with the rest of the Haskell community. We have established a track record of technical excellence and satisfied customers.

Our company has a strong engineering culture. All our managers and decision makers are themselves Haskell developers. Most of us have an academic background and we are not afraid to apply proper computer science to customers’ problems, particularly the fruits of FP and PL research.

We are a self-funded company so we are not beholden to external investors and can concentrate on the interests of our clients, our staff and the Haskell community.

About the job

The role is not tied to a single specific project or task, is fully remote, and has flexible working hours.

In general, work for Well-Typed could cover any of the projects and activities that we are involved in as a company. The work may involve:

Haskell application development
Working directly with clients to solve their problems
Working on GHC, libraries and tools
Teaching Haskell and developing training materials

We try wherever possible to arrange tasks within our team to suit people’s preferences and to rotate to provide variety and interest. At present you are more likely to be working on general Haskell development than on GHC or teaching, however.

About you

Our ideal candidate has excellent knowledge of Haskell, whether from industry, academia or personal interest. Familiarity with other languages, low-level programming and good software engineering practices are also useful. Good organisation and the ability to manage your own time and reliably meet deadlines are important. You should also have good communication skills.

You are likely to have a bachelor’s degree or higher in computer science or a related field, although this isn’t a requirement.

Further (optional) bonus skills:

familiarity with (E)DSL design,
knowledge of networking, concurrency and/or systems programming,
knowledge of and experience in applying formal methods,
experience of consulting or running a business,
experience in teaching Haskell or other technical topics,
experience with working on GHC,
experience with web programming
… (you tell us!)

Offer details

The offer is initially for one year full time, with the intention of a long term arrangement. Living in England is not required. We may be able to offer either employment or sub-contracting, depending on the jurisdiction in which you live. The salary range is 60k–100k GBP per year.

If you are interested, please apply by email to jobs@well-typed.com. Tell us why you are interested and why you would be a good fit for Well-Typed, and attach your CV. Please indicate how soon you might be able to start.

The deadline for applications is Tuesday April 30th 2024.

Calling Haskell from Swift

rodrigo — Tue, 02 Apr 2024 00:00:00 GMT

This is the second installment of the in-depth series of blog-posts on developing native macOS and iOS applications using both Haskell and Swift/SwiftUI. This post covers how to call (non-trivial) Haskell functions from Swift by using a foreign function calling-convention strategy similar to that described by Calling Purgatory from Heaven: Binding to Rust in Haskell that requires argument and result marshaling. You may find the other blog posts in this series interesting.

The series of blog posts is further accompanied by a github repository where each commit matches a step of this tutorial. If in doubt regarding any step, check the matching commit to make it clearer.

This write-up has been cross-posted to Rodrigo’s Blog.

Introduction

We’ll pick up from where the last post ended – we have set up an Xcode project that includes our headers generated from Haskell modules with foreign exports and linking against the foreign library declared in the cabal file. We have already been able to call a very simple Haskell function on integers from Swift via Haskell’s C foreign export feature and Swift’s C interoperability.

This part concerns itself with calling idiomatic Haskell functions, which typically involve user-defined datatypes as inputs and outputs, from Swift. Moreover, these functions should be made available to Swift transparently, such that Swift calls them as it does other idiomatic functions, with user defined structs and classes.

For the running example, the following not-very-interesting function will suffice to showcase the method we will use to expose this function from Haskell to Swift, which easily scales to other complex data types and functions.

data User
  = User { name :: String
         , age  :: Int
         }

birthday :: User -> User
birthday user = user{age = user.age + 1}

The Swift side should wrap Haskell’s birthday:

struct User {
    let name: String
    let age: Int
}

// birthday(user: User(name: "Anton", age: 33)) = User(name: "Anton", age: 34)
func birthday(user: User) -> User {
    // Calls Haskell function...
}

To support this workflow, we need a way to convert the User datatype from Haskell to Swift, and vice versa. We are going to serialize (most) inputs and outputs of a function. Even though the serialization as it will be described may seem complex, it can be automated with Template Haskell and Swift Macros and packed into a neat interface – which I’ve done in haskell-swift.

As a preliminary step, we add the User data type and birthday function to haskell-framework/src/MyLib.hs, and the Swift equivalents to SwiftHaskell/ContentView.swift from the haskell-x-swift-project-steps example project.

Marshaling Inputs and Outputs

Marshaling the inputs and outputs of a function, from the Swift perspective, means to serialize the input values into strings, and receive the output value as a string which is then decoded into a Swift value. The Haskell perspective is dual.

Marshaling/serializing is a very robust solution to foreign language interoperability. While there is a small overhead of encoding and decoding at a function call, it almost automatically extends to, and enables, all sorts of data to be transported across the language boundary, without it being vulnerable to compiler implementation details and memory representation incompatibilities.

We will use the same marshaling strategy that Calling Purgatory from Heaven: Binding to Rust in Haskell does. In short, the idiomatic Haskell function is wrapped by a low-level one which deserializes the Haskell values from the argument buffers, and serializes the function result to a buffer that the caller provides. More specifically,

For each argument of the original function, we have a Ptr CChar and Int – a string of characters and the size of that string (a.k.a. CStringLen)
For the result of the original function, we have two additional arguments, Ptr CChar and Ptr Int – an empty buffer in memory, and a pointer to the size of that buffer, both allocated by the caller.
For each argument, we parse the C string into a Haskell value that serves as an argument to the original function.
We call the original function
We overwrite the memory location containing the original size of the buffer with the required size of the buffer to fit the result (which may be smaller or larger than the actual size). If the buffer is large enough we write the result to it.
From the Swift side, we read the number of bytes specified in the memory location that now contains the required size. If it turns out that the required size is larger than the buffer’s size, we need to retry the function call with a larger buffer.
- This means we might end up doing the work twice, if the original buffer size is not big enough. Some engineering work might allow us to re-use the result, but we’ll stick with retrying from scratch for simplicity.
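
The size negotiation in the steps above can be sketched in plain Haskell using only base. This is an illustration under assumptions: Haskell plays both the callee and the caller here (in this post the caller is Swift), raw bytes stand in for the serialized result, and the names writeResult and callWithRetry are hypothetical.

```haskell
module BufferRetrySketch where

import Control.Monad (when)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (alloca, allocaBytes)
import Foreign.Marshal.Array (peekArray, pokeArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)

-- Callee side: always report the required size through the size
-- pointer, but only copy the payload if the caller's buffer is big enough.
writeResult :: [Word8] -> Ptr Word8 -> Ptr Int -> IO ()
writeResult payload buf sizePtr = do
  avail <- peek sizePtr
  let needed = length payload
  poke sizePtr needed
  when (needed <= avail) $ pokeArray buf payload

-- Caller side: try once with an initial capacity; if the callee reports
-- a larger required size, retry from scratch with a buffer of that size.
callWithRetry :: Int -> (Ptr Word8 -> Ptr Int -> IO ()) -> IO [Word8]
callWithRetry capacity callee =
  alloca $ \sizePtr -> do
    poke sizePtr capacity
    firstTry <- allocaBytes capacity $ \buf -> do
      callee buf sizePtr
      needed <- peek sizePtr
      if needed > capacity
        then pure Nothing
        else Just <$> peekArray needed buf
    case firstTry of
      Just bytes -> pure bytes
      Nothing -> do
        -- The size pointer already holds the required size.
        needed <- peek sizePtr
        allocaBytes needed $ \buf -> do
          callee buf sizePtr
          written <- peek sizePtr
          peekArray written buf
```

Calling callWithRetry 2 (writeResult [1, 2, 3, 4, 5]) exercises the retry path, since a 2-byte buffer cannot hold a 5-byte payload, and the callee runs twice just as described.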

We will use JSON as the serialization format: this choice is motivated primarily by convenience because Swift can derive JSON instances for datatypes out of the box (without incurring extra dependencies), and in Haskell we can use aeson to the same effect. In practice, it could be best to use a format such as CBOR or Borsh, which are binary formats optimised for compactness and serialization performance.

Haskell’s Perspective

Extending the User example requires User to be decodable, which can be done automatically by adding to the User declaration:

deriving stock Generic
deriving anyclass (ToJSON, FromJSON)

With the appropriate extensions and importing the necessary modules in MyLib:

{-# LANGUAGE DerivingStrategies, DeriveAnyClass, DeriveGeneric #-}

-- ...

import GHC.Generics
import Data.Aeson

The MyForeignLib module additionally must import

import Foreign.Ptr
import Foreign.Storable
import Foreign.Marshal
import Data.Aeson
import Data.ByteString
import Data.ByteString.Unsafe

Now, let’s (foreign) export a function c_birthday that wraps birthday above in haskell-framework/flib/MyForeignLib.hs, using the described method.

First, the type definition of the function receives the buffer with the User argument, and a buffer to write the User result to. We cannot use tuples because they are not supported in foreign export declarations, but the intuition is that the first two arguments represent the original User input, and the two latter arguments represent the returned User.

c_birthday :: Ptr CChar -> Int -> Ptr CChar -> Ptr Int -> IO ()

Then, the implementation – decode the argument, encode the result, write result size to the given memory location and the result itself to the buffer, if it fits.

c_birthday cstr clen result size_ptr = do

We transform the (Ptr CChar, Int) pair into a ByteString using unsafePackCStringLen, and decode a User from the ByteString using decodeStrict:

  -- (1) Decode C string
  Just user <- decodeStrict <$> unsafePackCStringLen (cstr, clen)

We apply the original birthday function to the decoded user. In our example, this is a very boring function, but in reality this is likely a complex idiomatic Haskell function that we want to expose to Swift.

  -- (2) Apply `birthday`
  let user_new = birthday user

We encode the user_new :: User as a ByteString, and use unsafeUseAsCStringLen to get a pointer to the bytestring data and its length. Finally, we get the size of the result buffer, write the actual size of the result to the given memory location, and, if the actual size fits the buffer, copy the bytes from the bytestring to the given buffer.

  -- (3) Encode result
  unsafeUseAsCStringLen (toStrict $ encode user_new) $ \(ptr,len) -> do

    -- (3.2) What is the size of the result buffer?
    size_avail <- peek size_ptr

    -- (3.3) Write actual size to the int ptr.
    poke size_ptr len

    -- (3.4) If sufficient, we copy the result bytes to the given result buffer
    if size_avail < len
       then do
         -- We need @len@ bytes available
         -- The caller has to retry
         return ()
       else do
         moveBytes result ptr len

If the written required size is larger than the given buffer, the caller will retry.

Of course, we must export this as a C function.

foreign export ccall c_birthday :: Ptr CChar -> Int -> Ptr CChar -> Ptr Int -> IO ()

This makes the c_birthday function wrapper available to Swift in the generated header and at link time in the dynamic library.

Swift’s Perspective

In Swift, we want to be able to call the functions exposed from Haskell via their C wrappers from a wrapper that feels idiomatic in Swift. In our example, that means wrapping a call to c_birthday in a new Swift birthday function.

In ContentView.swift, we make User JSON-encodable/decodable by conforming to the Codable protocol:

struct User: Codable {
    // ...
}

Then, we implement the Swift side of birthday which simply calls c_birthday – the whole logic of birthday is handled by the Haskell side function (recall that birthday could be incredibly complex, and other functions exposed by Haskell will indeed be).

func birthday(user: User) -> User {
    // ...
}

Note: in the implementation, a couple of blocks have to be wrapped with a do { ... } catch X { ... } but I omit them in this text. You can see the commit relevant to the Swift function wrapper implementation in the repo with all of these details included.

First, we encode the Swift argument into JSON using the Data type (plus its length) that will serve as arguments to the foreign C function.

let enc = JSONEncoder()
let dec = JSONDecoder()

var data: Data = try enc.encode(user)
let data_len = Int64(data.count)

However, a Swift Data value, which represents the JSON as binary data, cannot be passed directly to C as a pointer. For that, we must use withUnsafeMutableBytes to get an UnsafeMutableRawBufferPointer out of the Data – that we can pass to the C foreign function. withUnsafeMutableBytes receives a closure that uses an UnsafeMutableRawBufferPointer in its scope and returns the value returned by the closure. Therefore we can return the result of calling it on the user Data we encoded right away:

return data.withUnsafeMutableBytes { (rawPtr: UnsafeMutableRawBufferPointer) in
    // here goes the closure that can use the raw pointer,
    // the code for which we describe below
}

We allocate a buffer for the C foreign function to insert the result of calling the Haskell function, and also allocate memory to store the size of the buffer. We use withUnsafeTemporaryAllocation to allocate a buffer that can be used in the C foreign function call. As with withUnsafeMutableBytes, this function also takes a closure and returns the value returned by the closure:

// The data buffer size
let buf_size = 1024 * 1024 // 1024KB

// A size=1 buffer to store the length of the result buffer
return withUnsafeTemporaryAllocation(of: Int.self, capacity: 1) { size_ptr in
    // Store the buffer size in this memory location
    size_ptr.baseAddress?.pointee = buf_size

    // Allocate the buffer for the result (we need to wrap this in a do { ...} catch for reasons explained below)
    do {
        return withUnsafeTemporaryAllocation(byteCount: buf_size, alignment:1) { res_ptr in

            // Continues from here ...
        }
    } catch // We continue here in due time ...
}

We are now nested deep within 3 closures: the first binds the pointer to the argument’s data, the second the pointer to the buffer size, and the third the result buffer pointer. This means we can now call the C foreign function wrapping the Haskell function:

c_birthday(rawPtr.baseAddress, data_len, res_ptr.baseAddress, size_ptr.baseAddress)

Recalling that the Haskell side will update the size pointed to by size_ptr to the size required to serialize the encoded result, we need to check if this required size exceeds the buffer we allocated, or read the data otherwise:

if let required_size = size_ptr.baseAddress?.pointee {
    if required_size > buf_size {
        // Need to try again
        throw HsFFIError.requiredSizeIs(required_size)
    }
}

return dec.decode(User.self, from: Data(bytesNoCopy: res_ptr.baseAddress!,
                    count: size_ptr.baseAddress?.pointee ?? 0, deallocator: .none))

where HsFFIError is a custom error defined as

enum HsFFIError: Error {
    case requiredSizeIs(Int)
}

We must now fill in the catch block to retry the foreign function call with a buffer of the right size:

} catch HsFFIError.requiredSizeIs(let required_size) {
    return withUnsafeTemporaryAllocation(byteCount: required_size, alignment:1)
    { res_ptr in
        size_ptr.baseAddress?.pointee = required_size
        c_birthday(rawPtr.baseAddress, data_len, res_ptr.baseAddress, size_ptr.baseAddress)

        return dec.decode(User.self, from: Data(bytesNoCopy: res_ptr.baseAddress!,
                    count: size_ptr.baseAddress?.pointee ?? 0, deallocator: .none))
    }
}

That seems like a lot of work to call a function from Haskell! However, despite this being a lot of code, not a whole lot is happening: we simply serialize the argument, allocate a buffer for the result, and deserialize the result from it. In the worst case, if the serialized result does not fit (the serialized data has over 1M characters), then we naively compute the function a second time (it should not be terribly complicated to avoid this work by caching the result and somehow resuming the serialization with the new buffer). Furthermore, there is a lot of bureaucracy in getting the raw pointers to send off to Haskell land – the good news is that all of this can be automated away behind automatic code generation with Template Haskell and Swift Macros.


The Haskell Unfolder Episode 22: foldr-build fusion

andres, edsko — Wed, 20 Mar 2024 00:00:00 GMT

Today, 2024-03-20, at 1930 UTC (12:30 pm PDT, 3:30 pm EDT, 7:30 pm GMT, 20:30 CET, …) we are streaming the 22nd episode of the Haskell Unfolder live on YouTube.

The Haskell Unfolder Episode 22: foldr-build-fusion

When composing several list-processing functions, GHC employs an optimisation called foldr-build fusion. Fusion combines functions in such a way that any intermediate lists can often be eliminated completely. In this episode, we will look at how this optimisation works, and at how it is implemented in GHC: not as built-in compiler magic, but rather via user-definable rewrite rules.
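
As a taste of the idea, here is a minimal sketch of a foldr-build style rewrite rule written in user code. It is not the definition from GHC's libraries: myBuild, myFoldr, upTo and sumTo are hypothetical local names, and whether the rule actually fires depends on optimisation settings and inlining behaviour. The results are the same either way; the rule only removes the intermediate list.

```haskell
{-# LANGUAGE RankNTypes, ScopedTypeVariables #-}
module FusionSketch where

-- A list producer abstracted over the list constructors.
myBuild :: forall a. (forall b. (a -> b -> b) -> b -> b) -> [a]
myBuild g = g (:) []
{-# NOINLINE [1] myBuild #-}

-- A local copy of foldr, so that the rule below has a plain function
-- (rather than a class method) to match on.
myFoldr :: (a -> b -> b) -> b -> [a] -> b
myFoldr k z = go
  where
    go []       = z
    go (x : xs) = k x (go xs)
{-# NOINLINE [1] myFoldr #-}

-- The fusion rule: consuming a built list with myFoldr is the same as
-- handing the consumer's "cons" and "nil" directly to the producer.
{-# RULES
"myFoldr/myBuild" forall k z (g :: forall b. (a -> b -> b) -> b -> b).
    myFoldr k z (myBuild g) = g k z
  #-}

-- Producer written with myBuild: the list [1 .. n].
upTo :: Int -> [Int]
upTo n = myBuild producer
  where
    producer :: (Int -> b -> b) -> b -> b
    producer cons nil = go 1
      where
        go i
          | i > n     = nil
          | otherwise = cons i (go (i + 1))
{-# INLINE upTo #-}

-- Consumer written with myFoldr; if the rule fires, no list is built.
sumTo :: Int -> Int
sumTo n = myFoldr (+) 0 (upTo n)
```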

About the Haskell Unfolder

The Haskell Unfolder is a YouTube series about all things Haskell hosted by Edsko de Vries and Andres Löh, with episodes appearing approximately every two weeks. All episodes are live-streamed, and we try to respond to audience questions. All episodes are also available as recordings afterwards.

We have a GitHub repository with code samples from the episodes.

And we have a public Google calendar (also available as ICal) listing the planned schedule.

GHC activities report: December 2023–February 2024

adam, andreask, ben, finley, hannes, matthew, rodrigo, sam, zubin — Fri, 08 Mar 2024 00:00:00 GMT

This is the twenty-second edition of our GHC activities report, which describes the work on GHC, Cabal and related projects that we are doing at Well-Typed. The current edition covers roughly the months of December 2023 to February 2024. You can find the previous editions collected under the ghc-activities-report tag.

Many thanks to our sponsors who make this work possible: Anduril, Hasura and Juspay. In addition, we are grateful to Mercury for funding specific work on improved performance for developer tools on large codebases, and to the Sovereign Tech Fund for funding work on Cabal.

However, we need more sponsorship to sustain the team! If your company might be able to contribute funding to sustain this work, please read about how you can help or get in touch.

Of course, Haskell tooling is a large community effort, and Well-Typed’s contributions are just a small part of this. This report does not aim to give an exhaustive picture of all GHC work that is ongoing, and there are many fantastic features currently being worked on that are omitted here simply because none of us are currently involved in them. Furthermore, the aspects we do mention are still the work of many people. In many cases, we have just been helping with the last few steps of integration. We are immensely grateful to everyone contributing to GHC!

Team

The GHC team at Well-Typed currently consists of Ben Gamari, Andreas Klebinger, Matthew Pickering, Zubin Duggal, Sam Derbyshire and Rodrigo Mesquita, with Hannes Siebenhandl joining the team in January and Finley McIlwaine moving to another client project. In addition, many others within Well-Typed are contributing to GHC more occasionally.

Releases

Zubin released GHC 9.6.4 in January and GHC 9.8.2 in February. We are now working towards the release of GHC 9.10 later in the year. Check out the GHC status page for more information on release plans.

Eras profiling

Matthew and Zubin recently implemented a new profiling mode, eras profiling, that can give insight into when particular objects are allocated. This can be a great boon in diagnosing memory leaks in long-running programs.

Check out our blog post introducing eras profiling for more information about this new feature, and an exploration of how we used this new profiling mode to diagnose a memory leak in GHCi. Matthew also used eras profiling to diagnose a space leak in GHC’s simplifier (!11914).

The combination of eras profiling and ghc-debug works particularly well for analysing memory leaks, so Zubin has been making various improvements to ghc-debug (MR 32), including improving how it handles profiled executables (MR 35, MR 36).

A new home for GHC’s internals

GHC’s base library has long served a dual purpose: on one hand it is the user-facing standard library interface, but at the same time it contains many internal details used to implement the standard library. This dual purpose led to problems for both implementors and users alike, as internal interfaces are freely interspersed with long-stable interfaces intended for general consumption. Even worse, the documentation of base often provided little guidance to users regarding which interfaces fell into which category.

Earlier this year, the Core Libraries Committee and GHC Team agreed a path to improve this situation by splitting base into three libraries: base, ghc-internal, and ghc-experimental. Our hope is that this approach will allow us to solve several problems at once:

base gives users a clearly-demarcated set of stable interfaces, overseen by the Core Libraries Committee.
ghc-experimental gives developers of new language and library features a dedicated place to iterate on their designs while still allowing usage to users willing to accept a slightly lower degree of stability.
ghc-internal provides a home for internal implementation details that are not intended for consumption by users, and may change from release to release.

Ben has been working on implementing this split by separating out definitions that belong in the ghc-internal package (!11400). This split has led to a number of improvements across the ecosystem, ranging from Haddock improvements (see Haddock issues 1629, 1630) to compiler bug-fixes (#24436) and implementation cleanups (#24472).

Exception backtraces

Ben has been working to land his long-running and long-awaited Exception Backtrace Proposal (!8869) following extensive discussions with the Core Libraries Committee. This is expected to form part of GHC 9.10 and will be a major step towards making exception diagnosis easier for users.

GHC Steering Committee and `GHC2024`

Adam has now taken on the role of Secretary to the GHC Steering Committee, following Joachim Breitner stepping down after many years of dedicated service in the role. His first major task as secretary has been seeking new volunteers to serve on the committee. If you would be interested, please read more and get in touch.

The committee has updated the collection of recommended language extensions by introducing GHC2024. GHC 9.10 will ship with GHC2024 available (!12084), but it is unclear when it will become the default (see ghc-proposals MR 632).

STM correctness and performance

Andreas has been diagnosing progress and performance issues with STM prompted by a user reporting STM starvation problems (#24142). In particular:

STM transaction performance scales badly with the number of TVars involved (#24410), because the current implementation uses a linked list to keep track of all TVars used by a transaction. Ben explored one approach for improving this situation, using a hashmap for these lookups (!12030).
Transactions with a large number of TVars may perform badly (#24427) due to a check performed by the RTS each time Haskell threads return to the scheduler. This check identifies potentially non-terminating STM transactions by validating the transaction’s view of the STM memory against the memory’s current state. While very useful, this check is somewhat costly to perform, and under the current implementation can also lead to false negatives when multiple validations happen in parallel. It is likely that the best solution for this issue is to perform validations less frequently, especially on long running transactions.
In pathological cases, two transactions run in parallel may be unable to make progress (#24446), even if all transactions are read-only. This should be solvable with a rework of how TVars are locked during validation.
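
To make the shape of the affected code concrete, here is a small sketch of a read-only transaction that touches many TVars. It uses the STM primitives exported from GHC.Conc in base, so no extra packages are assumed; the function name sumTVars is hypothetical, and the sketch only illustrates the kind of transaction being discussed, not a reproduction of the reported issues.

```haskell
module StmSketch where

import GHC.Conc (TVar, atomically, readTVar)

-- A single read-only transaction whose read set contains every TVar in
-- the list. The more TVars it reads, the more entries the runtime has
-- to track and validate for this one transaction.
sumTVars :: [TVar Int] -> IO Int
sumTVars tvs = atomically (sum <$> mapM readTVar tvs)
```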

Unfortunately, fixing these issues will require further work.

Specialisation and late plugins

Finley has been exploring techniques to make it easier to diagnose issues with specialisation in large applications, such as poor runtime performance due to overloaded calls not being specialised. One workaround for such problems is exposing all unfoldings and using aggressive specialisation, but this tends to lead to poor compile-time performance instead.

Motivated by these investigations he added “late plugins”, which are plugins that are run at the very end of the Core pipeline, after the addition of late cost centres (!11765). This allows plugins to analyse and modify the Core that is compiled down to STG, without the changes ending up in interface files.

Cabal

Matthew, Rodrigo and Sam have been working to address longstanding architectural and maintenance issues in the Cabal library and the cabal-install build tool. This work is being supported by the Sovereign Tech Fund as discussed in our previous blog post.

Some of the changes have included:

Designing and implementing a new build-type: Hooks feature to provide a path towards deprecating build-type: Custom. Based on community feedback, Sam iterated on the design, with a particular focus on pre-build rules, arriving at a design inspired by Cloud Haskell, using static pointers. See the detailed HF Tech Proposal for an in-depth explanation of the design and its benefits. The implementation is now being prepared for review (PR 9551).
Disentangling implicit global state from the Cabal library, allowing it to take a working directory as an argument instead of using the working directory of the current process (PR 9718). This is intended to allow directly calling the Cabal library to build packages in a concurrent setting.
Working on a design and prototype implementation for private dependencies (issue 4035), allowing packages to express the fact that they do not expose any types from a dependency in their API. This gives greater flexibility to construct build plans, potentially making library version upgrades easier, and allows tests and benchmarks to compare different versions of the same library.
Making the testsuite more robust, including refactoring it to run tests in a separate temporary directory so they are not influenced by the external configuration of the user’s system (PR 9717).
Allowing per-component builds with Haskell Program Coverage (HPC) information (PR 9464).
Refactoring to eliminate long-standing code duplication that was a regular source of bugs in the logic for building components (PR 9602) and in glob support (PR 9673).
Fixing several longstanding bugs with the install command often ignoring CLI flags (PR 9697).
Robustly handling the same GHC version having been compiled from source multiple times (PR 9618), as the GHC version number is not enough to ensure ABI-compatibility.
Many more bug fixes and refactorings to improve maintainability and robustness of the codebase (e.g. PR 9524, PR 9554).

GHC bug fixes

Ben investigated memory-ordering issues using ThreadSanitizer and fixed numerous data races (!9372, !11795, !11768).
Ben fixed a thread-safety issue due to GHC’s use of the C strerror utility (#24344).
Sam fixed a 9.8 regression in shadowing error messages involving record fields with no field selectors (!11981).
Hannes fixed a 9.8 regression in how Haddock resolves qualified references (!11920).
Zubin fixed a regression in which GHC reported a poor error message in the presence of module cycles including hs-boot files (!11718, !11792).
Zubin fixed cross-module breakpoints using incorrect cost centres (!11892).
Sam and Andreas fixed a variety of bugs in the handling of fused-multiply-add primops that were added in GHC 9.8.1 (!11587, !11893, !11902, !11987).
Ben fixed a subtle bug in the implementation of unique generation on 32-bit platforms (!11802).
Andreas fixed a bug in the C foreign-function interface that was introduced by using sub-word-sized arguments (!11989).
Zubin set -DPROFILING when compiling C++ sources with profiling (!11871).
Matthew fixed an off-by-one error when handling info-table provenance entries (!11873).
Zubin fixed a bug with ghcup-metadata generation (!11791).
Zubin updated the users’ guide to take into account the unrestricted overloaded labels GHC proposal, which landed in GHC 9.6 (!11774).
Hannes fixed a bug arising from GHC being installed at a filepath that includes spaces on Windows (!11938).

Build system, CI and distribution improvements

Ben carried out a number of submodule bumps in preparation for the GHC 9.10 release.
Rodrigo allowed the configure script to use autoconf 2.72 (!11942).
Matthew fixed a bug in the configuration of hsc2hs when building GHC, which was the source of linker errors (#24050, !11384).
Matthew updated the CI images, with a particular focus on improving the testing of the LLVM backend on CI (#24369, !11976).
Matthew ensured that documentation is built on more configurations in CI (e.g. on alpine, rocky8, Windows, Darwin) (!12134).
Ben adapted GHC to LLVM’s new pass manager CLI (!8999).

The Haskell Unfolder Episode 21: testing without a reference

andres, edsko — Wed, 06 Mar 2024 00:00:00 GMT

Today, 2024-03-06, at 1930 UTC (11:30 am PST, 2:30 pm EST, 7:30 pm GMT, 20:30 CET, …) we are streaming the 21st episode of the Haskell Unfolder live on YouTube.

The Haskell Unfolder Episode 21: testing without a reference

The best case scenario when testing a piece of software is when we have a reference implementation to compare against. Often, however, such a reference is not available, raising the question of how to test a function if we cannot verify what that function computes exactly. In this episode we will consider how to define properties to verify the implementation of Dijkstra’s shortest path algorithm we discussed in Episode 20; you may wish to watch that episode first, but it’s not required: we will mostly treat the algorithm as a black box for the sake of testing it.

About the Haskell Unfolder

We have a GitHub repository with code samples from the episodes.

And we have a public Google calendar (also available as ICal) listing the planned schedule.

The Haskell Unfolder Episode 20: Dijkstra's shortest paths

andres, edsko — Wed, 21 Feb 2024 00:00:00 GMT

Today, 2024-02-21, at 1930 UTC (11:30 am PST, 2:30 pm EST, 7:30 pm GMT, 20:30 CET, …) we are streaming the 20th episode of the Haskell Unfolder live on YouTube.

The Haskell Unfolder Episode 20: Dijkstra’s shortest paths

In this (beginner-friendly) episode, we will use Dijkstra’s shortest paths algorithm as an example of how one can go about implementing an algorithm given in imperative pseudo-code in idiomatic Haskell. We will focus on readability, not on performance.

About the Haskell Unfolder

We have a GitHub repository with code samples from the episodes.

And we have a public Google calendar (also available as ICal) listing the planned schedule.

The Haskell Unfolder Episode 19: a new perspective on foldl'

andres, edsko — Wed, 31 Jan 2024 00:00:00 GMT

Today, 2024-01-31, at 1930 UTC (11:30 am PST, 2:30 pm EST, 7:30 pm GMT, 20:30 CET, …) we are streaming the 19th episode of the Haskell Unfolder live on YouTube.

The Haskell Unfolder Episode 19: a new perspective on foldl’

In this beginner-oriented episode we introduce a useful combinator called repeatedly, which captures the concept “repeatedly execute an action to a bunch of arguments”. We will discuss both how to implement this combinator as well as some use cases.
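As a rough sketch of what such a combinator can look like, here is one plausible definition in terms of foldl', in line with the episode title. The helper functions countOne and countAll are hypothetical and exist only to show a use case; the episode itself may present things differently.

```haskell
import Data.List (foldl')

-- Apply an update function once for each argument in the list, threading
-- the accumulated value through strictly from left to right.
repeatedly :: (a -> b -> b) -> [a] -> b -> b
repeatedly f as b = foldl' (flip f) b as

-- Hypothetical use case: count one occurrence of a character in an
-- association list of counters, adding a new counter if needed.
countOne :: Char -> [(Char, Int)] -> [(Char, Int)]
countOne c [] = [(c, 1)]
countOne c ((k, n) : rest)
  | k == c    = (k, n + 1) : rest
  | otherwise = (k, n) : countOne c rest

-- Count every character of a string by repeatedly applying countOne.
countAll :: String -> [(Char, Int)]
countAll s = repeatedly countOne s []
```

The argument order of repeatedly matches functions such as Data.Map.insert, which take the element first and the structure last, so they can be passed directly without flipping.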

About the Haskell Unfolder

We have a GitHub repository with code samples from the episodes.

And we have a public Google calendar (also available as ICal) listing the planned schedule.

Eras profiling for GHC

matthew, zubin — Thu, 25 Jan 2024 00:00:00 GMT

Memory detectives now have many avenues of investigation when looking into memory usage problems in Haskell programs. You might start by looking at what has been allocated: which types of closures and which constructors are contributing significantly to the problem. Then perhaps it’s prudent to look at why a closure has been allocated, using the info table provenance information. This will tell you which point in the source code your allocations are coming from. But if you then turn to investigate when a closure was allocated during the lifecycle of your program, you end up stuck.

Existing Haskell heap profiling tools work by taking regular samples of the heap to generate a graph of heap usage over time. This can give an aggregate view, but makes it difficult to determine when an individual closure was allocated.

Eras profiling is a new GHC profiling mode that will be available in GHC 9.10 (!11903). For each closure it records the “era” during which it was allocated, thereby making it possible to analyse the points at which closures are allocated much more precisely.

In this post, we are going to explore this profiling mode that makes it easier to find space leaks and identify suspicious long lived objects on the heap. We have discussed ghc-debug before, and we are going to make use of it to explore the new profiling mode using some new features added to the ghc-debug-brick TUI.

Introduction to eras profiling

The idea of eras profiling is to mark each closure with the era in which it was allocated. An era is simply a Word. The era can then be used to create heap profiles by era and can also be inspected by ghc-debug.

To enable eras profiling, you compile programs with profiling enabled and run with the +RTS -he option.

The era starts at 1, then there are two means by which it can be changed:

User: The user program has control to set the era explicitly (by using functions in GHC.Profiling.Eras).
Automatic: The era is incremented at each major garbage collection (enabled by --automatic-era-increment).

The user mode is most useful as this allows you to provide domain specific eras. There are three new primitive functions exported from GHC.Profiling.Eras for manipulating the era:

setUserEra :: Word -> IO ()
getUserEra :: IO Word
incrementUserEra :: Word -> IO Word

Note that the current era is a program global variable, so if your program is multi-threaded then setting the era will apply to all threads.
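One way to use the user mode is to wrap each phase of a program in a helper that sets the era and restores the previous one afterwards. The sketch below is an illustration under assumptions: withEra, EraOps and mockEraOps are hypothetical names, and the era counter is modelled with an IORef so that the example runs without a profiled build. In a real program the two record fields would be filled with setUserEra and getUserEra from GHC.Profiling.Eras.

```haskell
import Control.Exception (finally)
import Data.IORef (newIORef, readIORef, writeIORef)

-- The two operations the helper needs. With a profiled GHC 9.10 build these
-- would be setUserEra and getUserEra from GHC.Profiling.Eras.
data EraOps = EraOps
  { setEra :: Word -> IO ()
  , getEra :: IO Word
  }

-- Assumption: a stand-in era counter backed by an IORef, used only to keep
-- this sketch runnable without profiling. Like the real era, it starts at 1.
mockEraOps :: IO EraOps
mockEraOps = do
  ref <- newIORef 1
  pure EraOps { setEra = writeIORef ref, getEra = readIORef ref }

-- Run an action in the given era, restoring the previous era afterwards,
-- even if the action throws an exception.
withEra :: EraOps -> Word -> IO a -> IO a
withEra ops era action = do
  previous <- getEra ops
  setEra ops era
  action `finally` setEra ops previous

-- Observe the era before, during and after a phase run in era 2.
demo :: IO [Word]
demo = do
  ops <- mockEraOps
  before <- getEra ops
  during <- withEra ops 2 (getEra ops)
  after <- getEra ops
  pure [before, during, after]
```

Because the real era is a program global variable, a helper like this only gives clean results when phases do not overlap across threads.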

Below is an example of an eras profile rendered using eventlog2html. The eras have been increased by the user functions, and the programmer has defined 4 distinct eras.

Diagnosing a GHCi memory leak

Recently, we came across a regression in GHCi’s memory behaviour (#24116), where reloading a project would use double the amount of expected memory. During each reload of a project, the memory usage would uniformly increase, only to return to the expected level after the reload had concluded.

Reproducing the problem

In order to investigate the issue we loaded Agda into a GHCi session and reloaded the project a few times. Agda is the kind of project we regularly use to analyse compiler performance as it’s a typical example of a medium size Haskell application.

The profile starts with the initial load of Agda into GHCi, then each subsequent vertical line represents a :reload call.

We can see that while loading the project a second time, GHCi seems to hold on to all of the in-memory compilation artifacts from the first round of compilation, before releasing them right as the load finishes. This is not expected behaviour for GHCi, and ideally it should stay at a roughly constant level of heap usage during a reload.

During a reload, GHCi should either

Keep compilation artifacts from the previous build in memory if it determines that they are still valid after recompilation avoidance checking
Release them as soon as it realizes they are out of date, replacing them with fresh artifacts

In either case, the total heap usage shouldn’t change, since we are either keeping old artifacts or replacing them with new ones of roughly similar size. This task is a perfect fit for eras profiling: if we can assign a different era to each reload, then we should be able to easily confirm our hypothesis about the memory usage pattern.

Instrumenting GHCi

First we instrument the program so that the era is incremented on each reload. We do this by modifying the load function in GHC to increment the era on each call:

--- a/compiler/GHC/Driver/Make.hs
+++ b/compiler/GHC/Driver/Make.hs
@@ -153,6 +153,7 @@ import GHC.Utils.Constants
 import GHC.Types.Unique.DFM (udfmRestrictKeysSet)
 import GHC.Types.Unique
 import GHC.Iface.Errors.Types
+import GHC.Profiling.Eras

 import qualified GHC.Data.Word64Set as W

@@ -702,6 +703,8 @@ load' mhmi_cache how_much diag_wrapper mHscMessage mod_graph = do
     -- In normal usage plugins are initialised already by ghc/Main.hs this is protective
     -- for any client who might interact with GHC via load'.
     -- See Note [Timing of plugin initialization]
+    era <- liftIO getUserEra
+    liftIO $ setUserEra (era + 1)
     initializeSessionPlugins
     modifySession $ \hsc_env -> hsc_env { hsc_mod_graph = mod_graph }
     guessOutputFile

Then when running the benchmark with eras profiling enabled, the profile looks as follows:

Now we can clearly see that after the reload (the vertical line), all the memory allocated during era 2 remains alive, while newly allocated memory belongs to era 3.

Identifying a culprit closure

With the general memory usage pattern established, it’s time to look more closely at the culprits. By performing an info table profile and looking at the detailed tab, we can identify a specific closure which contributes to the bad memory usage pattern.

The GRE closure is one of the top entries in the info table profile, and we can see that its pattern of heap usage matches the overall shape of the graph, which means that we are probably incorrectly holding on to GREs from the first round of compilation.

Now we can turn to more precise debugging tools in order to actually determine where the memory leak is.

Looking closer with `ghc-debug`

We decided to investigate the leak using ghc-debug. After instrumenting the GHC executable we can connect to a running ghc process with ghc-debug-brick and explore its heap objects using a TUI.

Tracing retainers with `ghc-debug`

To capture the leak, we pause the GHC process right before it finishes the reload, while it is compiling the final few modules in Agda. Remember that all the memory is released after the end of the reload to return to the expected baseline.

In order to check for the cause of the leak, we do a search for the retainers of GRE closures in ghc-debug-brick. We are searching for GRE because the info table profile indicated that this was one type of closure which was leaking.

Before eras profiling, if we tried to use this knowledge and ghc-debug-brick to find out why the GREs are being retained then we got a bit stuck. Looking at the interface we can’t distinguish between the two distinct classes of live GREs:

Fresh GREs from the current load (era 3), which we really do need in memory.
Stale GREs from the first load (era 2), which shouldn’t be live anymore and should have been released.

This is the retainer view of ghc-debug, where all closures matching our search (constructor is GRE) are listed, and expanding a closure shows a tree with a retainer stack of all the heap objects which retain the closure. Reading this stack downwards you can determine a chain of references through which any particular closure is retained, going all the way back to a GC root. Inspecting the retainer stack can shed light on why your program is holding on to a particular object.
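The idea of a retainer stack can be illustrated on a toy heap. The sketch below is not how ghc-debug is implemented; the Heap representation and the retainerPath function are hypothetical. It models the heap as a list of objects with their outgoing pointers and searches breadth-first from the GC roots for a chain of references that keeps a target object alive.

```haskell
import Data.Maybe (fromMaybe)

-- A toy heap: each object is identified by a number and lists the objects
-- it points to.
type Ptr = Int
type Heap = [(Ptr, [Ptr])]

-- The objects directly referenced by a given object.
pointees :: Heap -> Ptr -> [Ptr]
pointees heap p = fromMaybe [] (lookup p heap)

-- Breadth-first search from the GC roots. Returns one shortest chain of
-- references from a root down to the target, or Nothing if the target is
-- unreachable (and would therefore be garbage collected).
retainerPath :: Heap -> [Ptr] -> Ptr -> Maybe [Ptr]
retainerPath heap roots target = go [] [[r] | r <- roots]
  where
    -- Each queue entry is a path stored in reverse: its head is the object
    -- currently being visited.
    go _ [] = Nothing
    go seen ([] : queue) = go seen queue
    go seen (path@(p : _) : queue)
      | p == target   = Just (reverse path)
      | p `elem` seen = go seen queue
      | otherwise     = go (p : seen) (queue ++ [c : path | c <- pointees heap p])

-- Example heap: root 1 points to 2 and 3, 2 points to 4, 4 points to 5,
-- and 5 points back to 2, forming a cycle.
exampleHeap :: Heap
exampleHeap = [(1, [2, 3]), (2, [4]), (3, []), (4, [5]), (5, [2])]
```

Evaluating retainerPath exampleHeap [1] 5 yields the chain from root 1 through objects 2 and 4 to object 5, which is the kind of information a retainer stack presents for a real heap object.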

We can try to scroll through the list of GREs in the TUI, carefully inspecting the retainer stack of each in turn and using our domain knowledge to classify each GRE closure as belonging to one of the two categories above.

However, this process is tedious and error prone, especially given that we have such a large number of potentially leaking objects to inspect. Depending on the order that ghc-debug happened to traverse the heap, we might find leaking entries after inspecting the first few items, or we may be very unlucky and all the leaking items might be hundreds or thousands of entries deep into the list.

`ghc-debug` supercharged with eras profiling

Now with eras profiling there are two extensions to ghc-debug which make it easy to distinguish these two cases, since each object now records the era in which it was allocated.

Filtering By Era: You can filter the results of any search to only include objects allocated during particular eras, given by ranges.
Colouring by Era: You can also enable colouring by era, so that the background colour of entries in ghc-debug-brick is selected based on the era the object was allocated in, making it easy to visually partition the heap and quickly identify leaking objects.
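Conceptually, filtering by era is a simple predicate over the search results. The following sketch models it on plain data; the Closure and EraRange types and the filterByEra function are hypothetical and do not reflect the ghc-debug library API.

```haskell
-- An inclusive range of eras, for example eras 2 to 2 or eras 1 to 5.
data EraRange = EraRange { eraFrom :: Word, eraTo :: Word }

-- A simplified search result: a closure name and the era it was allocated in.
data Closure = Closure { closureName :: String, closureEra :: Word }
  deriving (Eq, Show)

-- Does the era fall inside the given range?
inRange :: EraRange -> Word -> Bool
inRange (EraRange lo hi) era = lo <= era && era <= hi

-- Keep only the closures allocated during one of the given era ranges.
filterByEra :: [EraRange] -> [Closure] -> [Closure]
filterByEra ranges = filter (\c -> any (`inRange` closureEra c) ranges)

-- Example search results mixing stale (era 2) and fresh (era 3) closures.
searchResults :: [Closure]
searchResults =
  [ Closure "GRE" 2
  , Closure "GRE" 3
  , Closure "ModIface" 1
  , Closure "GRE" 2
  ]
```

Applying filterByEra [EraRange 2 2] to searchResults keeps only the two closures from era 2, which in the GHCi investigation are exactly the stale candidates worth inspecting.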

So now, if we enable filtering by era, it’s easy to distinguish the new and old closures.

With the new filtering mode, we search for retainers of GREs which were allocated in era 2. Now we can inspect the retainer stacks of any one of these closures with the confidence that it has leaked. Because ghc-debug-brick is also colouring by era, we can also easily identify roughly where in the retainer stack the leak occurs, because we can see new objects (from era 3) holding on to objects from the previous era (era 2).

We can see that a GRE from era 2 (green) is being retained, through a thunk from GHC.Driver.Make allocated in era 3 (yellow):

The location of the thunk tells us the exact location of the leak, and it is now just a matter of understanding why this code is holding on to the unwanted objects and plugging the leak. For more details on the actual fix, see !11608.

Conclusion

We hope the new ghc-debug features and the eras profiling mode will be useful to others investigating the memory behaviour of their programs, making it easier to identify leaking objects which should not be retained in memory.

Well-Typed are always interested in projects and looking for funding to improve GHC and other Haskell tools. Please contact info@well-typed.com if we might be able to work with you!
