TL;DR Build your Haskell projects 10-15% faster with this one simple trick! (Spoiler: the simple trick is to wait for the next major cabal-install release.)

In previous work (paid for by the Sovereign Tech Fund) we did a lot of heavy lifting to make a major architectural change to Cabal. That work is now paying off with practical benefits. This post covers follow-on architectural improvements to cabal-install which then enable us to eliminate redundant work in the configure phase, yielding significant reductions in build times.

The changes will be available to everyone in the next major cabal-install release. For a large project like pandoc (including all of its dependencies) we measure a 10% (std.dev. 0.6pp) reduction in wall clock time for a 16-way parallel build with --semaphore. No user changes are needed to take advantage of this improvement.

History: Cabal and cabal-install

The genesis: the Cabal specification

First, there was Cabal. Its design was laid out in A Common Architecture for Building Applications and Tools. Fundamentally, it defines the notion of a package, with each package being built and installed with the following sequence of commands:

> hc Setup.hs
> ./Setup configure
> ./Setup build
> ./Setup install

Each package must be built in dependency order, with hc-pkg registering each installed library into a package database.

Orchestrating the build of multiple packages

cabal-install was then born to plan and execute builds consisting of many packages. With its solver, it determines a build plan, which is then carried out by running the above sequence of commands for each package, in dependency order.

There is however one architectural mismatch: for the solver to be able to compute a build plan, it already needs a lot of information about the current system:

  • What Haskell compiler are we using?
  • What system libraries are available (pkgconfig-depends)?
  • What build tools are available (build-tool-depends)?

This means that cabal-install already has in its hands most of the information necessary for configuring a package; in particular it has already resolved all the conditionals in every package description. We should thus be able to skip most of the steps in the package’s ./Setup configure phase. However, the command-line interface of ./Setup configure makes it practically impossible to do so: passing a fully resolved dependency graph would require many additions to the already bloated ConfigFlags datatype, and a lot more data being serialised/deserialised.

Because of this limitation, cabal-install’s approach was to take its hard-won build plan and convert it into ConfigFlags that specify exact dependency versions and flag assignments. This amounts to passing ./Setup configure an already fully constrained configuration; the configure step would then re-probe the system, re-read package databases… only to re-discover exactly what cabal-install already knew!
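To make the round trip concrete, here is a minimal sketch of turning one package's slice of a build plan into arguments for ./Setup configure. The PlannedPackage record, its fields, and the flag name and unit ids in the example are hypothetical simplifications, not cabal-install's actual types or data. The options --exact-configuration, --flags and --dependency do exist in ./Setup configure, but a real invocation passes many more.

```haskell
module Main (main) where

-- | Hypothetical, simplified view of one package in a solved build plan.
-- cabal-install's real plan types carry far more information.
data PlannedPackage = PlannedPackage
  { plannedFlags :: [(String, Bool)]   -- ^ flag assignment chosen by the solver
  , plannedDeps  :: [(String, String)] -- ^ (package name, installed unit id)
  }

-- | Render the already-solved plan as arguments for @./Setup configure@.
configureArgs :: PlannedPackage -> [String]
configureArgs pkg =
  ["--exact-configuration"]
    ++ [ "--flags=" ++ unwords (map renderFlag (plannedFlags pkg))
       | not (null (plannedFlags pkg)) ]
    ++ [ "--dependency=" ++ name ++ "=" ++ unitId
       | (name, unitId) <- plannedDeps pkg ]
  where
    -- Enabled flags are written +flag, disabled flags -flag.
    renderFlag (flag, enabled) = (if enabled then '+' else '-') : flag

main :: IO ()
main = putStrLn (unwords ("./Setup" : "configure" : configureArgs example))
  where
    -- Made-up flag name and unit ids, for illustration only.
    example = PlannedPackage
      { plannedFlags = [("example-flag", True)]
      , plannedDeps  = [("base", "base-x.y.z"), ("text", "text-x.y.z")]
      }
```

Everything in these arguments was computed by the solver; the Setup executable then has to parse them and check them against the system all over again.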

A new architecture for cabal-install

The paradigm shift proposed in our Sovereign Tech Fund proposal is that cabal-install should be responsible for orchestrating the whole build process instead of running the conceptually independent build systems provided by each package. With cabal-install now in control, it can directly call Cabal library functions, which in turn allows skipping steps in the configure phase that waste time re-discovering information that cabal-install is already aware of.

To implement such a change, we first needed to prepare the terrain. When invoking an external executable such as the Setup executable (say, via the process library, as Cabal does), we can set the working directory and environment variables, and redirect input/output handles. It was not possible to do the same when calling the Cabal library directly, so we first needed to add Cabal library support for setting the working directory and for choosing logging handles. Once this was done, we could refactor cabal-install to directly call Cabal library functions to build packages.
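As a reminder of what spawning a process gives us for free, the following sketch uses the process library's CreateProcess record to describe a ./Setup configure invocation with its own working directory, environment and output handles. The helper name setupConfigure and the example path and environment are assumptions made for illustration; this is not code from Cabal or cabal-install.

```haskell
module Main (main) where

import System.IO (Handle, stderr)
import System.Process (CreateProcess (..), StdStream (..), proc)

-- | Describe (but do not run) a @./Setup configure@ invocation.
-- The working directory, environment and output handles are all
-- per-process settings, so each package build is isolated from the others.
setupConfigure :: FilePath -> [(String, String)] -> Handle -> CreateProcess
setupConfigure pkgDir environment logHandle =
  (proc "./Setup" ["configure"])
    { cwd     = Just pkgDir          -- run inside the package directory
    , env     = Just environment     -- replace the inherited environment
    , std_out = UseHandle logHandle  -- send normal output to the log
    , std_err = UseHandle logHandle  -- send errors to the same log
    }

main :: IO ()
main = do
  -- Hypothetical path and environment; nothing is spawned here.
  let invocation = setupConfigure "path/to/package" [("PATH", "/usr/bin")] stderr
  print (cwd invocation)
```

A library call made inside the cabal-install process has none of these knobs by default: it shares the caller's working directory and handles, which is why the Cabal library needed explicit support for them.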

Performance impact

This architectural change provides a solid foundation for further improvements. Using a new --build-timings flag to cabal-install, we determined that the two main time sinks in the Cabal configure phase were:

  1. (~50% of configure time) Re-configuring the compiler program database. The compiler and hc-pkg were already pre-configured, but other programs such as haddock, ar, ld, etc. were configured anew for each package.
  2. (~40% of configure time) Re-probing the installed package database, via hc-pkg dump.

We can skip this extra work by pre-configuring the compiler ProgramDb and keeping a running InstalledPackageIndex. These two changes, taken together, reduce the time spent in the configure phase by over 90%.
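The idea behind the running index can be sketched with a toy model. Here the installed package index is just a Map, and we count how often the package database would be dumped under each scheme. All names (Index, Outcome, register, and so on) are hypothetical and do not correspond to Cabal's InstalledPackageIndex API; the same reasoning applies to configuring the ProgramDb once up front.

```haskell
module Main (main) where

import Data.List (foldl')
import qualified Data.Map.Strict as Map

type PackageName = String

-- | Toy stand-in for an installed package index: name -> unit id.
type Index = Map.Map PackageName String

-- | Result of a simulated build.
data Outcome = Outcome
  { dbDumps    :: Int   -- ^ how many times the package database was dumped
  , finalIndex :: Index -- ^ the index once every package is registered
  }

-- | Record a freshly built package in the in-memory index.
register :: PackageName -> Index -> Index
register name = Map.insert name (name ++ "-installed")

-- | Old scheme: every package's configure step dumps the database again.
reprobeEachTime :: [PackageName] -> Outcome
reprobeEachTime = foldl' step (Outcome 0 Map.empty)
  where
    step (Outcome dumps index) pkg = Outcome (dumps + 1) (register pkg index)

-- | New scheme: dump once up front, then update the index in memory.
runningIndex :: [PackageName] -> Outcome
runningIndex = foldl' step (Outcome 1 Map.empty)
  where
    step (Outcome dumps index) pkg = Outcome dumps (register pkg index)

main :: IO ()
main = do
  let plan = ["pkg-a", "pkg-b", "pkg-c"] -- hypothetical build order
  putStrLn ("dumps when re-probing:    " ++ show (dbDumps (reprobeEachTime plan)))
  putStrLn ("dumps with running index: " ++ show (dbDumps (runningIndex plan)))
```

Both schemes end with the same index; the only difference is how much redundant probing happens along the way, and that difference grows with the number of packages in the plan.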

While most of the time in builds is unsurprisingly spent… actually compiling Haskell code [citation needed], the impact on full builds is still rather significant. For example, when compiling aeson with -j1, we saw a reduction in total build time of ~16.6% (std.dev. 1.9pp) in our benchmarks.

The fact that the configure phase is inherently serial also means that these improvements have a notable impact when combined with the -jsem feature. This is because -jsem allows us to assign more capabilities to the build phase, which shrinks the time spent there. As per Amdahl’s law, the serial configure phase then accounts for a larger share of the total and becomes more of a bottleneck. For example, when compiling pandoc with cabal install pandoc -j16 --semaphore, we saw a reduction in total build time of ~10% (std.dev. 0.6pp).
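The Amdahl's law argument can be written down as a small idealised model: wall-clock time is a serial part plus a part that parallelises perfectly. The function names and the input figures below are made up for illustration; they are not measurements, and real builds do not parallelise perfectly, so the output should be read as a trend rather than a prediction.

```haskell
module Main (main) where

-- | Idealised wall-clock time: a serial part plus a perfectly parallel part.
wallClock :: Double -> Double -> Int -> Double
wallClock serial parallel jobs = serial + parallel / fromIntegral jobs

-- | Fraction of wall-clock time saved when only the serial part is
-- multiplied by @remaining@ (e.g. 0.1 for a 90% reduction).
fractionSaved :: Double -> Double -> Double -> Int -> Double
fractionSaved remaining serial parallel jobs =
  1 - wallClock (remaining * serial) parallel jobs
        / wallClock serial parallel jobs

main :: IO ()
main = mapM_ report [1, 4, 16]
  where
    -- Hypothetical workload: 10 units of serial configure work,
    -- 90 units of parallelisable compilation, configure cut by 90%.
    report jobs =
      putStrLn (show jobs ++ " jobs: " ++ show (fractionSaved 0.1 10 90 jobs))
```

In this model, the same absolute saving in the serial part is a larger fraction of the total the more jobs are available, which is the effect described above.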

Further improvements

These improvements provide a small glimpse of what is possible after our changes to cabal-install’s architecture. A more ambitious long-term goal would be for cabal-install to manage a “giant build graph” at a finer granularity than whole Cabal components. For example, if package q depends only on module P1 from package p, we could imagine starting to compile q after compiling P1 but before we have finished compiling the rest of p. This would unlock build-time reductions by increasing available parallelism, and also enable more accurate progress and error reporting.
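The p and q example can be sketched as a tiny scheduling model. Each node has a compile time and a list of dependencies, and we compute the earliest time each node can finish, assuming unlimited parallelism. The graph type, the module names P2, P3 and Q, and the unit compile times are all hypothetical; this is not how cabal-install represents its build graph.

```haskell
module Main (main) where

import qualified Data.Map as Map

type Node = String

-- | Build graph: node -> (compile time, dependencies).
type Graph = Map.Map Node (Double, [Node])

-- | Earliest finish time of every node, assuming unlimited parallelism.
-- The lazy map refers to itself, so each node is computed at most once.
-- The graph must be acyclic and every dependency must be a key.
finishTimes :: Graph -> Map.Map Node Double
finishTimes graph = times
  where
    times = Map.map finish graph
    finish (cost, deps) = cost + maximum (0 : [times Map.! d | d <- deps])

-- | Package p has modules P1 -> P2 -> P3; Q is the single module of q.
-- Package-level dependency: Q must wait for all of p.
packageLevel :: Graph
packageLevel = Map.fromList
  [ ("P1", (1, [])), ("P2", (1, ["P1"])), ("P3", (1, ["P2"]))
  , ("Q",  (1, ["P1", "P2", "P3"])) ]

-- | Module-level dependency: Q only needs P1.
moduleLevel :: Graph
moduleLevel = Map.insert "Q" (1, ["P1"]) packageLevel

main :: IO ()
main = do
  putStrLn ("package-level, Q finishes at: " ++ show (finishTimes packageLevel Map.! "Q"))
  putStrLn ("module-level,  Q finishes at: " ++ show (finishTimes moduleLevel Map.! "Q"))
```

With module-level edges, Q can be compiled alongside P2 and P3 instead of after them, which is where the extra parallelism would come from.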