Lightweight Checked Exceptions in Haskell

Friday, 31 July 2015, by Edsko de Vries, Adam Gundry.
Filed under coding.

Consider this function from the http-conduit library:

-- | Download the specified URL (..)
--
-- This function will 'throwIO' an 'HttpException' for (..)
simpleHttp :: MonadIO m => String -> m ByteString

Notice that part of the semantics of this function—that it may throw an HttpException—is encoded in a comment, which the compiler cannot check. This is because Haskell’s notion of exceptions offers no mechanism for advertising to the user the fact that a function might throw an exception.

Michael Snoyman discusses some solutions to this problem, as well as some common anti-patterns, in his blog post Exceptions Best Practices. However, wouldn’t it be much nicer if we could simply express in the type that simpleHttp may throw an HttpException? In this blog post I will propose a very lightweight scheme to do precisely that.

If you want to experiment with this yourself, you can download Checked.hs (tested with ghc 7.2, 7.4, 7.6, 7.8 and 7.10).

Throwing checked exceptions

Let’s introduce a type class for “checked exceptions” (à la Java):

-- | Checked exceptions
class Throws e where
  throwChecked :: e -> IO a

This looks simple enough, but here’s the rub: this will be a type class without any instances.

Once we have introduced Throws we can write something like

simpleHttp :: (MonadIO m, Throws HttpException) => String -> m ByteString
simpleHttp _ = liftIO $ throwChecked HttpException

And, unless we explicitly catch this exception, this annotation will now be propagated to every use site of simpleHttp:

useSimpleHttp :: Throws HttpException => IO ByteString
useSimpleHttp = simpleHttp "http://www.example.com"

Type annotations

There’s something a little peculiar about a type class constraint such as Throws HttpException: normally ghc will refuse to add a type class constraint for a known (constant) type. If you were to write

foo = throwChecked $ userError "Uhoh"

ghc will complain bitterly that

No instance for (Throws IOError)
    arising from a use of ‘throwChecked’
In the expression: throwChecked

until you give the type annotation explicitly (you will need to enable the FlexibleContexts language extension):

foo :: Throws IOError => IO a
foo = throwChecked $ userError "Uhoh"

I consider this a feature of this approach, not a bug: you are forced to declare explicitly which checked exceptions you throw.

Catching checked exceptions

In order to catch checked exceptions (and indeed, in order to make throwChecked actually do anything) we need to turn that throwChecked into an honest-to-goodness throwIO. We can do this by creating a type class instance at runtime (this requires RankNTypes and ScopedTypeVariables, and from ghc 7.10 and up also AllowAmbiguousTypes):

-- | Rethrow checked exceptions as unchecked (regular) exceptions
rethrowUnchecked :: forall e a. (Throws e => IO a) -> (Exception e => IO a)
rethrowUnchecked act = aux act throwIO
  where
    aux :: (Throws e => IO a) -> ((e -> IO a) -> IO a)
    aux = unsafeCoerce . Wrap

-- | Wrap an action that may throw a checked exception
--
-- This is used internally in 'rethrowUnchecked' to avoid impredicative
-- instantiation of the type of 'unsafeCoerce'.
newtype Wrap e a = Wrap (Throws e => IO a)

This method of creating type class instances at runtime is a well-known trick in the Haskell community, used for instance in Edward Kmett’s reflection package and Richard Eisenberg’s singletons package. Austin Seipp’s blogpost Reflecting values to types and back explains it in detail.

For us here the important thing is that rethrowUnchecked basically replaces uses of throwChecked with a regular throwIO, and hence we can catch the resulting “unchecked” exception as usual:

-- | Catch a checked exception
--
-- This is the only way to discharge a 'Throws' type class constraint.
catchChecked :: Exception e => (Throws e => IO a) -> (e -> IO a) -> IO a
catchChecked = catch . rethrowUnchecked

-- | 'catchChecked' with the arguments reversed
handleChecked :: Exception e => (e -> IO a) -> (Throws e => IO a) -> IO a
handleChecked handler act = catchChecked act handler
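
As a usage sketch, here is how a caller might discharge the constraint introduced by simpleHttp (the helper name getOrEmpty and its fall-back-to-empty behaviour are made up for illustration):

getOrEmpty :: String -> IO ByteString
getOrEmpty url = catchChecked (simpleHttp url) handler
  where
    -- Handling the exception here discharges the 'Throws HttpException'
    -- constraint, so it does not propagate to callers of 'getOrEmpty'.
    handler :: HttpException -> IO ByteString
    handler _ = return mempty  -- fall back to an empty ByteString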

Subclasses of exceptions

Suppose we had

readFile :: Throws IOException => FilePath -> IO String

then we can write a function to get a file either by reading a local file or by downloading it over HTTP:

get :: (Throws IOException, Throws HttpException)
    => String -> IO ByteString
get url = case removePrefix "file:" url of
            Just path -> readFile path
            Nothing   -> simpleHttp url

removePrefix :: Eq a => [a] -> [a] -> Maybe [a]
removePrefix = stripPrefix  -- from Data.List

Alternatively we can define a bespoke exception hierarchy and combine the two exceptions:

data SomeGetException = forall e. Exception e => SomeGetException e
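
For SomeGetException itself to be throwable and catchable as a regular exception it needs the usual Show and Exception instances; a minimal sketch, delegating show to the wrapped exception:

instance Show SomeGetException where
  show (SomeGetException e) = show e

instance Exception SomeGetException

With that in place we can funnel both specific exceptions into the common wrapper: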

wrapIO :: (Throws IOException      => IO a)
       -> (Throws SomeGetException => IO a)
wrapIO = handleChecked $ throwChecked . SomeGetException

wrapHttp :: (Throws HttpException    => IO a)
         -> (Throws SomeGetException => IO a)
wrapHttp = handleChecked $ throwChecked . SomeGetException

get :: Throws SomeGetException => String -> IO ByteString
get url = case removePrefix "file:" url of
            Just path -> wrapIO   $ readFile path
            Nothing   -> wrapHttp $ simpleHttp url

This kind of custom exception hierarchy is entirely standard; I just wanted to show it fits nicely into this approach to checked exceptions.

Caveat

There is one caveat to be aware of. Suppose we write

returnAction = return (simpleHttp "http://www.example.com")

Ideally we’d give this a type such as

returnAction :: IO (Throws HttpException => IO ByteString)
returnAction = return (simpleHttp "http://www.example.com")

But this requires impredicative types, which is still a no-go zone. Instead the type of returnAction will be

returnAction :: Throws HttpException => IO (IO ByteString)
returnAction = return (simpleHttp "http://www.example.com")

which has the Throws annotation on returnAction itself; this means we can make the annotation disappear by adding an exception handler to returnAction even though it’s never called (because returnAction itself never throws any exception).

returnAction' :: IO (IO ByteString)
returnAction' = catchChecked returnAction neverActuallyCalled
  where
    neverActuallyCalled :: HttpException -> IO (IO ByteString)
    neverActuallyCalled = undefined

This is somewhat unfortunate, but it occurs only infrequently and it’s not a huge problem in practice. If you do need to return actions that may throw exceptions, you can use a newtype wrapper such as Wrap that we used internally in rethrowUnchecked (for much the same reason):

returnAction :: IO (Wrap HttpException ByteString)
returnAction = return (Wrap $ simpleHttp "http://www.example.com")
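
To run the wrapped action later we unwrap it and discharge the constraint as before; a sketch (unWrap, useAction and the empty fallback are again made up for illustration):

unWrap :: Wrap e a -> (Throws e => IO a)
unWrap (Wrap act) = act

useAction :: IO ByteString
useAction = do
    -- The 'Throws' constraint only surfaces once we unwrap and run the action.
    act <- returnAction
    catchChecked (unWrap act) handler
  where
    handler :: HttpException -> IO ByteString
    handler _ = return mempty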

Of course you will probably want to define a datatype that is more meaningful for your specific application; for instance, see the definition of HttpClient in the Repository.Remote module, which defines something like

data HttpClient = HttpClient {
    httpClientGet :: Throws SomeRecoverableException => URI -> ...
  }

Conclusions

Of course, a type such as

simpleHttp :: (MonadIO m, Throws HttpException) => String -> m ByteString

does not tell you that this function can only throw HttpExceptions; it can still throw all kinds of unchecked exceptions, not least of which are asynchronous exceptions. But that’s okay: it can still be incredibly useful to track some exceptions through your code.

So there you have it: checked exceptions in Haskell, using nothing more than a type class without instances and a single carefully encapsulated unsafeCoerce, without introducing a special new kind of monad (such as in the control-monad-exception package) and without complicated type-level hacking as in the Checked Exception for Free blogpost.

Right now I’ve implemented this as a small module Checked.hs in the hackage-security library, but if people think this is a worthwhile approach perhaps it would be useful to release this as a separate tiny library.


Hackage Security Alpha Release

Wednesday, 08 July 2015, by Edsko de Vries.
Filed under cabal, community, industrial-haskell-group.

Well-Typed is very happy to announce the first alpha release of the Hackage Security library, along with integration into both cabal-install and the Hackage server and a tool for managing file-based secure repositories. This release is not yet ready for general use, but we would like to invite interested parties to download and experiment with the library and its integration. We expect a beta release running on the central Hackage server will soon follow.

Hackage Security and related infrastructure is a project funded by the Industrial Haskell Group to secure Hackage, the central Haskell package server. A direct consequence of this work is that we can have untrusted Hackage mirrors (mirror selection is directly supported by the library). A minor but important additional side goal is support for incremental updates of the central Hackage index (only downloading information about new packages, rather than all packages).

TL;DR: Hackage will be more secure, more reliable and faster, and cabal update should generally finish in seconds.

Security architecture

Security is notoriously difficult to get right, so rather than concocting something ad-hoc the Hackage Security framework is based on TUF, The Update Framework. TUF is a collaboration between security researchers at the University of Arizona, the University of Berkeley and the University of Washington, as well as various developers of the Tor project. It is a theory specifically designed for securing software update systems, and suits the needs of Hackage perfectly.

TUF covers both index signing and author signing. Index signing provides the means to verify that something we downloaded from a mirror is the same as what is available from the central server (along with some other security properties), thus making it possible to set up untrusted mirrors. It does not however deal with compromise of the central server. Author signing allows package authors to sign packages, providing a guarantee that the package you download is the one that the author uploaded. These two concerns are largely orthogonal, and the current project only adds support for index signing. Author signing will be the subject of a later project.

Very briefly, here is how it works. The bits in red refer to new features added as part of the Hackage Security work.

  1. Hackage provides a file 00-index.tar.gz (known as “the index”) which contains the .cabal files for all versions of all packages on the server. It is this file that cabal update downloads, and it is this file that cabal install uses to find out which packages are available and what their dependencies are.

    Hackage additionally provides a signed file /snapshot.json (“the snapshot”), containing a hash of the index. When cabal downloads the index it computes its hash and verifies it against the hash recorded in the snapshot. Since mirrors do not have the key necessary to sign the snapshot (“the snapshot key”), if the hash matches we know that the index we downloaded, and hence all files within, was the same as on the central server (this check is sketched in code just after this list).
  2. When you cabal install one or more packages, the index provides cabal with a list of packages and their dependencies. However, the index does not contain the packages themselves.

    The index does however contain package.json files for each version of each package, containing a hash for the .tar.gz of that package version. Since these files live in the index they are automatically protected by the snapshot key. When cabal downloads a package it computes the hash of the package and compares it against the hash recorded in the index; if it matches, we are guaranteed that the package is the same as the package on the central server, as the central server is the only one with access to the snapshot key.
  3. The client does not have built-in knowledge of the snapshot key. Instead, it can download /root.json (“the root metadata”) from the server, which contains the public snapshot key. The root metadata itself is signed by the root keys, of which the client does have built-in knowledge. The private root keys must be kept very securely (e.g. encrypted and offline).
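
Stripped of all detail, the check in step 1 boils down to comparing the hash of the downloaded index against the hash recorded in the already-verified snapshot. A schematic sketch, not the actual hackage-security code; the names and the choice of SHA256 here are assumptions:

import qualified Crypto.Hash.SHA256 as SHA256  -- from the cryptohash package
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL

-- The snapshot is signed by the central server, so if the computed hash
-- matches the recorded one, the downloaded index must be identical to the
-- index on the central server, no matter which mirror served it.
indexMatchesSnapshot :: BL.ByteString -> BS.ByteString -> Bool
indexMatchesSnapshot downloadedIndex hashInSnapshot =
    SHA256.hashlazy downloadedIndex == hashInSnapshot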

This description leaves out lots of details, but the purpose of this blog post is not to give a full overview of TUF. See the initial announcement or the website of The Update Framework for more details on TUF; the Hackage Security project README provides a very detailed discussion on how our implementation of TUF relates to the official TUF specification.

Software architecture

Most of the functionality is provided through a new library called hackage-security, available from github, designed to be used by clients and servers alike.

Client-side

Although we have integrated it in cabal-install, the hackage-security library is expressly designed to be useable by different clients as well. For example, it generalizes over the library to use for HTTP requests; cabal uses hackage-security-HTTP, a thin layer around the HTTP library. However, if a client such as stack wants to use the hackage-security library to talk to Hackage it may prefer to use hackage-security-http-client instead, a thin layer around the http-client library.

Using the library is very simple. After importing Hackage.Security.Client three functions become available, corresponding to points 1, 2 and 3 above:

checkForUpdates :: Repository -> CheckExpiry -> IO HasUpdates
downloadPackage :: Repository -> PackageId -> (TempPath -> IO a) -> IO a
bootstrap       :: Repository -> [KeyId] -> KeyThreshold -> IO ()

Some comments:

  1. checkForUpdates corresponds to point 1 above: it downloads and verifies the index, checking its hash against the one recorded in the snapshot.
  2. downloadPackage corresponds to point 2: it downloads a package, verifies it against the hash recorded in the index, and hands the verified file to the callback.
  3. bootstrap corresponds to point 3: it downloads the root metadata and verifies it against the specified root key IDs and key threshold.

We have written an example client that demonstrates how to use this API; the example client supports both local and remote repositories and can use HTTP, http-client or curl, and yet is only just over 100 lines of code.

Server-side

The server-side support provided by Hackage.Security.Server comes primarily in the form of datatypes corresponding to the TUF metadata, along with functions for constructing them.

It is important to realize that servers need not be running the Hackage software; mirrors of the central Hackage server may (and typically will) be simple HTTP file servers, and indeed company-internal package servers may choose not to use the Hackage software at all, using only file servers. We provide a hackage-security utility for managing such file-based repositories; see below for details.

Trying it out

There are various ways in which you can try out this alpha release, depending on what precisely you are interested in.

Resources at a glance

hackage-security library   github tag “pre-release-alpha”
cabal-install              github branch “using-hackage-security”
hackage                    github branch “using-hackage-security”

Secure Hackage snapshots

We provide two almost-complete secure (but mostly static) Hackage snapshots, located at

http://hackage.haskell.org/security-alpha/mirror1
http://hackage.haskell.org/security-alpha/mirror2

If you want to use cabal to talk to these repositories, you will need to download and build it from the using-hackage-security branch. Then change your cabal config and add a new section:

remote-repo alpha-snapshot
  url: http://hackage.haskell.org/security-alpha/mirror1
  secure: True
  root-keys: 89e692e45b53b575f79a02f82fe47359b0b577dec23b45d846d6e734ffcc887a
             dc4b6619e8ea2a0b72cad89a3803382f6acc8beda758be51660b5ce7c15e535b
             1035a452fd3ada87956f9e77595cfd5e41446781d7ba9ff9e58b94488ac0bad7
  key-threshold: 2

It suffices to point cabal to either mirror; TUF and the hackage-security library provide built-in support for providing clients with a list of mirrors. During the first check for updates cabal will download this list, and then use either mirror thereafter. Note that if you wish you can set the key-threshold to 0 and not specify any root keys; if you do this, the initial download of the root information will not be verified, but all access will be secure after that.

These mirrors are almost complete, because the first mirror has an intentional problem: the latest version of generics-sop does not match its signed hash (simulating an evil attempt from an attacker to replace the generics-sop library with DoublyEvil-0.3.142). If you attempt to cabal get this library cabal should notice something is amiss on this mirror, and automatically try again from the second mirror (which has not been “compromised”):

# cabal get generics-sop
Downloading generics-sop-0.1.1.2...
Selected mirror http://hackage.haskell.org/security-alpha/mirror1
Downloading package generics-sop-0.1.1.2
Exception Invalid hash for .../generics-sop-0.1.1.2.tar45887.gz
when using mirror http://hackage.haskell.org/security-alpha/mirror1
Selected mirror http://hackage.haskell.org/security-alpha/mirror2
Downloading package generics-sop-0.1.1.2
Unpacking to generics-sop-0.1.1.2/

(It is also possible to use the example client to talk to these mirrors, or indeed to a secure repo of your own.)

Setting up a secure file-based repo

If you want to experiment with setting up your own secure repository, the easiest way to do this is to set up a file based repository using the hackage-security utility. A file based repository (as opposed to one running the actual Hackage software) is much easier to set up and will suffice for many purposes.

  1. Create a directory ~/my-secure-repo containing a single subdirectory ~/my-secure-repo/package. Put whatever packages you want to make available from your repo in this subdirectory. At this point your repository might look like

    ~/my-secure-repo/package/basic-sop-0.1.0.5.tar.gz
    ~/my-secure-repo/package/generics-sop-0.1.1.1.tar.gz
    ~/my-secure-repo/package/generics-sop-0.1.1.2.tar.gz
    ~/my-secure-repo/package/json-sop-0.1.0.4.tar.gz
    ~/my-secure-repo/package/lens-sop-0.1.0.2.tar.gz
    ~/my-secure-repo/package/pretty-sop-0.1.0.1.tar.gz

    (because obviously the generics-sop packages are the first things to come to mind when thinking about which packages are important to secure.) Note the flat directory structure: different packages and different versions of those packages all live in the one directory.

  2. Create public and private keys:

    # hackage-security create-keys --keys ~/my-private-keys

    This will create a directory structure such as

    ~/my-private-keys/mirrors/id01.private
    ~/my-private-keys/mirrors/..
    ~/my-private-keys/root/id04.private
    ~/my-private-keys/root/..
    ~/my-private-keys/snapshot/id07.private
    ~/my-private-keys/target/id08.private
    ~/my-private-keys/target/..
    ~/my-private-keys/timestamp/id11.private

    containing keys for all the various TUF roles (proper key management is not part of this alpha release).

    Note that these keys are stored outside of the repository proper.

  3. Create the initial TUF metadata and construct an index using

    # hackage-security bootstrap \
                       --repo ~/my-secure-repo \
                       --keys ~/my-private-keys

    This will create a directory ~/my-secure-repo/index containing the .cabal files (extracted from the package tarballs) and TUF metadata for all packages

    ~/my-secure-repo/index/basic-sop/0.1.0.5/basic-sop.cabal
    ~/my-secure-repo/index/basic-sop/0.1.0.5/package.json
    ~/my-secure-repo/index/generics-sop/0.1.1.1/generics-sop.cabal
    ~/my-secure-repo/index/generics-sop/0.1.1.1/package.json
    ...

    and package the contents of that directory up as the index tarball ~/my-secure-repo/00-index.tar.gz; it will also create the top-level metadata files

    ~/my-secure-repo/mirrors.json
    ~/my-secure-repo/root.json
    ~/my-secure-repo/snapshot.json
    ~/my-secure-repo/timestamp.json
  4. The timestamp and snapshot are valid for three days, so you will need to re-sign these files regularly using

    # hackage-security update \
                       --repo ~/my-secure-repo \
                       --keys ~/my-private-keys

    You can use the same command whenever you add any additional packages to your repository.

  5. If you now make this directory available (for instance, by pointing Apache at it) you should be able to use cabal to access it, in the same way as described above for accessing the secure Hackage snapshots. You can either set key-threshold to 0, or else copy in the root key IDs from the generated root.json file.
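
For example, if you serve the directory at http://localhost:8080/ (a placeholder URL), the corresponding section in your cabal config might look like:

remote-repo my-secure-repo
  url: http://localhost:8080/
  secure: True
  key-threshold: 0

(With key-threshold set to 0 the initial download of the root metadata is not verified, exactly as described for the Hackage snapshots above; alternatively, list the root key IDs from root.json and set the threshold accordingly.)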

Setting up a secure Hackage server

If you are feeling adventurous you can also try to set up your own secure Hackage server. You will need to build Hackage from the using-secure-hackage branch.

You will need to create a subdirectory TUF inside Hackage’s datafiles/ directory, containing 4 files:

datafiles/TUF/mirrors.json
datafiles/TUF/root.json
datafiles/TUF/snapshot.private
datafiles/TUF/timestamp.private

containing the list of mirrors, the root metadata, and the private snapshot and timestamp keys. You can create these files using the hackage-security utility:

  1. Use the create-keys command as described above to create a directory with keys for all roles, and then copy over the snapshot and timestamp keys to the TUF directory.
  2. Use the create-root and create-mirrors commands to create the root and mirrors metadata. The create-mirrors command accepts an arbitrary list of mirrors to be added to the mirrors metadata, should you wish to do so.

Note that the root.json and mirrors.json files are served as-is by Hackage; they are not used internally. The snapshot and timestamp keys are of course used to sign the snapshot and the timestamp.

Once you have created and added these files, everything else should Just Work(™). When you start up your server it will create TUF metadata for any existing packages you may have (if you are migrating an existing database). It will create a snapshot and a timestamp file; create metadata for any new packages you upload and update the snapshot and timestamp; and re-sign the snapshot and timestamp nightly. You can talk to the repository using cabal as above.

If you have a Hackage server containing a lot of packages (a full mirror of the central Hackage server, for instance) then migration will be slow; it takes approximately an hour to compute hashes for all packages on Hackage. If this would lead to unacceptable downtime you can use the precompute-fileinfo tool to precompute hashes for all packages, given a recent backup. Copy the file created by this tool to datafiles/TUF/md5-to-sha256.map before doing the migration. If all hashes are precomputed migration only takes a few minutes for a full Hackage snapshot.

Integrating hackage-security

If you want to experiment with integrating the hackage-security library into your own software, the example client is probably the best starting point for integration in client software, and the hackage-security utility is probably a good starting point for integration in server software.

Bugs

Please report any bugs or comments you may have as GitHub issues.

Roadmap

This is an alpha release, intended for testing by people with a close interest in the Hackage Security work. The issue tracker contains a list of issues to be resolved before the beta release, at which point we will make the security features available on the central Hackage server and make a patched cabal available in a more convenient manner. Note that the changes to Hackage are entirely backwards compatible, so existing clients will not be affected.

After the beta release there are various other loose ends to tie up before the official release of Phase 1 of this project. After that Phase 2 will add author signing.


Dependencies for Cabal Setup.hs files and other goodies

Monday, 06 July 2015, by Duncan Coutts.
Filed under cabal, community, industrial-haskell-group.

No untracked dependencies!

Years ago, back when Isaac Potoczny-Jones and others were defining the Cabal specification, the big idea was to make Haskell software portable to different environments. One of the mantras was “no untracked dependencies!”.

The problem at the time was that Haskell code had all kinds of implicit dependencies which meant that while it worked for you, it wouldn’t build for me. For example, I might not have some other module that it needed, or the right version of the module.

So of course that’s what the build-depends in .cabal files is all about, requiring that the author of the code declare just what the code requires of its environment. The other important part is that the build system only lets your code see the dependencies you’ve declared, so that you don’t accidentally end up with these untracked dependencies.

This mantra of no untracked dependencies is still sound. If we look at a system like nix, part of what enables it to work so well is that it is absolutely fanatical about having no untracked dependencies.

Untracked dependencies?!

One weakness in the original Cabal specification is with Setup.hs scripts. These scripts are defined in the spec to be the entry point for the system. According to the Cabal spec, to build a package you’re required to compile the Setup.hs script and then use its command line interface to get things done. Because in the original spec the Setup.hs is the first entry point, it’s vital that it be possible to compile Setup.hs without any extra fuss (the runhaskell tool was invented just to make this possible, and to make it portable across compilers).

But by having the Setup.hs as the primary entry point, it meant that it’s impossible to reliably use external code in a Setup.hs script, because you cannot guarantee that that code is pre-installed. Going back to the “no untracked dependencies” mantra, we can see of course that all dependencies of Setup.hs scripts are in fact untracked!

This isn’t just a theoretical problem. Haskell users that do have complex Setup.hs scripts often run into versioning problems, or need external tools to help them get the pre-requisite packages installed. Or as another example: Michael Snoyman noted earlier this year in a diagnosis of an annoying packaging bug that:

As an aside, this points to another problematic aspect of our toolchain: there is no way to specify constraints on dependencies used in custom Setup.hs files. That’s actually caused more difficulty than it may sound like, but I’ll skip diving into it for now.

The solution: track dependencies!

As I said, the mantra of no untracked dependencies is still sound, we just need to apply it more widely.

These days the Setup.hs is effectively no longer a human interface: it is now a machine interface used by other tools like cabal or by distros’ install scripts. So we no longer have to worry so much about Setup.hs scripts always compiling out of the box. It would be acceptable now to say that the first entry point for a tool interacting with a package is the .cabal file, which might list the dependencies of the Setup.hs. The tool would then have to ensure that those dependencies are available when compiling the Setup.hs.

So this is exactly what we have now done. Members of the Industrial Haskell Group have funded us to fix this long standing problem and we have recently merged the solution into the development version of Cabal and cabal-install.

From a package author’s point of view, the solution looks like this: in your .cabal file you can now say:

build-type: Custom

custom-setup
  setup-depends: base >= 4.6,
                 directory >= 1.0,
                 Cabal >= 1.18 && < 1.22,
                 acme-setup-tools == 0.2.*

So it’s a new stanza, like libraries or executables, and like these you can specify the library dependencies of the Setup.hs script.

Now tools like cabal will compile the Setup.hs script with these and only these dependencies, just like it does normally for executables. So no more untracked dependencies in Setup.hs scripts. Newer cabal versions will warn about not using this new section. Older cabal versions will ignore the new section (albeit with a warning). So over time we hope to encourage all packages with custom setup scripts to switch over to this.

In addition, the Setup.hs script gets built with CPP version macros (MIN_VERSION_{pkgname}) available so that the code can be made to work with a wider range of versions of its dependencies.
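
For instance, a custom Setup.hs might select between Cabal APIs like this (a sketch; both branches just call defaultMain here, standing in for real version-dependent code):

{-# LANGUAGE CPP #-}
module Main (main) where

import Distribution.Simple (defaultMain)

main :: IO ()
main =
#if MIN_VERSION_Cabal(1,18,0)
    -- code written against the Cabal >= 1.18 API would go here
    defaultMain
#else
    -- fallback for older Cabal versions
    defaultMain
#endif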

In the solver…

So on the surface this is all very simple and straightforward, a rather minor feature even. In fact it’s been remarkably hard to implement fully for reasons I’ll explain, but the good news is that it works and the hard work has also gotten us solutions to a couple other irksome problems.

Firstly, why isn’t it trivial? It’s inevitable that sooner or later you will find that your application depends on one package that has setup deps like Cabal == 1.18.* and another with setup deps like Cabal == 1.20.*. At that point we have a problem. Classically we aim to produce a build plan that uses at most one version of each package. We do that because otherwise there’s a danger of type errors from using multiple versions of the same package. Here with setup dependencies there is no such danger: it’s perfectly possible for me to build one setup script with one version of the Cabal library and another script with a different Cabal version. Because these are executables and not libraries, the use of these dependencies does not “leak”, and so we would be safe to use different versions in different places.

So we have extended the cabal solver to allow for limited controlled use of multiple versions of the same package. The constraint is that all the “normal” libraries and exes all use the same single version, just as before, but setup scripts are allowed to introduce their own little world where independent choices about package versions are allowed. To keep things sane, the solver tries as far as possible not to use multiple versions unless it really has to.

If you’re interested in the details in the solver, see Edsko’s recent blog post.

Extra goodies

This work in the solver has some extra benefits.

Improve Cabal lib API without breaking everything

In places the Cabal library is a little crufty, and the API it exposes was never really designed as an API. It has been very hard to fix this because changes in the Cabal library interface break Setup.hs scripts, and there was no way for packages to insulate themselves from this.

So now that we can have packages have proper dependencies for their custom Setup.hs, the flip side is that we have an opportunity to make breaking changes to the Cabal library API. We have an opportunity to throw out the accumulated cruft, clean up the code base and make a library API that’s not so painful to use in Setup.hs scripts.

Shim (or compat) packages for base

Another benefit is that the new solver is finally able to cope with having “base shim” packages, as we used in the base 3.x to 4.x transition. For two GHC releases, GHC came with both base-3.x and base-4.x. The base-4 was the “true” base, while the base-3 was a thin wrapper that re-exported most of base-4 (and syb), but with some changes to implement the old base-3 API. At the time we adapted cabal to cope with this situation of having two versions of a package in a single solution.

When the new solver was implemented, however, support for this situation was not added (and the old solver implementation was retained to work with GHC 6.12 and older).

This work for setup deps has made it relatively straightforward to add support for these base shims. So next time GHC needs to make a major bump to the version of base then we can use the same trick of using a shim package. Indeed this might also be a good solution in other cases, perhaps cleaner than all these *-compat packages we’ve been accumulating.

It has also finally allowed us to retire the old solver implementation.

Package cycles involving test suites and benchmarks

Another feature that is now easy to implement (though not actually implemented yet) is dealing with the dependency cycles in packages’ test suites and benchmarks.

Think of a core package like bytestring, or even less core like Johan’s cassava csv library. These packages have benchmarks that use the excellent criterion library. But of course criterion is a complex beast and itself depends on bytestring, cassava and a couple dozen other packages.

This introduces an apparent cycle and cabal will fail to find an install plan. I say apparent cycle because there isn’t really a cycle: it’s only the benchmark component that uses criterion, and nothing really depends on that.

Here’s another observation: when benchmarking a new bytestring or cassava, it does not matter one bit that criterion might be built against an older stable version of bytestring or cassava. Indeed it’s probably sensible that we use a stable version. It certainly involves less rebuilding: I don’t really want to rebuild criterion against each minor change in bytestring while I’m doing optimisation work.

So here’s the trick: we break the cycle by building criterion (or say QuickCheck or tasty) against another version of bytestring, typically some existing pre-installed one. So again this means that our install plan has two versions of bytestring in it: the one we mean to build, and the one we use as a dependency for criterion. And again this is ok, just as with setup dependencies, because dependencies of test suites and benchmarks do not “leak out” and cause diamond dependency style type errors.

One technical restriction is that the test suite or benchmark must not depend on the library within the same package, but must instead use the source files directly. Otherwise there would genuinely be a cycle.

Now in general when we have multiple components in a .cabal file we want them to all use the same versions of their dependencies. It would be deeply confusing if a library and an executable within the same package ended up using different versions of some dependency that might have different behaviour. Cabal has always enforced this, and we’re not relaxing it now. The rule is that if there are dependencies of a test suite or benchmark that are not shared with the library or executable components in the package, then we are free to pick different versions for those than we might pick elsewhere within the same solution.

As another example – that’s nothing to do with cycles – we might pick different versions of QuickCheck for different test suites in different packages (though only where necessary). This helps with the problem that one old package might require QuickCheck == 2.5.* while another requires QuickCheck == 2.8.*. But it’d also be a boon if we ever went through another major QC-2 vs QC-3 style of transition. We would be able to have both QC-2 and QC-3 installed and build each package’s test suite against the version it requires, rather than freaking out that they’re not the same version.

Private dependencies in general

Technically, this work opens the door to allowing private dependencies more generally. We’re not pursuing that at this stage, in part because it is not clear that it’s actually a good idea in general.

Mark Lentczner has pointed out the not-unreasonable fear that once you allow multiple versions of packages within the same solution it will in practice become impossible to re-establish the situation where there is just one version of each package, which is what distros want and what most people want in production systems.

So that’s something we should consider carefully as a community before opening those flood gates.


Summer School on Generic and Effectful Programming

Wednesday, 10 June 2015, by Andres Löh.
Filed under training, well-typed.

I’m one of the lecturers at

Summer School on Generic and Effectful Programming

St Anne’s College, Oxford, 6th to 10th July 2015

(Register here)

Datatype-generic programming was the topic of my PhD thesis many years ago, and it has continued to be a fascinating field of work and research for me since then.

At the upcoming summer school, I will give a three-lecture course on Applying Type-level and Generic Programming in Haskell. In this course, I will describe the state-of-the-art of datatype-generic programming in Haskell/GHC. This means we’ll look at the GHC extension that allows you to generically derive your own type classes, but also at the relatively recent generics-sop library. We will discuss the GHC type system features that make all of this possible, such as data kinds, kind polymorphism, GADTs, higher-rank types, constraint kinds and more, and we will look at a number of real-world applications of generic programming, taken, e.g., from the areas of web programming and databases.

But my course is only one of many. Ralf Hinze, the main organizer, has done an outstanding job and assembled a fantastic lineup of lecturers: I’m honoured to be teaching alongside Edwin Brady, Fritz Henglein, Conor McBride, Don Syme and Tarmo Uustalu. I am sure I will learn a lot from them and their lectures.

If you always wanted to learn more about generic and effectful programming, this is your chance! You can still register for the school! I’d be happy to see you there.


Cabal & Hackage hacking at ZuriHac

Monday, 01 June 2015, by Duncan Coutts.
Filed under cabal, community.

At ZuriHac this weekend we had eight people hacking on Cabal or Hackage, many of whom are new contributors. There were a number of projects started as well as a number of smaller fixes completed.

In addition, there are three Google Summer of Code students working on Cabal and Hackage projects this summer.

They’re all just getting started, so more news about them later. All in all there seems to be a decent amount of progress at the moment across a range of issues. In particular we’re getting closer to solving some of the thornier “Cabal Hell” problems.

Heroic bug squashing

Oleg Grenrus was a bit of a hero in that as a new Cabal contributor, over two days of the hackathon, he managed to send in pull requests to fix five open tickets.

Another couple chaps (whose names to my shame have slipped my mind) dived in to fix old tickets on sanity checking absolute/relative paths for library directories in .cabal files and config files, and on passing GHC env vars to sub-commands like in cabal run/exec/test.

These, in addition to the flurry of pull requests in recent weeks and others from the hackathon, have given the regular Cabal hackers quite a pile of patches to get through. We hope to review and merge them in the next week or so.

Integrating package security for Cabal/Hackage

The work on securing the package download process that we announced a while ago is nearing the integration phase. While it’s been useful to have a couple people concentrate on implementing the core hackage-security library, at this stage it makes sense to open the process up and get more people involved.

Matthias Fischmann had proposed it as a ZuriHac project and organised a group of people who were interested. We discussed some of the issues involved with using the new hackage-security code in the cabal-install tool, and got started on some of the tasks.

Bootstrapping repository security

With public key crypto systems there’s always a need to somehow bootstrap the trust chains. For example with the public web certificate system (used by TLS / HTTPS) the root of the trust chains is the certificate authorities. We must know and trust their public keys to be able to verify the chain of trust for any particular website. But how should we establish trust in the certificate authorities’ keys in the first place? With web browsers this bootstrapping problem is resolved by the browser (or OS) shipping pre-installed with all the CA public keys.

For hackage servers we face a similar bootstrapping problem. Hackage security does not use public certificate authorities but there is a similar root of trust in the form of a set of root keys. For the central community hackage.haskell.org we can of course do the same thing as the web browsers and ship the server’s public root keys with cabal-install. But we need to support people making their own repositories and obviously we can’t ship all the public keys. So we need a convenient way for people to configure cabal-install to establish trust in a particular repository. The obvious thing is to specify the trusted public keys in the cabal configuration, where you specify the repository to use.

Currently in a cabal configuration file that part looks like:

remote-repo: hackage.haskell.org:http://hackage.haskell.org/

This syntax is too limited to support adding extra attributes like keys. So what people were working on at ZuriHac was supporting something like this:

remote-repo hackage.haskell.org
  url: http://hackage.haskell.org/
  keys: ed25519:9fc1007af2baff7088d082295e755102c1593cdd24b5282adbfa0613f30423f6
        ed25519:7cd11f018d5211f49b2fb965f18577d7f45c0c9be2a79f5467e18d0105ac1feb
        ed25519:26443e74981d5b528ef481909a208178371173ff7ccee8009d4ebe82ddb09e1e

So someone hosting their own hackage repo can provide instructions with a sample cabal.config or a block of text like the above to copy and paste into their config file, for people to use to get started.

This more flexible syntax will also allow miscellaneous other repository settings such as specific mirrors to use, or the ability to turn off security entirely.

Mirroring

Another couple people got started on writing a mirror client using the hackage-security library. While mirroring does not need a dedicated tool it is a rather convenient and efficient approach. It means we can use the ordinary HTTP interface rather than providing rsync or another interface and we can still do very bandwidth-efficient synchronisation. The advantage over a generic HTTP mirroring tool is that we have an index of the available packages and we know that existing packages are immutable, so we can simply diff the source and target indexes and copy over the extra packages.

In fact there are already two hackage mirror clients that do this. One of them pulls from one repo and pushes to a “smart” hackage-server. The other pulls from a repo and pushes to a repo hosted via S3. What is missing is the ability to mirror to a simple local set of files. Mirrors don’t have to be full hackage-server instances or S3, they can be ordinary HTTP servers like Apache or nginx that point at a set of files in the right layout. The hackage-security library is a convenient tool to use to write this kind of mirror since it handles all the details of the repository layout, and it supports doing incremental updates of the repository index. In this use case the security checks are merely sanity checks, as in the end, clients downloading from this mirror do their own checks.
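
The core of such a mirror client is then little more than a set difference; a sketch under these assumptions (all names hypothetical):

import qualified Data.Set as Set

-- Copy over exactly the packages present in the source repo but missing
-- from the target. Since packages are immutable, anything already present
-- in the target never needs to be transferred again.
syncPackages :: Ord pkg => Set.Set pkg -> Set.Set pkg -> (pkg -> IO ()) -> IO ()
syncPackages source target copyPackage =
    mapM_ copyPackage (Set.toList (source `Set.difference` target))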

So the work started by taking the existing hackage-server mirror and hackage-security demo client with the goal of replacing (or extending) the guts of the mirror client to use the hackage-security lib to download and to be able to manage a target repo as a set of local files.

Once the security work is integrated it will become much more useful to have public mirrors because clients then don’t need to trust the mirrors (we’re safe from MITM attacks). And hackage will distribute a list of public mirrors that the clients will use automatically. So having a decent mirroring client will become rather important. It’s also useful for the synchronisation to be very efficient so that the public mirrors can be nearly live copies.

Solving the cabal sandbox / global packages problem

A problem that people have been complaining about recently is that the Haskell Platform ships with lots of packages in the global package database, making it hard to compile packages that require non-standard versions of the platform packages.

This isn’t really a problem with the Haskell Platform at all, it’s really a problem with how cabal-install constructs its sandboxes, and fortunately it’s one that seems relatively easy to fix. Good progress was made on this ticket over the hackathon and hopefully it will be completed within the next couple weeks.

The problem is that cabal sandbox init sets up an environment with a package database stack consisting of the global one plus a new empty local one. This means all the global packages are implicitly inside the sandbox already. That’s not so useful when you want to start with a minimal sandbox.

Originally this was a GHC limitation: we always had to use the global package DB. However, that has been fixed for a couple of GHC releases now. So the solution we went for is to use only a local empty package DB, and to copy the registration information for a certain set of core packages into the local package DB. Ultimately we would like GHC to come supplied with the list of core packages, but for now we can just hard-code the list.
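
Conceptually, the copying step amounts to something like the following, repeating the describe/register pair for each core package (a manual sketch using ghc-pkg; cabal automates this, and the DB path shown is a placeholder):

# ghc-pkg init .cabal-sandbox/packages.conf.d
# ghc-pkg describe base --global > base.conf
# ghc-pkg register base.conf --package-db=.cabal-sandbox/packages.conf.d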

Improving the tagging feature on Hackage

One new contributor started work on reimplementing the Hackage website’s tagging feature to make it more flexible and useful. The key idea is to make package categories into tags and make it easier to curate tags by solving the problem that lots of tags are essentially aliases for each other. This happens because each package author picks their tags themselves. So we will have sets of tag aliases, each with a canonical representative. Then any package using any alias will be assigned the canonical tag. The data model and user interface will make it possible for trustees to decide which existing tags ought to be aliased together and then do it. Ultimately, the tags and aliases should be useful in the existing hackage search.

Supporting curated package collections in Cabal and Hackage

Curated package collections are one of the two major parts to solving Cabal Hell.

People started work on supporting these natively in Cabal and Hackage. The idea is that proper integration will make them easier to use, more flexible and easier for people to make and distribute curated collections. Examples of curated collections include stackage (LTS and nightly snapshots) and the sets of versions distributed by Linux distros. Integration will allow simpler and shorter configurations, easier switching between collections and the ability to easily define collections either to distribute on Hackage or to use locally. By teaching cabal about collections it can give better error messages (e.g. when something cannot be installed because it’s not consistent with the collection(s) in use). Making collections easier to distribute via Hackage, and easier to combine, might open up new possibilities. For example we might see collections specifically to help people work with the popular web stacks (e.g. if those cannot always fit into a large general purpose collection). Or we might see collections of things you might like to avoid such as deprecated or known-broken packages. Combining collections would then allow you to configure cabal to use a large collection intersected with the negation of the collection of deprecated packages.


Parametricity Tutorial (Part 1)

Saturday, 23 May 2015, by Edsko de Vries.
Filed under coding.

A powerful feature of Haskell’s type system is that we can deduce properties of functions by looking only at their type. For example, a function of type

f :: ∀a. a -> a

can only be the identity function: since it must return something of type a, for any type a, the only thing it can do is return the argument of type a that it was given (or crash). Similarly, a function of type

f :: ∀a. a -> a -> a

can only do one of two things: either return the first argument, or return the second. This kind of reasoning is becoming more and more important with the increasing use of types such as this definition of a “lens”:

type Lens s t a b = forall f. Functor f => (a -> f b) -> s -> f t

Since a lens is just a function of a particular type, the only thing we can conclude about such a function is whatever we can deduce from its type.

To reason about the properties of functions based only on their types we make use of the theory of parametricity, which tells us how to derive a so-called “free theorem” from a type. This blog post is a tutorial on how to do this; it won’t explain why the theory works, only how it works. A Haskell practitioner’s guide to parametricity, if you will.

This is a two-part blog post. In part 1 (this post) we will cover the basics: constant types, functions and polymorphism (over types of kind *).

In part 2 (to be published shortly) we will deal with more advanced material: type constructors (types of kind * -> *), type classes, polymorphism over type constructors and type constructor classes.

The Basics

The main theorem of parametricity is the following:

if f :: t then f ℛ(t) f

When t is a closed type, ℛ(t) is a relation between two terms of type t (we shall see later that the type of ℛ is actually slightly more general). In words, parametricity states that any term f of type t is related to itself by ℛ(t). Don’t worry if this all looks incredibly abstract! We shall see lots and lots of examples.

Constant types (types of kind *)

For any constant type C, the relation ℛ(C) is the identity relation. In other words,

     x ℛ(C) x'
iff  x ≡ x'

(We will use ≡ throughout to mean mathematical equality, to distinguish it from Haskell’s equality function (==).)

Let’s see an example. Suppose that x :: Int. Then parametricity tells us that

     x ℛ(Int) x
iff  x ≡ x

I.e., it tells us that x is equal to itself. Not very insightful! Intuitively, this makes sense: if all we know about x is that it is an integer, we cannot tell anything about its value.

TOOLING. Many of the examples in this blog post (though sadly not all) can also be auto-derived by one of two tools. On the #haskell IRC channel we can ask lambdabot to derive free theorems for any types not involving type classes or type constructor classes. If you ask

@free x :: Int

lambdabot will reply with

x = x

(I recommend starting a private conversation with lambdabot so you avoid spamming the whole #haskell channel.)

Alternatively, you can also try the online free theorem generator. This free theorem generator is a bit more precise than lambdabot (which takes some shortcuts sometimes), and supports type classes, but cannot work with type constructors (lambdabot can work with unknown type constructors but not with quantification over type constructors, unfortunately).

Functions

For functions we map related arguments to related results:

     f ℛ(A -> B) f'
iff  forall x, x'.
       if x ℛ(A) x' then f x ℛ(B) f' x'

(The types of x and x' here depend on what precisely A is; see The type of ℛ, below.)

Example. Suppose f :: Int -> Bool. By parametricity

     f ℛ(Int -> Bool) f
iff  forall x :: Int, x' :: Int.
       if x ℛ(Int) x' then f x ℛ(Bool) f x'
-- both Int and Bool are constant types
iff  forall x :: Int, x' :: Int.
       if x ≡ x' then f x ≡ f x'
-- simplify
iff  f ≡ f

Again, not very informative. Parametricity doesn’t tell us anything about functions between constant types. Time to look at something more interesting!

Polymorphism (over types of kind *)

The definition for polymorphic values is

     f ℛ(∀a. t) f'
iff  forall A, A', a :: A ⇔ A'.
       f@A ℛ(t) f'@A'           -- where 'a' can occur free in t

That is, whenever we pick two types A and A', and some relation a between A and A', the function f@A obtained by instantiating the type variable by A must be related to the function f'@A' obtained by instantiating the type variable by A'. In what follows we will write explicit type instantiation like this only if the type is not clear from the context; specifically, we will omit it when we supply arguments to the function.

The type of ℛ.

∀ab. a -> b -> a is an example of a closed type: all type variables are bound by a universal quantifier. An open type is a type with free type variables such as ∀b. a -> b -> a or even a -> b -> a. (Note that this distinction is harder to see in Haskell where universal quantifiers are often implicit. We will not follow that convention in this article.)

We said in the introduction that if t is a closed type, ℛ(t) relates two terms of type t. As we saw, in order to be able to give a meaning to open types we need a mapping from any free variable a to a relation a :: A ⇔ A'. In this article we somewhat informally maintain this mapping simply by using the same name for the type variable and the relation.

Given two relations a :: A ⇔ A' and b :: B ⇔ B', ℛ(a -> b -> a) relates terms of type A -> B -> A with terms of type A' -> B' -> A'. It is important to realize that ℛ can therefore relate terms of different types. (For a more precise treatment, see my Coq formalization of a part of this blog post.)

The interpretation of ℛ for a free type variable a is defined in terms of the corresponding relation:

     x ℛ(a) x'     -- the type variable
iff  (x, x') ∈ a    -- the relation

Example: ∀a. a -> a

Let’s consider a number of examples, starting with an f :: ∀a. a -> a:

     f ℛ(∀a. a -> a) f
-- parametricity
iff  forall A, A', a :: A ⇔ A'.
       f@A ℛ(a -> a) f@A'
-- definition for function types
iff  forall A, A', a :: A ⇔ A', x :: A, x' :: A'.
       if x ℛ(a) x' then f x ℛ(a) f x'

It might not be immediately evident from the last line what this actually allows us to conclude about f, so let’s look at this a little closer. A function g is a special kind of relation, relating any argument x to g x; since the property holds for any kind of relation a : A ⇔ A', it must also hold for a function a⃯ :: A -> A':

     forall x, x'.
       if x ℛ(a⃯) x' then f x ℛ(a⃯) f x'
-- x ℛ(a⃯) x' iff a⃯ x ≡ x'
iff  forall x :: A, x' :: A'.
       if a⃯ x ≡ x' then a⃯ (f x) ≡ f x'
-- simplify
iff  forall x :: A.
       a⃯ (f x) ≡ f (a⃯ x)

We can apply this result to show that any f :: ∀a. a -> a must be the identity function: picking a⃯ = const x, we get const x (f x) ≡ f (const x x), i.e. x ≡ f x, as required.

NOTE. We are doing fast and loose reasoning in this tutorial and will completely ignore any totality issues. See Automatically Generating Counterexamples to Naive Free Theorems, or the associated web interface, for some insights about what wrong conclusions we can draw by ignoring totality.

Example: ∀a. a -> a -> a

Intuitively, there are only two things a function f :: ∀a. a -> a -> a can do: it can either return its first argument, or it can return its second argument. What does parametricity tell us? Let’s see:

     f ℛ(∀a. a -> a -> a) f
iff  forall A, A', a :: A ⇔ A'.
       f@A ℛ(a -> a -> a) f@A'
-- applying the rule for functions twice
iff  forall A, A', a :: A ⇔ A', x :: A, x' :: A', y :: A, y' :: A'.
       if x ℛ(a) x', y ℛ(a) y' then f x y ℛ(a) f x' y'

Let’s again specialize the last line to pick a function a⃯ :: A -> A' for the relation a:

     forall x :: A, x' :: A', y :: A, y' :: A'.
       if x ℛ(a⃯) x', y ℛ(a⃯) y' then f x y ℛ(a⃯) f x' y'
-- a⃯ is a function
iff  forall x :: A, x' :: A', y :: A, y' :: A'.
       if a⃯ x ≡ x' and a⃯ y ≡ y' then a⃯ (f x y) ≡ f x' y'
-- simplify
iff  forall x :: A, y :: A.
       a⃯ (f x y) ≡ f (a⃯ x) (a⃯ y)

So parametricity allows us to “push in” or “distribute” a⃯ over f. The fact that f must either return its first or its second argument follows from parametricity, but not in a completely obvious way; see the reddit thread How to use free theorems.
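
In outline, the argument runs as follows (a sketch, gliding over the totality issues mentioned above). Let z ≡ f@Bool True False, so that z is either True or False. Suppose z ≡ True; then for any type A and any x, y :: A we can pick the function a⃯ :: Bool -> A with a⃯ True ≡ x and a⃯ False ≡ y, and the free theorem gives

  a⃯ (f True False) ≡ f (a⃯ True) (a⃯ False)

i.e. x ≡ f x y, so f returns its first argument. The case z ≡ False is symmetric.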

Example: ∀ab. a -> b

Other than undefined (which we are ignoring), there can be no function f :: ∀ab. a -> b. Let’s suppose that one did exist; what does parametricity tell us?

     f ℛ(∀ab. a -> b) f
-- applying the rule for universal quantification, twice
iff  forall A, A', B, B', a :: A ⇔ A', b :: B ⇔ B'.
       f@A,B ℛ(a -> b) f@A',B'
-- applying the rule for functions
iff  forall A, A', B, B', a :: A ⇔ A', b :: B ⇔ B', x :: A, x' :: A'.
       if x ℛ(a) x' then f x ℛ(b) f x'

Picking two functions a⃯ :: A -> A' and b⃯ :: B -> B' for a and b, we get

b⃯ . f ≡ f . a⃯

It’s not too difficult to derive contradiction from this (remember that you can pick any two functions a⃯ and b⃯ between any types of your choosing; for instance, picking A, A', B, B' all equal to Bool, a⃯ = id and b⃯ = not gives not . f ≡ f, which is impossible). Hence, such a function cannot exist.

Example: ∀ab. (a -> b) -> a -> b

The only thing a function of this type can do is apply the supplied function to the supplied argument (alternatively, if you prefer, this must be the identity function). Let’s spell this example out in a bit of detail because it is our first example of a higher order function.

     f ℛ(∀ab. (a -> b) -> a -> b) f
-- apply rule for polymorphism, twice
iff  forall A, A', B, B', a :: A ⇔ A', b :: B ⇔ B'.
       f@A,B ℛ((a -> b) -> a -> b) f@A',B'
-- apply rule for functions, twice
iff  forall A, A', B, B', a :: A ⇔ A', b :: B ⇔ B'.
     forall g :: A -> B, g' :: A' -> B', x :: A, x' :: A'.
       if g ℛ(a -> b) g' and x ℛ(a) x' then f g x ℛ(b) f g' x'

Let’s expand what that premise g ℛ(a -> b) g' means:

     g ℛ(a -> b) g'
iff  forall y :: A, y' :: A'.
       if y ℛ(a) y' then g y ℛ(b) g' y'

For the special case that we pick functions a⃯ :: A -> A' and b⃯ :: B -> B' for a and b, that premise collapses to

     forall y :: A, y' :: A'.
       if y ℛ(a⃯) y' then g y ℛ(b⃯) g' y'
-- a⃯ and b⃯ are functions
iff  forall y :: A, y' :: A'.
       if a⃯ y ≡ y' then b⃯ (g y) ≡ g' y'
-- simplify
iff  forall y :: A.
       b⃯ (g y) ≡ g' (a⃯ y)
-- simplify (extensionality)
iff  b⃯ . g ≡ g' . a⃯

So that the free theorem for f :: ∀ab. (a -> b) -> a -> b becomes

if b⃯ . g ≡ g' . a⃯ then b⃯ . f g ≡ f g' . a⃯

Picking a⃯ = const x, b⃯ = const (g x), and g' = g (verify that the premise holds), the conclusion becomes const (g x) . f g ≡ f g . const x; applying both sides to an arbitrary argument gives g x ≡ f g x, and hence indeed g ≡ f g, as expected.

Useful shortcut. This pattern is worth remembering:

      g ℛ(a⃯ -> b⃯) g'
iff   b⃯ . g ≡ g' . a⃯
whenever a⃯ and b⃯ are function(al relation)s.

Example: ∀ab. (∀c. c -> String) -> a -> b -> String

A function f :: ∀ab. (∀c. c -> String) -> a -> b -> String is not only higher order, but has a rank-2 type: it insists that the function it is given is itself polymorphic. This makes it possible to write, for example

f g x y = g x ++ g y
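
As a complete program this needs the RankNTypes extension; here is a self-contained sketch, where const "!" is just one possible (constant) choice of g:

{-# LANGUAGE RankNTypes #-}

f :: forall a b. (forall c. c -> String) -> a -> b -> String
f g x y = g x ++ g y

main :: IO ()
main = putStrLn (f (const "!") True 'x')  -- prints "!!"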

Note that since x and y have different types, it is important that g is polymorphic. What is the free theorem for f?

     f ℛ(∀ab. (∀c. c -> String) -> a -> b -> String) f
-- apply rule for polymorphism, twice
iff  forall A, A', B, B', a :: A ⇔ A', b :: B ⇔ B'.
       f@A,B ℛ((∀c. c -> String) -> a -> b -> String) f@A',B'
-- apply rule for functions three times, and simplify ℛ(String) to (≡)
iff  forall A, A', B, B', a :: A ⇔ A', b :: B ⇔ B'.
     forall g :: ∀c. c -> String, g' :: ∀c. c -> String.
     forall x :: A, x' :: A', y :: B, y' :: B'.
       if
         g ℛ(∀c. c -> String) g', x ℛ(a) x', y ℛ(b) y'
       then
         f g x y ≡ f g' x' y'

Specializing this theorem to functions a⃯ :: A -> A' and b⃯ :: B -> B' we get

forall A, A', B, B', a⃯ :: A -> A', b⃯ :: B -> B'.
forall g :: ∀c. c -> String, g' :: ∀c. c -> String.
forall x :: A, y :: B.
  if
    g ℛ(∀c. c -> String) g'
  then
    f g x y ≡ f g' (a⃯ x) (b⃯ y)

But that is somewhat surprising, because it seems to say that the values of x and y cannot matter at all! What is going on? Expanding the first premise:

     g ℛ(∀c. c -> String) g'
iff  forall C, C', c :: C ⇔ C'.
       g ℛ(c -> String) g'
iff  forall C, C', c :: C ⇔ C', z :: C, z' :: C'.
       if z ℛ(c) z' then g z ≡ g' z'

Let’s stop for a moment to ponder what this requirement for g and g' really says: given any relation c, and any elements z and z' that are related by c—in other words, for any z and z' at all—we must have that g z and g' z' give us equal results. This means that g and g' must be constant functions, and the same constant function at that. As a consequence, for any function f of the above type, f g must itself be constant in x and y. In part two we will see a more useful variation which uses the Show type class.

Incidentally, note that this quantification over an arbitrary relation c is a premise to the free theorem, not a conclusion; hence we cannot simply choose to consider only functions c.

TOOLING. Indeed, if you enter

(forall c. c -> String) -> a -> b -> String

in the online free theorem generator you will see that it first gives the free theorem using relations only, and then says that it will reduce all “permissible” relation variables to functions; in this example, that is all relations except for c. (lambdabot doesn’t make this distinction and always reduces relations to functions, which is not correct.)

To be continued

In part 2 (to be published shortly) we will cover type constructors, type classes and type constructor classes. Meanwhile, here are some links to papers on the subject if you want to read more.


Recent Hackage improvements

Monday, 18 May 2015, by Duncan Coutts.
Filed under cabal, community.

You may or may not have noticed, but over the last few months we’ve had a number of incremental improvements to the Hackage site, with patches contributed by numerous people.

I’m very pleased that we’ve had contributions from so many people recently. Apart from one patch that took a long time to merge, we’ve generally been reasonably good at getting patches merged. Currently there’s just one outstanding pull request on hackage-server’s github site.

I gave a talk a couple of months ago at the London Haskell User Group about getting started with hacking on Cabal and Hackage. Unfortunately the video isn’t available yet. (I’ll try to chase that up and link it here later.)

An idea we floated at that talk was to run a couple of hackathons dedicated to these and related infrastructure projects. If you want to help organise, or have a venue in London, please get in touch. If you can’t get to London, fear not, as we’d also welcome people attending online. Of course there’s also the upcoming ZuriHac, where I expect there will be plenty of infrastructure work going on.

If you do want to get involved, the github site is the place to start. Discussion of features happens partly in issues on github and partly in the #hackage channel on IRC, so those are good places to get feedback if you decide to start working on a bug or feature.

Recent changes

Visible changes

A few boring but important ones

Miscellaneous small changes


Improving Hackage security

Thursday, 16 April 2015, by Duncan Coutts.
Filed under industrial-haskell-group, cabal, community.

The members of the Industrial Haskell Group are funding work to improve the security of packages downloaded from Hackage. The IHG members agreed to fund this task some time ago and my colleague Austin and I have been working on the design and implementation.

In this post I will explain the problem and the design of our solution.

TL;DR

We’re implementing a system to significantly improve Hackage security. It’s based on a sensible design (The Update Framework) by proper crypto experts. The basic system is fully automatic and covers all packages on Hackage. A proposed extension would give further security improvements for individual packages at the cost of a modest effort from package authors.

It will also allow the secure use of untrusted public Hackage mirrors, which is the simplest route to better Hackage reliability. As a bonus we’re including incremental index downloads to reduce cabal update wait times. And it’s all fully backwards compatible.

Goals

The general aim is to make a significant incremental improvement to Hackage security, given limited resources to spend on the project. So that means covering as many users, packages and likely threats as we can.

We have had to think carefully about the implementation effort and what improvements will give the biggest security benefits. Our general approach has been to go for the biggest improvements first, but to leave the door open to further tightening later.

Crypto-humility and initial thoughts

Both Austin and I have some experience with cryptography and computer security. Austin used to work for a computer security company, and many years ago I worked at an anti-virus company on the problem of securing the software update process. So we know the general options. Importantly, we also know enough to know that we don’t know everything. Hopefully the combination will be enough to avoid security disasters like the recent Docker image download vulnerability.

The basic options are index signing, where the repository server signs the package index, and author signing, where package authors sign individual packages end-to-end.

Our initial thoughts were that having the server sign the index gives a bigger bang for the buck than a scheme with author signing. This is because with the server doing the signing we can cover all packages and not require any knowledge or effort by either the package authors or the developers downloading the packages. With author signing we fear we would not be able to cover the majority of packages, at least not in the near term. Without covering all the packages we cannot have untrusted Hackage mirrors. Our rough estimate was also that index signing would be less effort to implement. Of course we’re well aware that index signing provides shallower security than the end-to-end verification with author signing.

So while we started with a general idea of the kind of solutions that would fit with Hackage and our budget, we looked around for designs and analysis from better experts than ourselves.

Austin pointed out some prior art called The Update Framework (abbreviated TUF). TUF is an architecture for securing software update systems, though there is also a reference implementation in Python. It has been designed by a number of academics, based on a few published papers and practical experience with the Tor Project’s update system. There is ongoing work to use TUF to secure Python and Ruby’s package distribution systems (both of which are fairly similar to Hackage).

What is good about TUF is that it considers the problem as a whole, with a wide range of threats. It has what security researchers call a “threat model”, which is needed to be able to discuss with clarity what kinds of threats a system can protect against. As an architecture it is also flexible enough to cover a fairly wide range of update systems: one can use subsets and still be clear about what protections one has.

The elements of TUF are still based on the basic ideas of end-to-end signing and index signing, but it is put together in a comprehensive way. There are several attacks that TUF considers and protects against that we did not consider with our initial thoughts on index signing.

Indeed even if one goes for a system based only on end-to-end author-based signing (e.g. individual GPG-signed packages) then there are still several possible attacks. In retrospect this is clear, but it was a bit of a surprise since packages individually GPG-signed by their authors are often considered the gold standard.

The Update Framework

TUF is designed to protect against adversaries that can interfere with network traffic between you and the repository server, for example someone operating a wifi network you’re connected to, in control of a proxy, or operating a mirror of the repository. To a degree it is also designed to protect against adversaries that can compromise the repository server itself.

In a full implementation of the architecture, TUF subsumes both server-based index signing and end-to-end verification with author signing.

It relies on a number of different cryptographic keys with different roles. There are one or more root keys and a few subordinate keys for particular purposes. It does not rely on an external PKI, like the public certificate authority system that SSL/TLS uses.

The key idea with having multiple keys with different roles is that one can design a system in such a way that compromising one key is not enough to break the whole system. This relies on different keys living in different places so that compromising one doesn’t automatically imply compromising the others. Some keys have to live on public facing servers and so are at greatest risk, while others can be kept offline or on private servers and so are much harder to compromise.

Attacks

TUF provides protection against a number of attacks. If you want to imagine a concrete example, consider that the attacker runs or has compromised a mirror of Hackage that you use, and then think about each of these attacks: serving tampered packages; rollback attacks, where you are served old (possibly vulnerable) versions of packages; freeze attacks, where you are prevented from seeing new updates; mix-and-match attacks, where you are served an inconsistent combination of package versions and metadata; and slow download and endless data attacks, which waste your time and resources.

Interestingly, a naive approach based only on authors signing packages only solves the first of the attacks above. All the others are still possible, and several of them can lead directly to compromised client systems, while others can force you to use old possibly vulnerable software, or just DOS your system.

Roles for keys

TUF defines a number of roles for keys: the root keys, which sign the keys for the other roles and act as the root of trust; the timestamp key, which signs the small, regularly refreshed timestamp file; the snapshot key, which signs the package index; and the target key, which (directly or via delegated target keys held by package authors) signs individual packages.

Client verification process

In a full implementation, including delegated target keys, the client download and verification process goes like this. First the client downloads the tiny timestamp file and verifies its signature with the timestamp key; the timestamp file contains the size and hash of the current package index.

At this point we know (with cryptographic evidence) whether the package index has changed and what size and hash to expect when we download it. We also have a bound on the freshness of this information because the timestamp signature includes a short (e.g. day or two) expiry time.

Next, if the index has changed, the client downloads it, checks that the size and hash match what the timestamp file claims, and verifies the index signature made with the snapshot key.

At this point we have downloaded the package index successfully and know (again with cryptographic evidence) that it has not been tampered with since being created on the server and so we can trust all the metadata in the index.

Finally, to install a package the client downloads the package tarball, checks it against the hash recorded in the index, and verifies the package signature made with the author’s delegated target key, checking the delegation information (itself signed with the central target key) to confirm that this key is authorised for this package.

At this point we have downloaded an individual package and we know (with cryptographic evidence) that the package tarball has not been modified since it was created on the author’s system, and that this author is authorised to provide this named package.

Discussion

The different keys provide assurance of slightly different (but somewhat overlapping) things. If we only care about man-in-the-middle attacks then we only need the timestamp key and signature, since it includes the hash of the index which includes hashes of the packages.

The addition of the snapshot key to sign the index allows the snapshot key to be kept on a different system from the public-facing index server (think of an organisation like Red Hat with an internal system only available to people uploading packages). This timestamp/snapshot separation would allow the public-facing index server to be compromised with the damage being limited to freeze attacks (and the duration of this is bounded).

The addition of the target key (and delegated target keys) gives additional protection in the case that the index server is compromised. When target keys are being used, an attacker that has taken control of the index server can perform freeze, rollback and mix-and-match attacks, but cannot change the contents of individual packages, nor provide packages that they are not authorised to provide.

Again, it is important to emphasise that all the keys need to be kept in separate places for them to have any security benefit.

Note that slow download and endless data attacks can be countered: except for the initial tiny timestamp file, the client knows (securely) the expected download size, and so carefully written download code can notice these attacks.

The Update Framework for Hackage

TUF is an architecture (albeit with a reference implementation) designed to be flexible enough to cover many use cases (distros, individual apps etc) and so we are free to adapt it for our particular use case.

For the case of Hackage as currently designed, the timestamp/snapshot key distinction does not add anything, as both keys would need to be on the same system. This is because the Hackage server is public facing both for download and upload. There would need to be a separate more secure upload server for the timestamp/snapshot distinction to add value. So for the Hackage implementation we will merge the timestamp and snapshot keys into a single package index signing key.

We also need to adapt the format and layout of the various bits of TUF metadata to fit the existing Hackage index format. For the most part this is fairly straightforward. The timestamp file can be provided as a separate file over HTTP, since it needs to be small and downloaded before the package index. All the other files that TUF specifies can be included inside the Hackage index file. This is straightforward because the Hackage index is simply a tarball and is designed to be extensible with extra file entries.

Our development plan has us starting with the package index signing, while keeping the target keys feature in mind so that it can be added later without any compatibility problems. We have enough funding to complete the package index signing, and we hope that with this clear design roadmap we can secure either further funding or community volunteer effort.

Implementing target keys later would involve client tools (like cabal) generating keys and an out-of-band mechanism to submit them to be signed by the holder of the central target key (in practice the Hackage admins). For updating who is allowed to upload which package, a semi-manual process will be needed for the central target key holders to check and authorise re-signing the key delegation information. Tools like cabal would also need to be extended to sign tarballs and upload the signatures. So there would be some extra work for package authors but it could be kept fairly minimal. The main cost would be that changing the set of maintainers for a package on Hackage would become an asynchronous process involving a Hackage admin re-signing something.

As a final point, Hackage allows package authors and Hackage trustees to edit .cabal file metadata after release. So when we add target keys then these parts of the index should be protected with the target keys, not just the main package index key.

Comparing with other approaches

HTTPS

HTTPS is often mentioned as a way to improve Hackage security. As the TUF papers make clear, relying on the public PKI for security leaves a lot to be desired, given all the well-documented problems with trusting all the public certificate authorities in the world. In addition, relying on HTTPS does not allow for untrusted mirrors, whereas TUF’s index signing approach does allow that. Finally, on a practical note HTTPS is unfortunately complicated to support sanely on all platforms (OpenSSL is horrible, and distributing it for platforms that don’t come with it is painful). By contrast, TUF only relies on a crypto hash and signature verification algorithm. There are many choices for these including good choices that are small enough to embed existing audited C code directly into a Haskell library or tool. We plan to use ed25519 and in particular the C implementation extracted from NaCl.

GnuPG

Another approach that has been discussed and had some development effort is one based on author signing with GnuPG by Nikita Karetnikov. It was designed to fit with the existing Hackage index format with signatures being included in the package index tarball. The primary difficulty with a GnuPG-based system is in establishing the web-of-trust to a sufficient degree that signature verification can be done fully automatically without supervision from the user.

More GnuPG

Another recent GnuPG-based proposal comes from Chris Done. This proposal is under active discussion so in the spirit of constructive criticism it’s worth comparing it with TUF.

The first general point is that TUF has been designed by academic experts in the subject, based both on research and existing real-world systems. Our crypto-humility should cover not just crypto algorithms but extend to whole system designs. As I mentioned at the beginning of this post, though my initial thoughts on package index signing looked quite similar to the root+snapshot key part of the TUF system, there are many extra details that TUF covers that I would not have considered. With these in mind we can already spot underspecified areas in the proposal. For example one part of the spec says “Downloads the index from the web server or loads from cache if not modified”, but this leaves open the question of how we know securely if the index has been modified or not. TUF addresses this with the signed timestamp file. This kind of detail is not something I would have considered either, but TUF identifies the problem (that relying on HTTP cache control headers would be vulnerable to MITM) and has a solution. Generally in a system designed by crypto amateurs like us we would expect to have such holes. So we can do better by picking an existing expert design. TUF seems like a reasonable choice since it is designed to be somewhat flexible to fit different situations, and it’s being considered for both Python and Ruby’s package systems.

More specifically, the GnuPG-based proposal relies on a web of trust in which individual end users have to decide who they trust and configure their tools appropriately. A more recent version of the proposal includes signed mappings from packages to signers, which are basically statements of “I trust so and so to upload package foo”, but you still need to trust the author of that statement.

The system seems to be designed primarily for enterprise users who are using a carefully selected set of packages where a level of manual supervision to decide which authors to trust is justified. By contrast TUF’s target key delegation system is designed to be fully automatic for end users.

With TUF’s target key delegation, the chain of trust is very clear and simple: the root key signs the target key, which signs delegated keys and delegation information, and the delegated keys sign individual packages.

I also think it’s unfortunate that the signatures are not integrated as part of the Hackage index as in Nikita’s system (though it is in keeping with the general 3rd party design). On the positive side, the system has the flexibility to have 3rd parties make claims of trustworthiness, such as “I have manually audited this code” or “My CI system says this builds ok without obviously attempting to attack the system”. That extra complexity adds to the amount of manual supervision needed and seems to have been left out of the latest version of the proposal.

Finally, because the proposal has nothing corresponding to the TUF snapshot key signature, all the corresponding attacks are possible, including rollback, freeze and mix-and-match attacks.

A central authority

A suggestion that has come up in recent discussion is that there should be a central authority for who is allowed to upload which package and that that information should be transparent. The criticism is that while Hackage is currently that central authority, it does not generate any verifiable evidence.

With TUF the target key delegation information is precisely that cryptographic evidence. The holders of the target key are that central authority, and the signed delegation information is the cryptographic evidence of who is allowed to upload what; that information is distributed to all clients.

Incremental updates

There has been a design for incremental updates of the Hackage index floating around for some years. With the TUF approach of putting more signatures into the index, the pain of slow Hackage index downloads would become critical, so our design incorporates that existing design for incremental updates to the index. This means most index updates will be tens of kilobytes rather than multiple megabytes. This is actually even better than rsync (and likely better than git). The trick is that the tar format was originally designed to be append-only (for tape drives), so if the server simply updates the index in an append-only way then clients only need to download the tail (with appropriate checks and a fallback to a full update). Effectively the index becomes an append-only transaction log of all the package metadata changes. This is also fully backwards compatible.
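
To illustrate the client side of this idea, here is a minimal sketch using the http-client package; this is not the actual cabal-install implementation, and the URL is made up:

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as BS8
import qualified Data.ByteString.Lazy  as BL
import Network.HTTP.Client
import Network.HTTP.Types.Status (partialContent206)

-- Fetch only the bytes of the index beyond our local copy, using an HTTP
-- Range request; fall back to a full download if the server (or some
-- proxy) ignores the Range header.
fetchIndexTail :: Manager -> Int -> IO BL.ByteString
fetchIndexTail mgr localSize = do
  req0 <- parseRequest "http://hackage.example.org/01-index.tar"
  let range = BS8.pack ("bytes=" ++ show localSize ++ "-")
      req   = req0 { requestHeaders = ("Range", range) : requestHeaders req0 }
  resp <- httpLbs req mgr
  return $ if responseStatus resp == partialContent206
             then responseBody resp  -- just the tail: append it to the local index
             else BL.drop (fromIntegral localSize) (responseBody resp)

(The Manager would come from newManager defaultManagerSettings, and in the real system the downloaded bytes would of course also be checked against the securely known size and hash before being appended.)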

Conclusion

We think implementing TUF for Hackage will be a major improvement to security. The initial implementation without target keys will be completely transparent to both end users and authors, and then the later addition of per-author target keys will further improve security by guarding against certain attacks even if the central server is compromised.

We accept that although we are programming experts, we are not crypto experts. We are hopeful that by using TUF we are reusing an existing expert crypto design and will not fall into the traps that typically bedevil our fellow crypto amateurs.


OverloadedRecordFields revived

Monday, 30 March 2015, by Adam Gundry.
Filed under coding, ghc, well-typed.

Way back in the summer of 2013, with support from the Google Summer of Code programme, I implemented a GHC extension called OverloadedRecordFields to address the oft-stated desire to improve Haskell’s record system. This didn’t get merged into GHC HEAD at the time, because the implementation cost outweighed the benefits. Now, however, I’m happy to report that Well-Typed are sponsoring the work required to get an improved version into GHC. Moreover, the first part of the work is already up for review on Phabricator.

The crucial thing that has enabled OverloadedRecordFields to get going again is that we’ve found a way to factor the design differently, so that we get a much better power-to-weight ratio for the extension by splitting it into two parts.

Part 1: Records with duplicate field labels

The first step is to cut down the core OverloadedRecordFields extension as much as possible. The essential idea is the same as it ever was, namely that a single module should be able to use the same field name in multiple datatypes, as in this example:

data Person  = Person  { personId :: Int, name :: String }
data Address = Address { personId :: Int, address :: String }

These definitions are forbidden in normal Haskell, because personId is defined twice, but the OverloadedRecordFields extension will permit them and instead postpone name conflict checking to use sites. The basic extension will require that fields are used in such a way that the relevant datatype is always unambiguously determined, and the meanings of record construction, pattern matching, selection and update will not change. This means that the extension can always be enabled for an existing module and it will continue to compile unchanged, an important property that was not true of the previous design.

The Haskell syntax for record construction and pattern-matching is always unambiguous, because it mentions the data constructor, which means that code like this is perfectly fine:

p = Person { personId = 1, name = "Donald" }
getId (Person { personId = i }) = i

On the other hand, record selector functions are potentially ambiguous. The name and address selectors can be used without restrictions, and with their usual types, but it will not be possible to use personId as a record selector if both versions are in scope (although we will shortly see an alternative).

Record update is a slightly more interesting case, because it may or may not be ambiguous depending on the fields being updated. In addition, since updates are a special syntactic form, the ambiguity can be resolved using a type signature. For example, this update would be ambiguous and hence rejected by the compiler:

f x = x { personId = 0 } -- is x a Person or an Address?

On the other hand, all these updates are unambiguous:

g :: Person -> Person
g x = x { personId = 0 }                 -- top-level type signature

h x = x { personId = 0 } :: Person       -- type signature outside

k x = (x :: Person) { personId = 0 }     -- type signature inside

l x = x { personId = 0, name = "Daffy" } -- only Person has both fields

Overall, this extension requires quite a bit of rewiring inside GHC to distinguish between field labels, which may be overloaded, and record selector function names, which are always unambiguous. However, it requires nothing conceptually complicated. As mentioned above, the implementation of this part is available for review on Phabricator.

Part 2: Polymorphism over record fields

While the OverloadedRecordFields extension described in part 1 is already useful, even though it is a relatively minor relaxation of the Haskell scoping rules, another important piece of the jigsaw is some way to refer to fields that may belong to multiple datatypes. For example, we would like to be able to write a function that selects the personId field from any type that has such a field, rather than being restricted to a single datatype. Much of the unavoidable complexity of the previous OverloadedRecordFields design came from treating all record selectors in an overloaded way.

But since this is new functionality, it can use a new syntax, tentatively a prefix # sign (meaning that use of # as an operator will require a space afterwards when the extension is enabled). This means that it will be possible to write #personId for the overloaded selector function. Since we have a syntactic cue, it is easy to identify such overloaded uses of selector functions, without looking at the field names that are in scope.

Typeclasses and type families will be used to implement polymorphism over fields belonging to record types, though the details are beyond the scope of this blog post. For example, the following definition is polymorphic over all types r that have a personId :: Int field:

getId :: r { personId :: Int } => r -> Int
getId x = #personId x
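
To give a flavour of how this might work, here is a hypothetical sketch; the class name, its exact shape (including the Proxy trick) and the desugaring are illustrative only, not the final design:

{-# LANGUAGE DataKinds, KindSignatures, MultiParamTypeClasses,
             FunctionalDependencies, FlexibleInstances #-}

import Data.Proxy (Proxy (..))
import GHC.TypeLits (Symbol)

-- A constraint like r { personId :: Int } could stand for a class
-- constraint along these lines, with instances generated by the
-- compiler, one per field of each record datatype.
class HasField (field :: Symbol) r a | field r -> a where
  getField :: proxy field -> r -> a

data Person = Person { personId :: Int, name :: String }

instance HasField "personId" Person Int where
  getField _ = personId

-- #personId would then elaborate to (roughly) getField (Proxy :: Proxy "personId")
getId' :: HasField "personId" r Int => r -> Int
getId' = getField (Proxy :: Proxy "personId")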

Moreover, we are not limited to using #personId as a selector function. The same syntax can also be given additional interpretations, allowing overloaded updates and making it possible to produce lenses for fields without needing Template Haskell. In fact, the syntax is potentially useful for things that have nothing to do with records, so it will be available as a separate extension (implied by, but distinct from, OverloadedRecordFields).

Further reading

More details of the redesigned extensions are available on the GHC wiki, along with implementation notes for GHC hackers. Last year, I gave a talk about the previous design which is still a good guide to how the types work under the hood, even though it predates the redesign.


Qualified Goals in the Cabal Solver

Friday, 27 March 2015, by Edsko de Vries.
Filed under cabal, coding, industrial-haskell-group.

When you ask cabal-install to install one or more packages, it needs to solve a constraint satisfaction problem: select a version for each of the packages you want to install, plus all their dependencies, such that all version constraints are satisfied. Those version constraints come both from the user (“please install lens version 4”) and from the packages themselves (“this package relies on mtl between version 2.1 and 2.2”). Internally cabal-install uses a modular constraint solver, written by Andres Löh. It was first presented at the Haskell Implementors’ Workshop in 2011, and is based on the paper Modular Lazy Search for Constraint Satisfaction Problems.

For the Industrial Haskell Group, Well-Typed has recently extended the constraint solver to deal with qualified goals. In this blog post we will explain how this constraint solver works by means of a running example, and look at how the new qualified goals change things.

Overview of the Solver

The ability to decouple generating solutions from finding the right one is one of the classic reasons why functional programming matters, and cabal-install’s constraint solver makes very heavy use of this. It first builds a tree with lots of “solutions”; solutions in quotes because many of these solutions will not be valid. It then validates these solutions, marking any invalid ones. This might still leave many possible solutions, so after this we apply some preferences (we prefer newer packages over older ones, for example) and heuristics (we want to pick the version of base that is already installed, no point considering others), and then we use the first solution we find (if any).

It is very important to realize throughout all this that these trees are never built in their entirety. We depend on laziness to only evaluate as much as necessary. A key design principle throughout the solver is that we must have enough information at each node in the tree to be able to make local decisions. Any step that would require looking at the tree in its entirety would be a big no-no.

In the remainder of this section we will see what these different steps look like using a running example.

Building the tree

Suppose we have a package database with base version 4.0, mtl versions 1.0 and 2.0, both of which depend on base, and a package foo that depends on both base and mtl version 2.0.

When we ask cabal to install package foo, it constructs a search tree that looks something like this:

GoalChoice nodes, shown as G nodes in the diagram, represent points where we decide on which package to solve for next. Initially we have only one option: we need to solve for package foo. Similarly, PChoice nodes P represent points where we decide on a package version. Since there is only one version of foo available, we again only have one choice.

Once we have chosen foo version 1.0, we need to solve for foo's dependencies: base and mtl. We don’t know which we should solve for first; the order in which we consider packages may affect how quickly we find a solution, and which solution we return (since we will eventually report the first solution we find). When we build the tree we essentially make an arbitrary decision (depending on the order in which we happen to find the dependencies), and we record the decision using a GoalChoice node. Later we can traverse the tree and apply local heuristics to these GoalChoice nodes (for instance, we might want to consider base before mtl).

In the remainder of the tree we then pick a version for mtl (here we do have a choice in version), and then a version for base, or the other way around. Note that when we build the tree, package constraints are not yet applied: in the tree so far there is nothing that reflects the fact that foo wants version 2.0 of mtl, and every path ends with a Done node D, indicating success. Indeed, we would get precisely the same tree if we have a package DB

where foo depends on either version of mtl.

Validation

Once we have built the tree, we then walk over the tree and verify package constraints. As soon as we detect a violation of a constraint on a given path we replace that node in the tree with a failure node. For the above example this gives us the following tree:

Paths through the tree that lead to failure are not removed from the tree, but are replaced by explicit failure F. This helps with generating a good error message if we fail to find a solution. In this case, both failures are essentially the same problem: we cannot pick version 1.0 for mtl because foo needs version 2.0.
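
To make the shape of these trees concrete, here is a much-simplified sketch of the node types; the names are ours, and the real Tree type in cabal-install’s modular solver carries considerably more information:

newtype Package = Package String deriving (Eq, Show)
newtype Version = Version [Int]  deriving (Eq, Show)

-- In the real solver a conflict set is a set of variables, not just packages.
type ConflictSet = [Package]

data Tree =
    PChoice Package [(Version, Tree)] -- P: pick a version for a package
  | GoalChoice [(Package, Tree)]      -- G: pick which goal to solve next
  | Done                              -- D: success
  | Fail ConflictSet                  -- F: failure, with the reason recorded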

Heuristics and Preferences

After validation we apply a number of heuristics to the tree. For example, we prefer to pick a version of base early because there is generally only one version of base available in the system. In addition, we apply user preferences; for example, we try newer versions of packages before older versions. For our example this gives us

Finding a solution

After applying the heuristics we throw away all but the first choice in each GoalChoice node (but keeping all choices in the PChoice nodes)

and traverse the tree depth first to find a solution, returning the first solution we find. In our running example, this means that we will use version 1.0 of foo, 4.0 of base and 2.0 of mtl.

Whenever we encounter a Fail node we backtrack. This backtracking is aided by so-called conflict sets. I haven’t shown these conflict sets in the diagrams, but each Fail node in the tree is annotated with a conflict set which records why this path ended in failure. In our running example the conflict set for both Fail nodes is the set {foo, mtl}, recording that there is a conflict between the versions of mtl and the version of foo that we picked. The conflict set is used to guide the backtracking; any choice that we encounter while backtracking that does not involve any variables in the conflict set does not need to be reconsidered, as it would lead to the same failure.
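
Continuing the simplified sketch above, the final traversal could look something like this; this naive version backtracks simply by moving on to the next alternative, without the conflict-set-guided backjumping just described:

import Control.Monad (msum)

type Assignment = [(Package, Version)]

-- Depth-first search for the first successful path through the tree.
-- After the heuristics phase each GoalChoice node has only one child
-- left, but the code need not rely on that. Laziness ensures we only
-- force the parts of the tree that we actually visit.
explore :: Tree -> Maybe Assignment
explore Done             = Just []
explore (Fail _)         = Nothing
explore (GoalChoice gs)  = msum [ explore t | (_, t) <- gs ]
explore (PChoice pkg vs) = msum [ fmap ((pkg, v) :) (explore t) | (v, t) <- vs ]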

If we cannot find any solution, then we must report an error. Reporting a good error here however is difficult: after all, we have a potentially very large tree, with lots of different kinds of failures. Constructing an informative error from this is difficult, and this is one area where cabal-install might still be improved.

Qualified goals

Motivation

Normally cabal-install can only pick a single version for each package. For example, if we have a situation

we cannot install package D because it would require installing both versions 1.0 and 2.0 of package A (this is known as the diamond problem).

Setup scripts

Cabal packages can however have their own custom Setup scripts, when necessary, which are Cabal’s equivalent of the traditional ./configure scripts. In principle there should be no problem building these Setup scripts using different (and possibly conflicting) dependencies than the library itself; after all, the Setup script is completely independent from the library.

From Cabal 1.23 and up these setup scripts can have their own list of dependencies. Let’s suppose that in our running example the Setup script of foo has a dependency on any version of mtl:
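
In a .cabal file such a dependency would be declared with a custom-setup stanza, roughly like this (a sketch of the feature as it was being introduced):

custom-setup
  setup-depends: base, Cabal, mtl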

We want to allow the setup script to be compiled against a different version of mtl than foo itself, but of course we would prefer not to, in order to avoid unnecessary additional compilation time.

Qualification

In order to allow picking a different version, we introduce qualified goals for each of the setup dependencies. In our running example, this means that cabal-install will now try to solve for the variables foo, mtl, and base, as well as foo.setup.mtl and foo.setup.base. This makes it possible to pick one version for mtl and another for foo.setup.mtl.

Linking

But how do we make sure that we pick the same version when possible? One (non-) option is to look at the entire search tree and find the solution that installs the smallest number of packages. While that might work in theory, it violates the earlier design principle we started with: we only ever want to make local decisions. The search tree can be quite large; indeed, the addition of the single setup dependency to foo already makes the tree much larger, as we shall see shortly. We certainly never want to inspect the entire search tree (or, worse yet, have the entire search tree in memory).

Instead, we introduce the concept of linking. This means that when we select a version for foo.setup.mtl (say), in addition to being able to pick either version 1.0 or 2.0, we can also say “link the version of foo.setup.mtl to the version of mtl” (provided that we already picked a version for mtl).

Then we can make local decisions: when we pick a version, we prefer to link whenever possible. Of course, this is subject to certain constraints. In the remainder of this section we shall see how qualified goals and linking work in our running example, and identify some of these linking constraints.

Building the tree

The code to build the initial tree is modified to introduce qualified constraints for setup dependencies, but does not itself deal with linking. Instead, it builds the tree as normal, and we then add additional linking options into the tree as a separate phase.

The search tree for our running example gets much bigger now due to combinatorial explosion: we have two additional variables to solve for, and linking means we have more choices for package versions. Here’s part of the initial search tree:

Let’s follow along the spine of this tree to see what’s going on. We first consider foo and pick version 1.0, just like before (there is no other choice). Then, on this path, we first consider foo.setup.mtl, and we have two options: we can either pick version 1.0 or version 2.0. We pick version 1.0, and continue with foo.setup.base and pick version 4.0 (only one option).

But when we now consider mtl things are more interesting: in addition to picking versions 1.0 and 2.0, we can also decide to link the version of mtl against the version of foo.setup.mtl (indicated by a red label in the tree). Similarly, when we pick a version for base, we can now choose to link base against the version of foo.setup.base in addition to picking version 4.0.

Validation

When we link mtl against foo.setup.mtl, we are really saying “please use the exact same package instance for both mtl and foo.setup.mtl”. This means that the dependencies of mtl must also be linked against the dependencies of foo.setup.mtl.

In addition, ghc enforces a so-called single instance restriction. This means that (in a single package database) we can only have one instance of a particular package version. So, for example, we can have both mtl version 1.0 and mtl version 2.0 installed in the same package database, but we cannot have two instances of mtl version 2.0 (for instance, one linked against version 3.0 of transformers and one linked against version 4.0 of transformers) installed at the same time. Lifting this restriction is an important step towards solving Cabal Hell, but for now we have to enforce it. In our terminology, this means that when we have to solve for both (say) mtl and foo.setup.mtl, we can either pick two different versions, or we can link one to the other, but we cannot pick the same version for both goals.

So, in addition to the regular validation phase which verifies package constraints, we introduce a second validation phase that verifies these kinds of “linking constraints”. We end up with a tree such as

In this part of the tree, the two failures for mtl are because we picked version 1.0 for foo.setup.mtl, but since foo itself wants mtl version 2.0, we cannot pick version 1.0 for goal mtl nor can we link mtl to foo.setup.mtl. The two failures for base are due to the single instance restriction: since we picked version 4.0 for foo.setup.base, we must link base to foo.setup.base.

Heuristics

If we picked the first solution we found in the tree above, we would select version 1.0 of mtl for foo’s setup script and version 2.0 of mtl for foo itself. While that is not wrong per se, it means we do more work than necessary. So, we add an additional heuristic that says that we should consider setup dependencies after regular (library) dependencies. After we apply this heuristic (as well as all the other heuristics) we end up with

In this part of the tree we see one failure for foo.setup.mtl and two failures for foo.setup.base. The failure for foo.setup.mtl comes from the single instance restriction again: since we picked version 2.0 for mtl, we cannot pick an independent instance for foo.setup.mtl. The failure for foo.setup.base on the right is due to the same reason, but there is an additional reason for the left failure: since we chose to link foo.setup.mtl to mtl, its dependencies (in this case, foo.setup.base) must also be linked.

Finding a solution

As before, after applying the heuristics we prune

and we report the first solution we find. In this case, this means that we will pick version 2.0 for mtl and link foo.setup.mtl to mtl, and foo.setup.base to base.

Conclusions

Although we skimmed over many details, we covered the most important design principles behind the solver and the new implementation of qualified goals. One thing we did not talk about in this blog post is flag assignments: when the solver needs to decide on the value for a flag, it introduces an FChoice node into the tree with two subtrees for true and false, and then proceeds as normal. When we link package P to package Q, we must then also verify that their flag assignments match.
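
In terms of the simplified Tree sketch from earlier, this would amount to one extra constructor (again hypothetical):

newtype Flag = Flag String deriving (Eq, Show)

data Tree = PChoice Package [(Version, Tree)]
          | FChoice Flag Tree Tree -- one subtree for True, one for False
          | GoalChoice [(Package, Tree)]
          | Done
          | Fail ConflictSet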

Qualified goals and linking are now used for setup dependencies, but could also be used to deal with private dependencies, to split library dependencies from executable dependencies, to deal with the base-3 shim package, and possibly other purposes. The modular design of the solver means that such features can be added (mostly) as independent units of code. That doesn’t of course necessarily mean the code is also easy; making sure that all decisions remain local can be a subtle problem. Hopefully this blog post will make the solver code easier to understand and to contribute to.

