GHC 6.10.1 released!

Tuesday, 04 November 2008, by Ian Lynagh.
Filed under haskell-platform, well-typed.

After months of work, GHC 6.10.1 is finally released!

The headlines include a parallel garbage collector, extensible exceptions, haddock 2, Data Parallel Haskell as an extralib, and much more besides.

Well-Typed is pleased to have been able to play its part in making this release happen, with Ian working with the Simons at GHC HQ and many other developers across the world to bring this release together.

Meanwhile, Duncan visited Galois to team up with Don Stewart for some extensive pre-release testing. Their experiments with building all of the packages on Hackage give us confidence that this will be a solid release.

Thanks to everyone who played a part, be it development, testing or otherwise, in this release. We couldn't have done it without your help!


Haskell Platform talk at the London Haskell Users Group

Tuesday, 04 November 2008, by Duncan Coutts.
Filed under community, haskell-platform.

I'm talking about the Haskell Platform at the London Haskell Users Group this Thursday.

It is an extended version of the 10-minute talk Haskell: Batteries Included that Don Stewart and I presented at the recent Haskell Symposium.

Abstract:

Some people pick a programming language because it has the best type system, the best facilities for abstraction or perhaps the fastest compiler. Most people however pick a whole programming platform that lets them solve their problem the quickest or best. A programming platform consists of a language, a compiler, and a set of standard libraries and tools. Other popular programming languages have a standard platform that puts together everything you need to get started.

This talk is about the Haskell Platform. We'll cover what the Haskell Platform is and who it is for. We'll also look at the technical infrastructure and the social aspects of how it will be managed.

Not all of the details are set in stone. We need to have a discussion within the Haskell community about how the platform will be managed and extended, especially since it needs buy-in from package maintainers. My hope is to use this talk to kick off that discussion.


zlib and bzlib package updates

Sunday, 02 November 2008, by Duncan Coutts.
Filed under coding.

I'm pleased to announce updates to the zlib and bzlib packages.

The releases are on Hackage:

What's new

What's new in these releases is that the extended API is slightly nicer. The simple API that most packages use is unchanged.

In particular, these functions have different types:

compressWith   :: CompressParams
               -> ByteString -> ByteString
decompressWith :: DecompressParams
               -> ByteString -> ByteString

The CompressParams and DecompressParams types are records of compression/decompression parameters. The functions are used like so:

compressWith   defaultCompressParams { ... }
decompressWith defaultDecompressParams { ... }

There is also a new parameter to control the size of the first output buffer. This lets applications save memory when they happen to have a good estimate of the output size (some apps like darcs know this exactly). With a good estimate, the (de)compressed output lands in a single-chunk lazy bytestring, which apps can then convert to a strict bytestring with no extra copying cost.
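For example, a decompression helper along these lines becomes possible (a sketch only: it assumes the new field is called decompressBufferSize, so check the haddock docs for the exact record fields):

import qualified Data.ByteString as Strict
import qualified Data.ByteString.Lazy as Lazy
import qualified Codec.Compression.GZip as GZip

-- Decompress into a single chunk when we have a good size estimate,
-- then convert to a strict ByteString.  With an accurate estimate the
-- result is a single chunk, so the final concat does no copying.
decompressKnownSize :: Int -> Lazy.ByteString -> Strict.ByteString
decompressKnownSize expectedSize =
    Strict.concat . Lazy.toChunks . GZip.decompressWith params
  where
    params = GZip.defaultDecompressParams
               { GZip.decompressBufferSize = expectedSize }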

Future directions

The simple API is very unlikely to change.

The current error handling for decompression is not ideal. It just throws exceptions for failures like a bad format or an unexpected end of stream. This is a tricky area because lazy streaming does not mix easily with error handling.
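To see why, consider what it takes just to turn a failure into an Either today: you have to force the whole output up front, which throws away the streaming behaviour. A rough sketch (decompressEither is just a name made up for this example):

import Control.Exception (SomeException, evaluate, try)
import qualified Data.ByteString.Lazy as BS
import qualified Codec.Compression.GZip as GZip

-- Failures only surface when the lazy result is forced, so to report
-- them as a value we have to force the entire output first.
decompressEither :: BS.ByteString -> IO (Either SomeException BS.ByteString)
decompressEither input =
    try $ do
      _ <- evaluate (BS.length output)  -- forces the whole stream
      return output
  where
    output = GZip.decompress input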

One option, which I use in the iconv library, is to have a data type that describes the real error conditions, something like:

data DataStream
   = Chunk Strict.ByteString Checksum DataStream
   | Error Error -- for some suitable error type
   | End Checksum

This would come with suitable fold functions and functions to convert to a lazy ByteString. Then people who care about error handling and streaming behaviour can use that type directly. For example, it should be trivial to convert to an iterator style.
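To give a flavour, a fold and a lossy conversion might look something like this (the Checksum and Error placeholders and the function names are made up for the sketch, not a final API):

import Data.Word (Word32)
import qualified Data.ByteString as Strict
import qualified Data.ByteString.Lazy as Lazy

-- Placeholder types, just so the sketch stands alone.
type Checksum = Word32
data Error    = TruncatedStream | BadFormat String
                deriving Show

data DataStream
   = Chunk Strict.ByteString Checksum DataStream
   | Error Error
   | End Checksum

-- A general fold over the stream of chunks.
foldDataStream :: (Strict.ByteString -> a -> a)  -- a chunk
               -> (Checksum -> a)                -- successful end
               -> (Error -> a)                   -- failure
               -> DataStream -> a
foldDataStream chunk end err = go
  where
    go (Chunk c _ rest) = chunk c (go rest)
    go (End checksum)   = end checksum
    go (Error e)        = err e

-- The lossy conversion for people who just want a lazy ByteString and
-- are happy with an exception on failure.
toLazyByteString :: DataStream -> Lazy.ByteString
toLazyByteString =
    Lazy.fromChunks
  . foldDataStream (:) (const [])
      (\e -> error ("decompression error: " ++ show e))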

People have also asked for a continuation-style API to give more control over dynamic behaviour like flushing the compression state (e.g. in an HTTP server). Unfortunately this does not look easy. The zlib state is mutable, and while this can be hidden behind a lazy list, it cannot be hidden if we provide access to intermediate continuations. That is because those continuations can be re-run, whereas a lazy list evaluates each element at most once (and with suitable internal locking this is even true for SMP).

Background

The zlib and bzlib packages provide functions for compression and decompression in the gzip, zlib and bzip2 formats. Both provide pure functions on streams of data represented by lazy ByteStrings:

compress, decompress :: ByteString -> ByteString

This makes it easy to use either in memory or with disk or network IO. For example a simple gzip compression program is just:

import qualified Data.ByteString.Lazy as BS
import qualified Codec.Compression.GZip as GZip

main = BS.interact GZip.compress

Or you could lazily read in and decompress a .gz file using:

content <- GZip.decompress <$> BS.readFile file
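As a complete (if minimal) program, with made-up argument handling added just for the example, that might look like:

import Control.Applicative ((<$>))  -- needed on older GHCs
import qualified Data.ByteString.Lazy as BS
import qualified Codec.Compression.GZip as GZip
import System.Environment (getArgs)

-- Print the decompressed contents of the .gz file named on the
-- command line.
main :: IO ()
main = do
  [file] <- getArgs
  content <- GZip.decompress <$> BS.readFile file
  BS.putStr content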

General

Both packages are bindings to the corresponding C libs, so they depend on those external C libraries (except on Windows where we build a bundled copy of the C lib source code). The compression speed is as you would expect since it's the C lib that is doing all the work.

The zlib package is used in cabal-install to work with .tar.gz files. So it has actually been tested on Windows. It works with all versions of ghc since 6.4 (though it requires Cabal-1.2).

The darcs repos for the development versions live on code.haskell.org:

I'm very happy to get feedback on the API, the documentation or of course any bug reports.


Some ideas for the Future of Cabal

Tuesday, 07 October 2008, by Duncan Coutts.
Filed under haskell-platform.

I presented a "Tech Talk" today at Galois on some ideas relating to Cabal.

We discussed two topics. Here's the abstract:

A language for build systems

Build systems are easy to start but hard to get right. We'll take the view of a language designer and look at where our current tools fall down in terms of safety/correctness and expressiveness.

We'll then consider some very early ideas about what a build system language should look like and what properties it should have. Currently this takes the form of a design for a build DSL embedded in Haskell.

Constraint solving problems in package deployment

We are all familiar, at least peripherally, with package systems. Every Linux distribution has a notion of packages and most have high level tools to automate the installation of packages and all their dependencies. What is not immediately obvious is that the problem of resolving a consistent set of dependencies is hard; indeed, it is NP-complete. It is possible to encode 3-SAT or Sudoku as a query on a specially crafted package repository.

We will look at this problem in a bit more detail and ask if the right approach might be to apply our knowledge about constraint solving rather than the current ad-hoc solvers that most real systems use. My hope is to provoke a discussion about the problem.

I've covered similar ground to the first topic in a previous blog post on make. This time we looked in a bit more detail at what a solution might look like and what properties it should have; in particular, we discussed what the correctness properties of a build system might be.


Slides from the Haskell Platform talk

Friday, 26 September 2008, by Duncan Coutts.
Filed under haskell-platform.

I promised to post the slides from our talk on the Haskell Platform which Don and I presented at the Haskell Symposium yesterday.

Haskell: Batteries Included

Malcolm did us all a great service by videoing the talks. Unfortunately he had to catch his flight home before our talk so there is no video for that one.

Don did the talk with the slides and I did the live demo. Fortunately the demo worked. We ran the new hackage server on my laptop and we invited people to connect. I demoed uploading a new package and within a few seconds people in the audience were able to download and install it using cabal-install. That of course is old hat to the open source Haskell hackers but part of what we were trying to do is to persuade the academics to make better use of Hackage to publish libraries and tools that they develop as part of their research.

One new thing we demoed was generating build reports and uploading them back to the hackage server. In fact we had several people in the audience upload build reports for the new package within 30 seconds of me uploading it. The build reporting is part of the plan for testing the packages in the Haskell Platform but more generally to gather information on what packages build in what environments.


Hackage hacking and demo

Wednesday, 24 September 2008, by Duncan Coutts.
Filed under community.

Don and I are doing our quick talk about hackage and the Haskell platform tomorrow. Chris and Eelco from Tupil have been helping us prepare some cool visualisations of Hackage. This one shows each package as a circle with the size indicating the number of other packages that use it. So the base package is of course the biggest, with 754 other packages depending on it.

Packages on hackage

Here we have Neil and the Tupil guys brainstorming about the user interaction and visual design of the new hackage. The big idea is using search (i.e. hoogle) as the primary interface.

Hackage hacking

The new hackage server implementation is something that Lemmih (of HAppS fame) and I have been working on over the last couple of months. We'll be demoing it in the talk tomorrow. I promise I'll announce it properly some time soon.


Well-Typed at ICFP

Wednesday, 24 September 2008, by Duncan Coutts.
Filed under community.

ICFP is great fun as usual. There have been some good talks and there's a packed schedule for the next few days. Of course the most important business takes place during the coffee breaks and in the pubs in the evenings.

One thing that feels pretty new and exciting this year is the number of Haskell hackers going professional and independent. Apart from Well-Typed of course there's also the Tupil guys doing web development, and Conal Elliott has set himself up as a consultant doing functional graphics.

Thursday will be the Haskell Symposium. Don Stewart and I will be giving a short talk about the Haskell Platform. You can see the two page summary paper and I promise I'll post the slides. I hope it will be videoed (guerrilla style so it will not just disappear into the black hole that is the ACM digital library).

Don and I will also be chairing the "Future of Haskell" discussion on Thursday afternoon. There are some interesting things to talk about: we'll probably hear about progress on goals from last year, in particular SMP and Haskell-prime. Then we'll open discussion to the floor to talk about key goals for next year. So I'm looking forward to that.


The new haskell.org community SPARC server is online

Tuesday, 02 September 2008, by Duncan Coutts.
Filed under community.

As I'm sure you remember, as part of the Haskell OpenSPARC project, Sun has donated a SPARC Enterprise T5120 Server to the Haskell.org community.

Björn took delivery of the server a month or so ago and now that the IT folk at Chalmers are back from holiday it's installed in the server room. It's just been given a public IP and I've been able to log in for the first time. We've now got to get software configured and get ghc working, so don't ask for accounts just yet.

Björn took some photos as he was getting it set up:

Photo of the inside of a T5120 server, showing the heatsink and all the memory sticks

So that's what 32GB of memory looks like! Under that little heatsink is the T2 CPU with its 8 cores, each core multiplexing 8 threads.

A T5120 with the case off showing all the components

It's a 1U form factor so it uses lots of little fans which you can see at the left and clear plastic ducts to channel the airflow over the memory and the CPU heatsink. In the bottom right you can see the dual power supplies.

Remember, if you want to hack on this project for three months, the closing deadline for applications is Friday the 5th of September.

I should make it clear that although I am a consultant with Well-Typed and also the coordinator for the OpenSPARC project, the project is really nothing to do with Well-Typed. It's a joint project between Sun Microsystems and the Haskell.org community. I'm wearing my community hat for this one.


What's wrong with make?

Thursday, 21 August 2008, by Duncan Coutts.
Filed under coding.

Actually, make is really pretty good. Well, GNU make is pretty good. When used correctly it can rebuild large projects in the minimum number of steps, in parallel.

So what is wrong with it? I'm not complaining about the simple annoyances like insane use of tabs or being based on two inflexible programming languages (shell script and make macros). No, it's got two fundamental problems: it cannot enforce correct dependency tracking and its dependency language is insufficiently expressive for common tasks.

It is possible to write incorrect make rules

It's far too easy to write incorrect rules -- rules where the named dependencies and targets do not match up with the action:

foo : bar
    cat bar baz > foo

Obviously that's a totally contrived example, and it is easy to spot that it has an untracked dependency on baz. However, it gets much harder with more complex rules, and when the programs you call read or write files that you did not know about.

So the problem is that make cannot check whether you've written a bogus rule, and yet if you have, your makefile is broken. What makes it worse is that it is not immediately obvious that the makefile is broken. It will work initially, but you'll end up not trusting parallel make or running into quirks that are solved by make clean. If you have to use make clean, your makefile is broken.

The reason it cannot check rules for correctness is of course that the actions are opaque blobs of shell script. The decision to use shell script made it quicker to implement make, and users were already familiar with the language, but the cost is that makefiles are more or less impossible to check automatically.

Dynamic dependencies cannot be expressed sanely in make

Make takes the view that there is a single massive static dependency graph. The single and massive aspects are not a problem but the static one is. The problem is that in many cases you do actually need to run some build actions before you know all the dependencies of each file in a project. The canonical example is running a tool to discover the dependencies of .hs or .c files.

GNU make has a clever extension which makes it possible to express this kind of dependency in simple cases. It lets one write include deps.mk. That reads in the deps.mk file and extends the dependency graph with all the rules found in it. The clever thing is that deps.mk itself can depend on other things and can have a rule to generate it. For the example of the dependencies of .hs or .c files, this lets us write a rule to call the compiler with a flag to dump make-format dependencies into a file. This works quite nicely, in this simple case.

The problem with doing include deps.mk is that the file can specify any rules at all. It can add edges between any existing nodes in make's dependency graph. That means that make cannot be confident that it knows everything unless it has got deps.mk up to date and included it. So before anything else can be done, make has to update deps.mk, even if it turns out in the end that the deps given in the include file are irrelevant to the task at hand.

In practice we do not need to be able to add edges willy-nilly all over the place, and there is a great benefit to being more disciplined. Suppose for the sake of argument that for each .hs file I have a .deps file listing the files that the .hs file depends on. Now if I am trying to update foo.hs then I only need to look at the suffix rule for .hs files and the rules in foo.deps. I know that I do not need to look in bar.deps (unless I discover from foo.deps that foo.hs depends on bar.hi). This is a massive saving. It means I do not have to bring all the .deps files up to date. That also means I do not have to bring the corresponding .hs files up to date. That also means I do not have to run any pre-processors or code generators for those .hs files. That also means I do not have to compile the code generators.

You might think this is a contrived situation but it is not. The Gtk2Hs build system uses GNU make (non-recursive of course). It has lots of .chs.pp files which get pre-processed with cpp and then c2hs to generate a .hs file. For each .chs file we look for all the imports and generate a .deps file which gets included into the top level makefile. So just to calculate all the dependencies we actually have to do about two thirds of the build. That's because we first have to build c2hs and two other code generators and then run cpp and c2hs on a hundred or so files, just so we can generate all the .hs files and find their dependencies. This means that from a clean state I cannot build just one particular .hs file because the very makefile itself depends on all the other .hs files. Amusingly, just to run make clean, make has to build half the project. Obviously that is insane, so we have to hack it using includes that are conditional on the make target. That's not a technique that scales, however.

In any non-trivial build system it is necessary to interleave build actions with discovering dependencies. Classic make ignores this fact completely. GNU make notes it but forces us to do all the actions to discover the full dependency graph up front before we can get round to useful work. A better build system would be built around the notion of interleaving graph exploration with graph reduction. Saizan's make-like framework is built around this principle.

Could we do better?

Looking around at previous work, like the classic Recursive Make Considered Harmful paper or the Vesta and Maak build systems, one lesson comes up again and again: track dependencies absolutely precisely or you lose.

I am convinced that the solution to the first problem is not to specify the dependencies and targets separately from the action. They should be specified together so that there is simply no way to express an untracked dependency. This is also the approach taken by Vesta and Maak.

If we have a bunch of primitive actions that correctly declare their dependencies, and ways of combining actions that remember dependencies, then we should never end up with untracked dependencies. For example, consider the contrived example from above again. Let's suppose we have two primitives:

readFile  :: FilePath -> Rule String
writeFile :: FilePath -> Rule (String -> ())

Clearly one has a single dependency and the other a single target. Let's write that contrived rule again:

rule = write <*> read
  where
    write = writeFile "foo"
    read  = (++) <$> readFile "bar" <*> readFile "baz"

So the intuition here is that applicative composition gives us static rule-graph composition. As for rules with dynamic dependencies, this corresponds to monadic composition:

(<**>) :: Rule a -> Rule (a -> b) -> Rule b
(>>=)  :: Rule a -> (a -> Rule b) -> Rule b

This lets us express the cases where the very structure of the dependency graph depends on a value calculated from another part of the graph.

With static composition we can look at the resulting rule and 'see' down both branches of the composition to discover the graph structure without running any actions. With dynamic composition we can look at the left branch statically but the right branch requires the value produced by the left before we can even discover its structure.

So we should use static composition where possible because it tells us more about the graph which gives us more opportunities for parallelism and for caching intermediate results to avoid needless recompilation. Where necessary we can use dynamic dependencies but we should not over-use them or our dependency 'graph' would just become one long chain.
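To make that distinction concrete, here is a throwaway toy encoding of the idea (it is not Saizan's framework, nor a proposed implementation; all the names are invented for the example). A Rule is just a syntax tree that remembers the files it declares, so we can list the statically visible inputs of a composition without running anything:

{-# LANGUAGE GADTs #-}
import Prelude hiding (readFile, writeFile)

data Rule a where
  Pure  :: a -> Rule a
  Read  :: FilePath -> Rule String
  Write :: FilePath -> Rule (String -> ())
  Ap    :: Rule (x -> a) -> Rule x -> Rule a
  Bind  :: Rule x -> (x -> Rule a) -> Rule a

instance Functor Rule where
  fmap f = Ap (Pure f)

instance Applicative Rule where
  pure  = Pure
  (<*>) = Ap

instance Monad Rule where
  (>>=) = Bind

readFile :: FilePath -> Rule String
readFile = Read

writeFile :: FilePath -> Rule (String -> ())
writeFile = Write

-- The inputs a rule is statically known to read.  Everything to the
-- right of a Bind is invisible until its left-hand side has produced a
-- value: exactly the static/dynamic distinction described above.
staticInputs :: Rule a -> [FilePath]
staticInputs (Pure _)   = []
staticInputs (Read f)   = [f]
staticInputs (Write _)  = []          -- a target, not an input
staticInputs (Ap f x)   = staticInputs f ++ staticInputs x
staticInputs (Bind x _) = staticInputs x

-- The contrived static rule from above: both inputs are visible.
rule :: Rule ()
rule = writeFile "foo" <*> ((++) <$> readFile "bar" <*> readFile "baz")

-- A dynamic rule: which files get read next depends on the contents of
-- foo.deps, so only foo.deps is statically visible.
dynRule :: Rule [String]
dynRule = readFile "foo.deps" >>= \deps -> traverse readFile (lines deps)

Here staticInputs rule is ["bar", "baz"] while staticInputs dynRule is just ["foo.deps"]; a real system would of course also interpret the tree to run the actions and do the up-to-date checking.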

So, that's the idea. I should make clear: there is no corresponding implementation! Unlike my vaporware, Saizan has a working implementation of a slightly lower level API as part of his GSoC project. We have talked about putting a higher level API on top though it's not completely clear how to do that yet.

There are also undoubtedly connections to FRP that could do with further exploration. I mean it is quite clearly an FRP style problem, but an ugly grungy one where many parts work by side effects in the file system.


Solving the diamond dependency problem

Tuesday, 19 August 2008, by Duncan Coutts.
Filed under coding.

In a previous post I explained the dreaded diamond dependency problem. In this post I'll look at a solution.

Ever seen an error like this when building a package?

Couldn't match expected type
  `bytestring-0.9.0.4:Data.ByteString.ByteString'
  against inferred type `ByteString'

Or when configuring a package?

Warning: This package indirectly depends on multiple versions
of the same package. This is highly likely to cause a compile failure.
package pureMD5-0.1.2 requires bytestring-0.9.0.4
package binary-0.4.1 requires bytestring-0.9.1.0
package cabal2arch-0.3 requires bytestring-0.9.1.0

These failures were pretty common.

So the good news is that the latest cabal-install release does solve the problem. It contains a completely new package dependency resolution algorithm based on constraint solving.

The problem space is this: we've got a bunch of packages, say A, B, C, D.
Diamond dependency graph

We've got a bunch of versions of those packages, let's say: A-1.0 and A-1.1; B-1.0; C-1.0 and C-2.0; D-1.0 and D-1.1.

Each package version has dependencies on other packages along with version constraints; for example, A-1.0 might need C >= 1 and A-1.1 might need C >= 2. Then given a goal like A >= 1, we have to pick exactly one version of package A and one version of C and so on recursively until we have picked a package version to satisfy each dependency. So even if we can get to a package by following two different dependency paths, we must end up picking the same version. That's what it means to avoid inconsistent dependencies. Of course the version we pick for a package has to satisfy the version constraints of all the packages that depend on it.

The problem space can actually get extremely large. In theory each choice about a package version is independent. In the above example, supposing we have to pick a version of A, B, C and D, we have two choices for A, one choice for B and two choices each for C and D, which gives us 2*1*2*2 = 8 combinations overall. It is clear that we cannot explore the entire space of combinations because it's exponential in size. That may not seem important for this example with just 4 packages, but there are over 700 packages on hackage and we'd like to be able to install most of them at once. We need to apply some smarts.

Note that this is actually an NP-complete problem. It's possible to encode 3-SAT into the package dependency problem. Fortunately most real instances are fairly easy so we need not give up hope.
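To make the problem statement concrete, here is a toy brute-force enumeration of the small example above (the dependency ranges beyond the ones already mentioned are invented for the example, and exhaustive enumeration is exactly what a real solver must avoid):

type Pkg    = String
type Ver    = String
type Choice = [(Pkg, Ver)]

versions :: Pkg -> [Ver]
versions "A" = ["1.0", "1.1"]
versions "B" = ["1.0"]
versions "C" = ["1.0", "2.0"]
versions "D" = ["1.0", "1.1"]
versions _   = []

-- Dependencies of each package version: the target package and the
-- versions of it that would be acceptable.
deps :: Pkg -> Ver -> [(Pkg, [Ver])]
deps "A" "1.0" = [("B", ["1.0"]), ("C", ["1.0", "2.0"])]  -- A-1.0 needs C >= 1
deps "A" "1.1" = [("B", ["1.0"]), ("C", ["2.0"])]         -- A-1.1 needs C >= 2
deps "B" "1.0" = [("D", ["1.0", "1.1"])]
deps "C" "1.0" = [("D", ["1.0"])]
deps "C" "2.0" = [("D", ["1.0", "1.1"])]
deps _   _     = []

-- A choice is consistent if every dependency of every chosen version is
-- satisfied by the single version chosen for that package.
consistent :: Choice -> Bool
consistent choice = and [ maybe False (`elem` range) (lookup q choice)
                        | (p, v) <- choice, (q, range) <- deps p v ]

-- Enumerate the whole (exponential) space of assignments and keep the
-- consistent ones.
solutions :: [Pkg] -> [Choice]
solutions pkgs = filter consistent [ zip pkgs vs | vs <- mapM versions pkgs ]

main :: IO ()
main = mapM_ print (solutions ["A", "B", "C", "D"])

Even on this tiny universe the enumeration visits all 8 assignments; the real solver has to give the same consistency guarantee without ever enumerating the space.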

Ultimately we have to make choices between versions of a package and we have to end up with a consistent set of choices (or fail). It makes sense therefore to keep track of which versions of which packages are still valid choices. We can make the problem easier if we can eliminate choices that we know would lead to impossible or inconsistent solutions. We can propagate information both up and down the dependency graph to help us eliminate choices.

For example, suppose we looked at D-1.0 and D-1.1 and we discovered that D-1.0 could not possibly be built on our operating system but D-1.1 could. If later on we discover we need some version of D then we only have one possibility, so we can pick it right away. More interestingly, it might also allow us to eliminate a choice higher up. Supposing for the sake of argument that C-1.0 needed D-1.0 and would not work with D-1.1, then by eliminating D-1.0 we can also eliminate C-1.0.

In the downwards direction, consider the effect of picking A-1.1 rather than A-1.0. We said earlier that A-1.0 needs C >= 1 while A-1.1 needs C >= 2. So making this choice for A eliminates C-1.0.

One class of strategies is to make a series of firm choices and try to build up to a consistent solution. If we do this with no backtracking then we will always run in polynomial time, but there will always be cases where we cannot find solutions even though a solution exists. The chance of us finding a solution in this kind of strategy depends on the order in which we make the choices.

The algorithm

The strategy I've adopted is to make package choices in a top-down order. It is a kind of mixture between breadth-first and depth-first search. We have a set of packages that we know we will need and a set of specific package versions we've already picked. We also maintain a set of constraints on all the packages. This lets us know what versions are available for each package that we've yet to decide about.

In each round we pick a package, choose one version of it and update our set of constraints; if the package we picked depends on a new package then we add that to the pending package set. When there is a package we know we need and there is only one version left, then we pick it. That's the easy case.

The hard case is when all remaining packages have at least two versions we could pick. Here we just have to pick something and hope for the best. First of all, we have to decide which package we're going to make a choice for. A good heuristic here is to decide on packages higher up the dependency graph earlier, since their dependencies impose constraints on later choices. We can do this by caching the depth first numbering for each package. Once we've decided which package to go for, we have to decide which version to pick. The most sensible thing is to pick the highest version; however, there is a complication with installed packages that I'll come back to later.

Results

So this algorithm produces pretty good results in most cases. It always runs in polynomial time, and because of the way it gathers constraints it is guaranteed that if it finds a solution then it will be a valid, consistent solution. The old dep solver would very often come up with inconsistent solutions. Remember: consistent solutions avoid the diamond dependency problem.

Another good thing about gathering constraints is that when we fail we have some information about why we failed, because we know which additional constraint made the constraint set inconsistent. We can translate that into a human-readable error message, though not necessarily a particularly brilliant one.

By manually adding additional constraints we can also guide the solver to find solutions it would not otherwise find, or to avoid failure. In a sense this is just a hack to work around the fact that the solver isn't cleverer, but in the meantime it's pretty handy.

You can play around and get a sense of what the algorithm is doing if you run cabal install --dry-run -v pkg1 pkg2 etc and look at the intermediate output and the final solution.

The biggest test I've tried so far is trying to install all of hackage simultaneously. It takes several seconds to find a solution, which is not too bad. I had to discard about 30 packages to arrive at a consistent set. Adding any one of the discarded ones back leads to inconsistent dependencies, for example conflicts over which version of HaXml is required. There are also many packages that simply cannot ever be installed because they have missing dependencies.

Installed packages

So what about installed packages? I mentioned that there's a complication there. We often have to decide between picking an already installed version of a package and the available one from hackage (even if they are the same version). If we pick the installed version then we immediately pick up a lot of precise constraints, because the installed package depends on exact versions. That is often enough to prevent us from finding a solution. Indeed in the diamond dependency examples I gave previously it was clear that to find a consistent solution we would have to rebuild some existing installed package. So we cannot default to preferring installed packages.

A simple observation is that the available package always has looser constraints than the installed version, so it's always safe to pick the available version over the installed version. We can always 'improve' our solution at the end if we find the constraints we end up with are consistent with the constraints of the installed version. And that is what we do.

There is a further issue however. If ultimately we would like the option to use the installed version then we had better pick the same version of the available package and not just the highest available version. What we do is allow a per-package preference for the highest available version or the highest available version that has a corresponding installed version. For the 'install' command we prefer the latest version for the packages explicitly given and the installed version for all others. For the 'upgrade' command we prefer the latest version everywhere.

That seems to work pretty well, though I've found people do get somewhat confused when cabal-install decides to re-install a package they already have. What is not clear to them is that it is rebuilding a package against different versions of its dependencies. Probably the thing to do is to make that clearer in the output of cabal install --dry-run.

Improvements

There are still plenty of improvements that could be made. The algorithm currently does not deal well with installing packages for older compilers, because it finds out too late that a package it chose could only work with a newer version of the base library. The solution there is some bottom-up constraint propagation. We should check at the beginning that each package version could possibly be installed with the current compiler and if not eliminate it.

There are still real cases where we end up with an avoidable conflict. For example we might have two packages that need HaXml, let's say Foo needs HaXml =1.13.* while there are two versions of Bar, one that works with HaXml =1.13.* and a later version that needs HaXml >= 1.16. If we end up picking Bar before Foo then we would default to the later version of HaXml but then fail when we find Foo needs the earlier version. This is not solved by the ordering heuristic because Foo and Bar do not depend on each other. This is an example where we need either more look-ahead, backtracking, or a much more sophisticated algorithm in general.

The right long term approach I think is a proper complete constraint solving algorithm.

