Parallel Haskell Digest 7

Eric Kow – Saturday, 24 December 2011

GHC 7.4 is coming! There is loads to look forward to, but sometimes, it's the little things that count. For example, do you hate the fact that you can't just flip on an +RTS -N without having to first recompile your program, this time remembering to throw an -rtsopts on it? Duncan Coutts has relaxed the requirement so that commonly used RTS options can be used without it. This flag was originally implemented to counter security problems for CGI or setuid programs; however, it was also a hassle for regular users because it got in the way of common options like -eventlog, -N, or -prof. The GHC 7.4 RTS will make a better tradeoff between security and convenience, allowing a common set of benign flags without needing -rtsopts.

That's the sort of thing that the Parallel GHC Project is about. We want to push parallel Haskell out into the real world, first by helping real users (our ~~guinea pigs~~ industrial partners) to apply it to their work, second by making it easier to use (tools, libraries), and finally communicating more about it (this digest).

In this month's digest, we'll be catching up on news from the community. After the holidays, we'll be back with some new words of the month exploring a bit of concurrent Haskell. In the meantime, happy hacking and Merry Christmas!

News

Job Opportunity at Parallel Scientific

Peter Braam wants you, parallel Haskeller!

Parallel Scientific, LLC is a Boulder, CO based early stage, but funded startup company working in the area of scalable parallelization for scientific and large data computing. We are implementing radically new software tools for the creation and optimization of parallel programs benefiting applications and leveraging modern systems architecture. We build on our mathematical knowledge, cutting edge programming languages and our understanding of systems software and hardware. We are currently working with the Haskell development team and major HPC laboratories world wide on libraries and compiler extensions for parallel programming.

Note the mandatory Haskell experience and the desirability of “in depth knowledge of core Haskell libraries for parallel programming (NDP, REPA etc)”.

Parallel GHC Project Update

The Parallel GHC Project is an MSR-funded project, run by Well-Typed, with the aim of demonstrating that parallel Haskell can be employed successfully in real world projects.

Our most recent work has been in polishing the upcoming ThreadScope release that we previewed this September at the Haskell Implementor's Workshop. This new release comes with goodies for users of Strategies or the basic par/pseq parallelism: spark creation/conversion graphs, visualisations showing your spark pools filling and emptying, and histograms displaying the distribution of spark sizes. All this with the aim of helping you gain deeper insight, not just what your program is doing but why.

We've also done backend work to make ThreadScope even more useful further down the road. First, we have improved the ghc-events package by encoding the meanings of events in state machines. This makes it possible to validate eventlogs, and doubles as an always up-to-date source of code as documentation. Second, we have extended the GHC RTS to emit the startup wall-clock time and Haskell threads labels to the eventlog. The wall-clock time event allows us to synchronise logs for simultaneous processes, brining us a step closer to using ThreadScope on distributed programs. Named Haskell thread make it easier to distinguish threads from each other.

Finally, we have been exploring the use of Cloud Haskell for high performance computing on clusters. To do this we would need to abstract Cloud Haskell over different transport mechanisms, that is to develop a robust Cloud Haskell implementation sitting on top of a swappable transport layer. We have posted an initial design for this layer on the parallel-haskell list. We have taken the substantial feedback into consideration and will be sending a revised design and recording it in a page on the GHC wiki. Meanwhile, we are working to further validate our design on simple models of both the transport layer and a cloud Haskell layer on top. Longer term, we aim to implement some transports, an IP transport in particular and perhaps a single-node multi-process transport using forks and pipes.

Tutorials and Papers

Tutorial: Deterministic Parallel Programming in Haskell (7 Oct)

Well-Typed's Andres Löh presented a parallel programming tutorial at the recent Haskell in Leipzig meeting. The tutorial comes with slides, exercises, sample code. It paints a picture of the parallel Haskell landscape, and then focuses on one of the many possible approaches (namely, strategies). One nice feature of the tutorial is an emphasis on practicalities, for example, on using ThreadScope to figure out where performance goes wrong in a program. So if you're looking for a way to get started using on parallelism to speed up your Haskell code, give Andres' tutorial a try!

Parallel Genome Assembly with Software Transactional Memory (27 Oct)

Ketil Malde wrote up some of his experiences using STM to parallelise an inherently complicated program best solved with multiple interacting threads. His article demonstrates that a program using STM is able to successfully parallelize the genome scaffolding process with a near linear speedup. Ketil would be interested in any feedback the community may have.

Blogs and Packages

Actors, actors everywhere

remote: Cloud Haskell is here! (27 Oct)

You may have been hearing a lot about Cloud Haskell lately, the new Erlang-ish distributed programming library for Haskell. Now's your chance to see what all the fuss is about! Jeff Epstein has uploaded the remote package to Hackage, so take it for a spin by doing
```
cabal update
cabal install remote
```
Library documentation is on the Hackage page, and more details are available in the paper Towards Haskell in the Cloud

Distributed storage in Haskell (30 Oct)

So what are people doing with Cloud Haskell? Julian Porter for one has been working on a distributed monadic MapReduce implementation. Along the way he's produced a general proof of concept for distributed storage. Have a look at Julian's page for a short paper and GitHub page.

simple-actors 0.1.0 released (11 Oct)

Brandon Simmons accounced simple-actors, an EDSL-style library for writing more structured concurrent programs, based on the Actor Model. It was designed for local concurrency, as an alternative to ad-hoc use of Chans, but could be extended to a distributed system by defining appropriate SplitChan instances for some network "channel".

Haskell Actors (28 Oct)

Martin Sulzmann wishes he'd named his actor package “multi-headed-actor”. With the recent interest in actor style concurrency in Haskell, there may be some confusion about the various packages that are out there. The point in Martin's library is being to pattern match over multiple events in the message queue, which makes it easier elegantly express ideas like a marketplace actor which matchmakes buyer/seller messages. While Martin's library is built on concurrent channels, it could be adapted to use distributed channels provided by haskell-mpi or Cloud Haskell. See the paper for more information Actors with Multi-Headed Message Receive Patterns.

More concurrency

stm-stats: Retry statistics for STM transaction (9 Oct)

Joachim Breitner blogged about the stm-stats package, which provides wrappers around atomically to track how often a transaction was initiated and how often it was retried. The stm-stats library is used interally by Factis research, but recently released to the wider Haskell community. In fact, Factis have recently hired Joachim to help them contribute back to the Free Software community where possible. So, thanks, Factis and congratulations, Joachim!

How to deal with concurrent external events? (11 Oct)

Apfelmus has been scratching his head over a design problem for event-based frameworks such as GUI libraries: how do you deal with events that occur while you are currently handling another event? Apfelmus gave a simple wxHaskell demonstrator illustrating the problem, (A) reacting to an event while handling another one may expose internal invariants but (B) reacting to an event after finishing another one may render it “impossible”, i.e. it should not have happened in the first place. Any thoughts on the dilema?

Concurrency And Foreign Functions In The Glasgow Haskell Compiler (24 Oct)

Leon P. Smith posted an overview of the interaction between Haskell concurrency and FFI calls in GHC. Leon's post walks us through some the basic concepts: capabilities, Haskell threads, OS threads, and bound threads. This could be good place to start before delving into papers or library documentation.

iteratee-stm (4 Nov)

John Lato announced the new iteratee-stm library recently uploaded to Hackage. Iteratee-stm provides an iteratee interface that uses bounded TChans for communication. This makes it simple to run IO in a separate thread from processing.

Parallelism

Automatic deparallelization (17 Nov)

Ken Takusagawa explored a different perspective on parallelism. Instead of adding parallelism to programs, what if we started with too much parallelism and stripped it away to fit reality?

Consider always writing code in a style using egregious fine grained parallelism: assume lots of cores with no communication latency and no overhead. It is the compiler's job to deparallelize (unparallelize, serialize) the program to run on the actual number of cores available, taking into account communication latency and the overhead of parallelization

Oh, and [qkhsskbg]

Introducing Speculation (22 Jul 2010)

Recently, I got a chance to catch up with Edward Kmett, getting my mind twisted into delightful funny shapes in the process. Edward mentioned his speculation library, yet more parallelism in Haskell! The library is based on the paper Safe Programmable Speculative Parallelism by Prakash Prabhu et al. It provides a way to parallelise inherently sequential algorithms (eg. lexing, Huffman decoding) by guessing the value of intermediate results. You start working in parallel to build work off the guess, only discarding it if the guess turns out to be wrong later on. Check out Edward's blog and slides for more details.

Quasicrystals as sums of waves in the plane (24 Oct)

Keegan McAllister posted an somewhat hypnotic animation of quasicrystals. His post comes with complete source code for his program using the Repa parallel arrays library. Repa was useful to Keegan because it provides
- Immutable arrays, supporting clean, expressive code
- A fast implementation, including automatic parallelization
- Easy output to image files, via repa-devil

Simple library for CAS posted (7 Dec)

Ryan Newton released IORefCAS, which provides a drop-in replacement for atomicModifyIORef that takes advantage of the new casMutVar# primop from GHC 7.2. Ryan says that “[b]ecause it's an easy change it might be worth trying that for hot IOrefs in your parallel app.”
OpenCL 10.2.2 (23 Nov)

Luis Cabellos has updated the Haskell OpenCL package with better documentation and improved error handling using Control.Exception instead of Either error.

Mailing list discussions

Help wanted

Alternative STM implementation (10 Dec)

Daniel Waterworth shared his alternative STM implementation, written as a learning exercise. Could anybody provide Daniel with some criticism and pointers to STM benchmarks for comparison? His replacement should drop right in if you only use TVars.

Parallel Matrix Multiplication (10 Dec)

Mukesh Tiwari is trying to teach himself parallel Haskell (welcome!). He's gone through Real World Haskell and the tutorial by Simon Peyton-Jones and Satnam Singh, but now trying to implement a parallel matrix multiplication function, he finds himself with no sparks converted. Can anybody give Mukesh a hand?

Mukesh also asked about resources for Parallel Haskell, which would be where I come in. Mukesh, have a look at the parallel Haskell portal: http://www.haskell.org/haskellwiki/Parallel

Cloud Haskell

Cloud Haskell now on Hackage (27 Oct)

Jeff Epstein's announcement that he had uploaded “remote” to Hackage was greeted with joy and a somewhat lengthy discussion on package/module naming. It looks like the modules will be moved from 'Remote' to 'Control.Distributed.Actor' or 'Control.Distributed.Process' to match the approach used for the concurrency packages. The final package name seems to be distributed-process.

Haskell Cloud and Closures (1 Oct)

Fred Smith gave Cloud Haskell a try, using it to remotely compute the plus function. Now he wants to be able to send a function to a remote host, no matter if the function is locally declared or at the top level. Erik de Castro Lopo replied that this was a known limitation with the only known workaround being to move the required function to the top-level. Chris Smith pointed out that while the current restrictions may be too tight, there is good reason to have them. As for alternatives approaches to serialising functions, David Barbour suggested maybe looking at the tangible values work by Conal Elliot.

Feedback on Cloud Haskell transport layer interface (2 Nov)

As I mentioned in the Parallel GHC Project update, we've been looking quite a bit into Cloud Haskell lately. Duncan Coutts posted a request for feedback on the design for a Cloud Haskell transport layer interface. We're hoping one day to make use of Cloud Haskell on for high performance computing on clusters. To do this, we hope to develop a robust Cloud Haskell implementation sitting on top of a swappable transport layer, for example, an IP transport, or a single-node multi-process transport using forks and pipes.

One issues that emerged from the discussion is how to deal with potentially a plethora of paramaters (eg. buffered vs eager? ordered? reliable?) associated with connection/endpoint creation. It doesn't help that each connection type may have its own set of parameters. Is it enough to be able to set and forget them during transport session initialisation, or is it essential for Cloud Haskell be able to set these parameters differently for different connections in the same session?

Parallel Haskell in industry (7 Nov)

Sébastien Lannez also got a chance to try out Cloud Haskell. The remote package uploaded by Jeff seems to work well and — dabblers take note — the examples shipped with the code are very easy to adapt. Before digging deeper, Sébastien wanted to know more about
1. performance limitations
2. communication requirements/overheads
3. stability
4. already developed applications
Jeff cautioned that while he thinks Cloud Haskell could be a good platform to develop distributed applications, it's still very much research software and a work in progress. Don't stake your company on Cloud Haskell just yet.

That said, Duncan Coutts added, we are pretty happy with the design and optimistic about developing a robust implementation, because we can build it as an ordinary Haskell library without requiring tricky extensions to the runtime system. As for Sébastien's fourth question, a couple of Parallel GHC Project partners are rather keen on Cloud Haskell. We are working on the implementation and will hopefully have more to report on performance, overheads and other issues we encounter.

Multicore performance

SMP parallelism increasing GC time dramatically (5 Oct)

It takes a village to tune a program. Tom Thorne has a program with a function does some fairly intensive calculations on with hmatrix. When Tom tries to get some simple parallelism on his 12 core machine, replacing a map with a parMap rdeepseq, he finds GC time going through the roof, from 1s (1.7%) to 248s (40.5%). Is the big scary number just an artefact of how GC time is reported, or is something really wrong?

ThreadScope is a good first response here and Tom was duly nagged by the community. Tom promises to give it a go, although the last time he tried, the event log output produced about 1.8GB, and then crashed. The ThreadScope team would love to get hold of any hints about reproducing the crash.

Ryan Newton observed that GC aside, the program does not appear to be scaling; the mutator time itself isn't going down with parallelism. Tom improved the parallelism a bit, breaking the work into chunks and spreading it around more evenly, and provided he disables the parallel GC, it turns much faster and outperforms the sequential version. Having loads of RAM to play and code that doesn't use much memory, Tom then tried telling the RTS to perform GC less often. This worked. Increasing the minimum allocation area size from its default 512K with +RTS -A32M allows Tom to get performance with the parallel GC comparable to that without. Hooray! But there's still this little problem… now Tom's program intermittently segfaults. Getting a bug report out of this may take a while though as Tom attempts to boil it down.

Meanwhile, Oliver Batchelor offered his experience that enabling more threads than he has cores makes his program get drastically slower. Alexander Kjeldaas and Austin Seipp commented that this is due to GC needing to co-ordinate with blocked threads, and that the problem of oversaturating is well known. There's also the "dreaded last core slowdown" bug which once affected Linux users but seems to have gone away in recent Linux/GHC.

AMD Bulldozer modules and Haskell parallelism (13 Oct)

Herbert Valerio Riedel has been eyeing the AMD FX-8120 Bulldozer processor. Bulldozer cores are not independent from each other, but grouped into pairs. So Herbert wanted to know how this might affect Haskell parallelism; would 8 cores really mean 8 or just 4 with slightly better SMT capability? Simon Marlow does not know (benchmarks). Duncan Coutts believes that it should be all fine as the pairing is not at all like hyperthreading.

Estimating contention on an IORef hammered with atomicModifyIORef (27 Oct)

Ryan Newton starts us off with a hypothetical scaling bottleneck: all threads frequently accessing a single IORef using atomicModifyIORef (Data.IORef). This is commonly understood to be likely a bad idea, but how do we go about measuring just how bad it is? This sort of design appears in monad-par, as pointed out by Johan Tibbell, in the GHC IO manager, so it would be good to know how much it really hurts. (See also Ryan's IORefCAS package which seems to be partly a result of this discussion)

One approach is to use GHC events to count operations on particular IORefs, then put that through a model that reports whether if the IORef is being used acceptably, or is "hot". Duncan Coutts suggests a simple way to get partway there: stick something like a traceEvent "IORef #3" on each use of atomicModifyIORef and do something like a ghc-events show | grep IORef to at least get an idea which IORefs are hotter than others and some orders of magnitude. We'll hear back from Ryan when he's had a chance to try it.

Also for the interested, it's worth mentioning that GHC 7.4 will be sporting a new and improved traceEvent, this time exported through Debug.Trace and offering versions for use in pure code and IO both.

Way to expose BLACKHOLES through an API? (7 Nov)

A BLACKHOLE in GHC acts as a placeholder for a thunk that is currently being evaluated. When the thunk is forced, GHC replaces it with a BLACKHOLE object, which it later replaces when it has the evaluation result. In a parallel/concurrent setting, it may happen that two threads are trying to evaluate the same thunk at the same time. In that case, the first thread creates the blackhole, which the second thread notices and blocks on until the evaluation result is available.

Ryan Newton observes that this blocking is implicit, whereas “[w]hen implementing certain concurrent systems-level software in Haskell it is good to be aware of all potentially blocking operations”. He proposes a mechanism to expose blackholes, for example with a evaluateNonblocking :: a -> IO (Maybe a) that returns Nothing if the value is blackholed. Simon Marlow points out that this may be slightly problematic as thunks depend on each other and “you might be a long way into evaluating the argument and have accumulated a deep stack before you encounter the BLACKHOLE” See the discussion for a counter-proposal.

Data structures and concurrency

Efficient mutable arrays in STM (25 Oct)

Ben Franksen has large arrays (millions of elements) with mostly small elements (Int or Double) and largely chunk-wise access patterns. The current implementation of Control.Concurrent.STM.TArray as Array ix (TVar e) is not nearly efficient enough for his use case. A more efficient implementation would be most welcome, but for now Ben is eyeing Data.Vector.Unboxed from the vector package instead. The idea is to use unsafeIOToSTM to provide shared transactional access to his arrays. Ben thinks he can live with the consequences: IO code being rerun, aborting, and inconsistent views.

But does the STM transaction actually "see" that he changed part of the underlying array so that the transaction is retried? If not, how does he go about manually implementing this behaviour? Antoine Latter reports that no unsafeIOToSTM is not transactional - IO actions will be performed immediately and are not rolled back, and are then re-performed on retry. David Barbour and Ketil Malde suggested possible implementations, either keeping an extra TVar Int for every chunk in the array, or (B) cleaner and safer: create a “chunked” TArray that works with fixed-width immutable chunks in a spine.

Another issue that came up is that transactions scale quadratically with the number of TVars touched. Bryan O'Sullivan and Ryan Ingram explained that this is due to choice of data structure (a list) for the STM transaction log, and should be easy to fix.

High performance threadsafe mutable data structures in Haskell? (27 Oct)

Ryan Newton wanted to know if anybody else was working on threadsafe mutable data structures in Haskell. He and the monad-par team were planning to replace their work stealing deques with something more efficient. If anybody else is working in the same general area, teaming up would be great!

Ryan will be exploring both a pure Haskell approach and one based on wrapping foreign data structures with the FFI. Ultimately, Ryan is aiming for an "abstract-deque" parameterizable interface that abstracts over many variants (bounded/unbounded, concurrent/non-concurrent, single/1.5/double-ended, etc). His current prototype makes use of phantom types and the type families extension to handle all this abstraction, with the intended end result being that someone can create a new queue by setting all the switches on the type (eg. q :: Deque NT T SingleEnd SingleEnd Bound Safe Int <- newQ), but this brings up a set of Haskell language and type system questions. More details in the thread!

Persistent Concurrent Data Structures (1 Nov)

Like Ryan, Dmitri Kondratiev is interested in concurrent mutable data structures, but this time with persistence to boot. His goal is to program at a higher level of abstraction, avoiding the detail bloat that would result from directly using some data storage API (eg. SimpleDB). Dmitri's idea: a module tree of data structures mirroring Data.List, Data.Map, etc but with concurrency and persistence. One would be able to configure through the type interfaces:
1. media to persist data (file? DBMS?)
2. caching policy
3. concurrency configuration (optimistic/pessimistic locking?).
Dmitri's post prompted some suggestions for packages to look into:
- safecopy: addresses both the issues of serializing the data and migrating it when the datastructure changes
- acid-state: builds on top of safecopy to add a notion of transactions to any Haskell data structure
- TCache: a transactional cache with configurable persistence
- Haskell web server frameworks (eg. Yesod, Happstack [acid-state was formerly happstack-state]), as some come with persistence support
Jeremy Shaw and David Barbour had reservations about what Dmitri had in mind when he said "concurrent". How would he deal with transaction boundaries, and would a concurrently modified Data.List variant still be a list? Evan Laforge also expressed skepticism about the viability of abstracting over data stores with potentially very different needs.

Threads, blocking

Waiting on input with hWaitForInput' orthreadWaitRead' (17 Oct)

Jason Dusek would like to use evented I/O for a proxying application, in particular, to fork a thread for each new connection and then to wait for data on either socket in this thread, writing to one or the other socket as needed. He's found two functions which could help, System.IO.hWaitForInput and Control.Concurrent.threadWaitRead but each comes with some difficulties. Is there something like select() that works with handles rather than file descriptors?

Ertugrul Soeylemez suggested an alternative approach, just plain Concurrent Haskell because “[a] hundred Haskell threads reading from Handles are translated to one or more OS threads using whatever polling mechanism (select(), poll(), epoll) your operating system supports”. He pasted a small echo server to demonstrate the idea. It wasn't entirely clear for Jason how to apply this to a proxy server. Jason has a lazyBridge :: Handle -> Handle -> IO () which writes everything it reads from one handle into the other and vice-versa, but it blocks and does not allow packets to go back and forth. Gregory Collins sketched out a possible solution: how about forkIOing two threads (one for the read end, one for the write end), with a loop over lazy I/O? This works, but is still somewhat surprising.

System calls and Haskell threads (3 Nov)

Andreas Voellmy noticed this in Kazu Yamamoto's Monad Reader article on a high performance web server.

When a user thread issues a system call, a context switch occurs. This means that all Haskell user threads stop, and instead the kernel is given the CPU time.

Can that be right? Andreas thought, and Johan Tibell confirms, that when a Haskell thread is blocking a particular OS threads, other Haskell threads can continue run concurrently on other OS threads on other CPUs (see Extending the Haskell Foreign Function Interface with Concurrency).

Further clarification comes from David Barbour, who points out why Kazu's original statement was correct in the context of the article. While Mighttpd uses Haskell threads for concurrency; it does not go the traditional route of using the RTS -Nx argument to generate OS threads. Instead it gets its parallelism from a "prefork" model that creates separate processes to balance user invocations (each process may itself be running multiple Haskell threads). This unusual approach is chosen to avoid issues with garbage collection.

Where threadSleep is defined? (6 Dec)

Dmitri Kondratiev was looking for a function to make the current process (executing thread) go to sleep for a given time. Felipe Almeida Lessa pointed to the threadDelay function in Control.Concurrent.

Stack Overflow and Reddit

Help and Feedback

If you'd like to make an announcement in the next Haskell Parallel Digest, then get in touch with me, Eric Kow, at parallel@well-typed.com. Please feel free to leave any comments and feedback!

Bikeshed image by banlon1964 available under a CC-NC-ND-2.0 license.