A Cloud Haskell Appetiser (Parallel Haskell Digest 11)

Sat, 21 Jul 2012 07:45:59 GMT, by eric.
Filed under ph-digest, parallel.

Hello Haskellers! We mentioned in the last digest that we'd have just a tiny bit more to say about Parallel Haskell. As promised, here is the completed word of the month on actors and their use in Cloud Haskell. It so happens (what a coincidence!) that Well-Typed's Edsko de Vries has recently published a beta version of the new distributed-process implementation on Hackage. We'd love it if you could give it a try and let us know about any trouble you run into or ways we could improve things. To help push things along a bit, this word of the month will be using the new distributed-process implementation.

Also, have you had a chance to fill out the Parallel Haskell Digest Survey? It's collecting data for another couple of weeks. Anything you can tell us in the survey will inform future efforts in building the Haskell community, so if you've got a couple of minutes before Cloud Haskell Time, head over to

Parallel Haskell Digest Survey

Many thanks!

Word of the month

The word of the month series has given us a chance to survey the arsenal of Haskell parallelism and concurrency constructs.

The Haskell approach has been to explicitly recognise the vastness of the parallelism/concurrency space, in other words, to provide a multitude of right tools for a multitude of right jobs. Better still, the tools we have are largely interoperable, should we find ourselves with jobs that don't neatly fit into a single category.

The Haskell of 2012 may be in a great place for parallelism and concurrency, but don't think this is the end of the story! What we've seen so far is only a snapshot of the technology as it hurtles through the twenty-tens (How quaint are we, Future Haskeller?). While we can't say what exactly the future will bring, we can look at one of the directions that Haskell might branch into in the coming decade. The series so far has focused on things you might do with a single computer, using parallelism to speed up your software, or using concurrency abstractions to preserve your sanity in the face of non-determinism. But now what if you have more than one computer?

Actors

Our final word of the month is actor. Actors are not specific to distributed programming; they are really more of a low-level concurrency abstraction on a par with threads. And they certainly aren't new either. The actor model has been around since the 70s at least, and has been seriously used for distributed programming since the late 80s with Erlang. So what makes an actor an actor? Let's compare with threads:

Actor | Thread
----- | ------
can create more actors | can create more threads
can have private local state | can have private local state
has NO shared state (isolated from other actors!) | has limited shared state
communicates with other actors via asynchronous message passing | communicates with other threads via shared variables

The essential difference between actors and threads is the isolation and message passing. There aren't any holes punched into lids here, but you can always shine a message from one jam jar to another, perhaps hoping they send you one of their own. The appeal of actors is thus a kind of simplicity, where avoiding shared state eliminates a class of concurrency bugs by definition, and where each actor can be reasoned about in isolation from its brethren.
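To make the comparison a little more concrete, here is a minimal single-machine sketch of the actor idea using nothing but plain threads from base. It is only an analogy, not Cloud Haskell: the names Msg, counter and runDemo are invented for illustration, and the "mailbox" is an ordinary Chan. The point is that the count lives only inside the actor's loop, so the outside world can change or read it only by sending a message.

```haskell
module Main where

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM_)

-- Messages the counter actor understands (hypothetical protocol).
data Msg
    = Increment          -- fire and forget
    | Report (MVar Int)  -- carries a slot for the reply
    | Stop               -- ask the actor to finish

-- The actor: its state is the loop argument n, which no other thread
-- can touch directly.
counter :: Chan Msg -> Int -> IO ()
counter mailbox n = do
    msg <- readChan mailbox
    case msg of
        Increment    -> counter mailbox (n + 1)
        Report reply -> putMVar reply n >> counter mailbox n
        Stop         -> return ()

-- Send three increments, ask for the total, then shut the actor down.
runDemo :: IO Int
runDemo = do
    mailbox <- newChan
    _ <- forkIO (counter mailbox 0)
    replicateM_ 3 (writeChan mailbox Increment)
    reply <- newEmptyMVar
    writeChan mailbox (Report reply)
    total <- takeMVar reply
    writeChan mailbox Stop
    return total

main :: IO ()
main = runDemo >>= print  -- prints 3
```

Because the mailbox is first-in-first-out and there is a single sender, the Report message is handled only after the three increments, which is why the reply is 3.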

This sort of thing may perhaps strike a chord with us functional programmers, and actually, there is quite a bit of actor-related work in Haskell: a handful of packages offering the actor as a concurrency primitive; Martin Sulzmann's multi-headed twist on the model; and Communicating Haskell Processes, exploring an actor-ish cousin known as CSP. Finally, there's Cloud Haskell, which, in explicit homage to Erlang, applies the actor model to distributed programming.

Glimpse of Cloud Haskell

We'll be taking a quick look at Cloud Haskell in this word of the month, unfortunately with only the most fleeting of glimpses. If squirting money between bank accounts is the transactional hello world, playing ping pong must surely be its distributed counterpart. Before working up to that, we first start with half a hello. The following example creates three processes — “process” is the Erlang-inspired word for the actor here — one which receives Ping messages and just prints them to screen, one which sends a single Ping message, and finally one which fires up the first two processes:

{-# LANGUAGE DeriveDataTypeable #-}
module Main where

import Control.Concurrent ( threadDelay )
import Data.Binary
import Data.Typeable
import Control.Distributed.Process
import Control.Distributed.Process.Node
import Network.Transport.TCP

-- Serializable (= Binary + Typeable)
data Ping = Ping deriving (Typeable)

instance Binary Ping where
    put Ping = putWord8 0
    get      = do { getWord8; return Ping }

server :: ReceivePort Ping -> Process ()
server rPing = do
    Ping <- receiveChan rPing
    liftIO $ putStrLn "Got a ping!"

client :: SendPort Ping -> Process ()
client sPing =
    sendChan sPing Ping

ignition :: Process ()
ignition = do
    -- start the server
    sPing <- spawnChannelLocal server
    -- start the client
    spawnLocal $ client sPing
    liftIO $ threadDelay 100000 -- wait a while

main :: IO ()
main = do
    Right transport <- createTransport "127.0.0.1" "8080"
                            defaultTCPParameters
    node <- newLocalNode transport initRemoteTable
    runProcess node ignition

This little package gives us a chance to look at three big pieces of Cloud Haskell: the Serializable typeclass, the Process monad, and channels.

Serializable

Actors send messages to each other. As programmers, we see the messages in nice high-level form (e.g. Ping), but somewhere along the way, these messages are going to have to be encoded into something we can ship around on a network. Cloud Haskell makes this encoding explicit, but reasonably convenient at the same time. Things can be messages if they implement the Serializable typeclass, which is done indirectly by implementing Binary and deriving Typeable. You won't be starting from scratch, as implementations are already provided for primitives and some commonly used data structures.

Things which don't make sense as messages are deliberately left unserializable, for example MVar and TVar, which are only meaningful in the context of threads with a shared memory. Our Cloud Haskell program is perfectly free to use these constructs within processes (or between processes on the same machine; a bit more on that below), just not to ship them around.
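Ping carries no data, so its Binary instance is about as small as they come. As a rough sketch of what a message with a payload might look like, here is a made-up Greeting type (not part of the example above) with a hand-written Binary instance that leans on the existing instances for String and Int. Since Serializable boils down to Binary plus Typeable, we can check the encoding with a plain encode/decode round trip, no Cloud Haskell required.

```haskell
{-# LANGUAGE DeriveDataTypeable #-}
module Main where

import Data.Binary (Binary (..), decode, encode)
import Data.Typeable (Typeable)

-- Hypothetical message type, invented for illustration.
data Greeting = Greeting String Int
    deriving (Eq, Show, Typeable)

-- Write the fields in order, and read them back in the same order.
instance Binary Greeting where
    put (Greeting name count) = put name >> put count
    get = do
        name  <- get
        count <- get
        return (Greeting name count)

-- Encode to a lazy ByteString and decode again.
roundTrip :: Greeting -> Greeting
roundTrip = decode . encode

main :: IO ()
main = print (roundTrip (Greeting "hello" 3))  -- prints Greeting "hello" 3
```

The one thing to keep straight is that get must read fields in exactly the order put wrote them; the types alone won't catch two same-typed fields being swapped.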

Process

We use “process” to mean “actor” in much the same way Erlang does, in other words nothing nearly so heavy as an operating system process. One difference from Erlang, however, is that Cloud Haskell allows for both actor-style concurrency and the thread-based approach. The infrastructure gears you towards using the actor model when talking across machines, but on the same machine, you could also conveniently do things the old way. Want to use STM to pass notes between processes? Fine, just spawn them locally via spawnLocal and give them a common TVar.

As for the Process monad, we see again the idea of using a special monad for special kinds of sequencing. Here the idea is that things like sending/receiving messages or spawning other processes only make sense for processes, and so you can only do these things in a “process context”. Process implements MonadIO, though, so any input/output you'd like to do within a process is merely a liftIO away. Going the other way, running a process from IO, you would do with the runProcess function.
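Here is a small sketch of the "old way" living inside Cloud Haskell: two locally spawned processes sharing a TVar, with liftIO taking us from Process into IO and runProcess taking us back the other way. The helper names worker, sharing and runSharing are invented for illustration, and the createTransport call simply copies the one in the example above; treat that call as an assumption about the beta API, since other releases of network-transport-tcp may spell it differently.

```haskell
module Main where

import Control.Concurrent.STM
    (TVar, atomically, check, modifyTVar', newTVarIO, readTVar, readTVarIO)
import Control.Distributed.Process (Process, liftIO, spawnLocal)
import Control.Distributed.Process.Node
    (LocalNode, initRemoteTable, newLocalNode, runProcess)
import Network.Transport.TCP (createTransport, defaultTCPParameters)

-- Each worker bumps the shared counter once (hypothetical helper).
worker :: TVar Int -> Process ()
worker counter = liftIO . atomically $ modifyTVar' counter (+ 1)

-- Spawn two local workers, then block until both have reported in.
sharing :: TVar Int -> Process ()
sharing counter = do
    _ <- spawnLocal (worker counter)
    _ <- spawnLocal (worker counter)
    liftIO . atomically $ do
        n <- readTVar counter
        check (n == 2)  -- retries until both workers are done

-- From IO into Process and back out again.
runSharing :: LocalNode -> IO Int
runSharing node = do
    counter <- newTVarIO 0
    runProcess node (sharing counter)
    readTVarIO counter

main :: IO ()
main = do
    -- Same transport setup as the Ping example, on a different port.
    Right transport <- createTransport "127.0.0.1" "8081"
                            defaultTCPParameters
    node <- newLocalNode transport initRemoteTable
    runSharing node >>= print
```

This only works because both workers run on the same node. A TVar is not Serializable, so the same trick cannot be used with a process on another machine.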

Channels

Cloud Haskell provides a notion of channels (somewhat similar to those we introduced in the last word of the month), typed unidirectional pipelines that go from one process to another. Using them is optional (there are simpler ways to bop messages back and forth), but worth trying out for the promise of sending messages only to processes that will understand them. Below is a quick glance at channels in action:

data SendPort a     -- Serializable
data ReceivePort a  -- NOT Serializable
newChan     :: Serializable a => Process (SendPort a, ReceivePort a)
sendChan    :: Serializable a => SendPort a -> a -> Process ()
receiveChan :: Serializable a => ReceivePort a -> Process a

A channel comes with a send and receive port, both of which are parameterised on the same type variable. Creating a Ping channel thus gives a ReceivePort Ping out of which only Ping's will ever emerge, and a SendPort Ping into which we can only put Ping's. This looks a lot more attractive when you work with multiple channels. Replying to pings with pongs, for example, would require us to create a second channel with a send and a receive port of its own, which means we now have four ports to juggle! Having the type distinctions makes things a bit clearer: SendPort Ping vs ReceivePort Ping vs SendPort Pong vs ReceivePort Pong.

Finally, it's worth noticing that SendPort's are themselves Serializable, meaning that they can be copied and shipped around to other processes, possibly on other computers. This allows a channel to accept data from more than one place, and also makes for idioms like including a reply-to SendPort in your messages. ReceivePort's, on the other hand, are (deliberately) left unserializable, which leaves them tied to a single computer.
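As a small illustration of a channel accepting data from more than one place, here is a sketch in which two locally spawned processes hold the same SendPort while a single process reads from the ReceivePort. The names collect and runCollect are made up for this sketch, and the transport setup is copied from the example above (so the same caveat about the beta API applies). Everything here stays on one node, so the send port is simply shared rather than serialized; across nodes it would travel inside a message instead.

```haskell
module Main where

import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Distributed.Process
    (Process, liftIO, newChan, receiveChan, sendChan, spawnLocal)
import Control.Distributed.Process.Node
    (LocalNode, initRemoteTable, newLocalNode, runProcess)
import Data.List (sort)
import Network.Transport.TCP (createTransport, defaultTCPParameters)

-- One receive port, two senders holding the same send port.
collect :: Process [String]
collect = do
    (sPort, rPort) <- newChan
    _ <- spawnLocal (sendChan sPort "from A")
    _ <- spawnLocal (sendChan sPort "from B")
    first  <- receiveChan rPort
    second <- receiveChan rPort
    return (sort [first, second])  -- arrival order is not guaranteed

-- runProcess only runs a Process (), so hand the result back in an MVar.
runCollect :: LocalNode -> IO [String]
runCollect node = do
    result <- newEmptyMVar
    runProcess node (collect >>= liftIO . putMVar result)
    takeMVar result

main :: IO ()
main = do
    Right transport <- createTransport "127.0.0.1" "8082"
                            defaultTCPParameters
    node <- newLocalNode transport initRemoteTable
    runCollect node >>= mapM_ putStrLn
```

Note that the channel's type is fixed by its first use: once a String goes into sPort, only Strings can ever come out of rPort.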

Ping? What happened to Pong?

Our little example was more “hello wo” than “hello world”; we'd only managed to send a Ping without even thinking about sending Pong's back. Want to try your hand at Cloud Haskell? Here's a great opportunity!

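1. [Easy] Start with a cabal install distributed-process and make sure you can run this example. The example also imports Network.Transport.TCP, which comes from the separate network-transport-tcp package, so you may need to install that too. Note that you'll need GHC 7.4.1 or later for this.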

2. [Less easy] Next, add a new Pong message (as a separate data type), extending the server to send this message back, and the client to receive that reply. There are some puzzle pieces to work through here. How does the server know where to send its replies? Moreover, how do we keep the server nice and decoupled from the client? We want it to receive pings from any client, and send a reply back to the ping'er (and not just some hard-coded client). Hint: you can solve this without touching ignition or main. Remember that SendPort is Serializable!

3. [Easy] You now have a single ping/pong interaction. Can you make the game go back and forth indefinitely (or until the threadDelay ends)? Hint: have a look at Control.Monad; it's not essential, but it's a bit nicer.

Conclusion

Stepping back from the technology a bit, we have introduced the notion of actors as a concurrency abstraction on a par with threads. While there's nothing that makes them specific to distributed programming, they do seem to fit the problem nicely and have been used to great effect before. Cloud Haskell is one attempt to apply this actor model, taking some of the ideas from Erlang, and combining them with Haskell's purity and type system.

You might notice that in a word of the month about distributed programming, we've kept things on a single machine, alas! Indeed, we have not been able to do Cloud Haskell justice in this article, but we have hopefully laid some foundations by introducing some of the basic layers: Serializable messages, processes, and channels. To escape from one-machine-island, we would need to get to grips with two more concepts: nodes and closures.

Nodes can basically be thought of as separate machines (you could run multiple nodes on the same machine if you wanted to, say for development purposes). This makes for three layers: nodes (machines), which contain processes (actors), which can run any number of threads they want. We saw how processes can communicate by sending each other messages across channels; what we've left out is the crucial detail of what happens when the processes live on different nodes. The good news here is “nothing special”, still messages across channels. The bad news is a bit of infrastructural fiddliness setting up the nodes in the first place, assigning them to roles, and spawning remote processes… for which we need to know about closures.

The basic story with closures is that we need to be able to send functions back and forth in order to do anything really useful with Cloud Haskell, and to send functions we need to say how they are Serializable. This would be easy enough (assume for now that all nodes are running the same code and just send “run function foo” style instructions) were it not for the fact that Haskellers do all sorts of crazy things with functions all the time (partially applying them, returning them from other functions…), crazy things that introduce free variables. Expressing the serializability of function-and-its-free-variables was a source of furious head-scratching for a while until somebody hit on the old Henry T. Ford idea: You can have any free variables you want so long as they are a ByteString.

Where to from here? If you're looking for more introductory stuff and have not already seen it, try Simon Peyton Jones's presentation of Cloud Haskell to the Scala community (1h video). Edsko has been hard at work on the distributed-process Haddock, so it's worth checking out when you're ready to roll up your sleeves and get hacking. It'd be a very good idea to have a look at the simplelocalnet backend, which will help you get started with the nitty-gritty node management issues when you start yearning to go distributed. That's the practical stuff, but don't forget to read the Cloud Haskell paper either! The API has some slight differences (for example, ProcessM has since been renamed to Process), but it should be fairly straightforwardly transferable to the new package. It's likely we'll need a wider spectrum of documentation to bring more Cloud Haskellers into the fold (early days, eh?). Hopefully this word of the month will help you get started, and maybe in turn write a blog post of your own? Happy Distributed Haskell'ing!