Cloud Haskell Appetiser: a taste of distributed programming in Haskell

HTML version: http://www.well-typed.com/blog/68

# Cloud Haskell Appetiser (part 2 of Parallel Haskell Digest 11)

Hello Haskellers!

We mentioned in the [last digest][phd-11] that we'd have just a tiny bit more to say about Parallel Haskell. As promised, here is the completed word of the month on *actors* and their use in Cloud Haskell. It so happens — what a coincidence! — that Well-Typed's Edsko de Vries has recently published a beta version of the new [distributed-process][dp-announce] implementation on Hackage. We'd love it if you could give it a try and let us know about any trouble you ran into or ways we could improve things. To help push things along a bit, this word of the month will be using the new distributed-process implementation.

Also, have you had a chance to fill out the Parallel Haskell Digest Survey? It's collecting data for another couple of weeks. Anything you can tell us in the survey will inform future efforts in building the Haskell community, so if you've got a couple of minutes before Cloud Haskell time, head over to the [Parallel Haskell Digest Survey](http://goo.gl/bP2fn). Many thanks!

## Word of the month

The word of the month series has given us a chance to survey the arsenal of Haskell parallelism and concurrency constructs:

* some low-level foundations (sparks and threads),
* three ways to do parallelism (parallel arrays, strategies, dataflow),
* and some concurrency abstractions (locks, transactions, channels).

The Haskell approach has been to explicitly recognise the vastness of the parallelism/concurrency space, in other words, to provide a multitude of right tools for a multitude of right jobs. Better still, the tools we have are largely interoperable, should we find ourselves with jobs that don't neatly fit into a single category.

The Haskell of 2012 may be in a great place for parallelism and concurrency, but don't think this is the end of the story! What we've seen so far is only a snapshot of the technology as it hurtles through the twenty-tens (how quaint are we, Future Haskeller?). While we can't say what exactly the future will bring, we can look at one of the directions that Haskell might branch into in the coming decade. The series so far has focused on things you might do with a single computer, using parallelism to speed up your software, or using concurrency abstractions to preserve your sanity in the face of non-determinism. But now what if you have more than one computer?

### Actors

Our final word of the month is *actor*. Actors are not specific to distributed programming; they are really more of a low-level concurrency abstraction on a par with threads. And they certainly aren't new either. The actor model has been around since the 70s at least, and has been seriously used for distributed programming since the late 80s with Erlang. So what makes an actor an actor? Let's compare with threads:

+----------------------------------+----------------------------------+
| **Actor**                        | **Thread**                       |
+==================================+==================================+
| can create more actors           | can create more threads          |
+----------------------------------+----------------------------------+
| can have private local state     | can have private local state     |
+----------------------------------+----------------------------------+
| has NO shared state              | has limited shared state         |
| (isolated from other actors!)    |                                  |
+----------------------------------+----------------------------------+
| communicates with other actors   | communicates with other threads  |
| via asynchronous message passing | via shared variables             |
+----------------------------------+----------------------------------+
The essential difference between actors and threads is the isolation and message passing. There aren't any holes punched into lids here, but you can always shine a message from one jam jar to another, perhaps hoping they send you one of their own. The appeal of actors is thus a kind of simplicity, where avoiding shared state eliminates a class of concurrency bugs by definition, and where each actor can be reasoned about in isolation from its brethren.

This sort of thing may perhaps strike a chord with us functional programmers, and actually, there is quite a bit of actor-related work in Haskell: a handful of packages offering the actor as a concurrency primitive; Martin Sulzmann's [multi-headed twist][mh-actors] on the model; and [Communicating Haskell Processes][chp], exploring an actor-ish cousin known as CSP. Finally, there's [Cloud Haskell][ch-pdf], which, in explicit homage to Erlang, applies the actor model to distributed programming.

### Glimpse of Cloud Haskell

We'll be taking a quick look at Cloud Haskell in this word of the month, unfortunately with only the most fleeting of glimpses. If squirting money between bank accounts is the transactional hello world, playing ping pong must surely be its distributed counterpart. Before working up to that, we first start with half a hello. The following example creates three processes — “process” is the Erlang-inspired word for the actor here — one which receives `Ping` messages and just prints them to screen, one which sends a single `Ping` message, and finally one which fires up the first two processes:

    {-# LANGUAGE DeriveDataTypeable #-}
    module Main where

    import Control.Concurrent ( threadDelay )
    import Data.Binary
    import Data.Typeable
    import Control.Distributed.Process
    import Control.Distributed.Process.Node
    import Network.Transport.TCP

    -- Serializable (= Binary + Typeable)
    data Ping = Ping
      deriving (Typeable)

    instance Binary Ping where
      put Ping = putWord8 0
      get      = do { getWord8; return Ping }

    server :: ReceivePort Ping -> Process ()
    server rPing = do
      Ping <- receiveChan rPing
      liftIO $ putStrLn "Got a ping!"

    client :: SendPort Ping -> Process ()
    client sPing = sendChan sPing Ping

    ignition :: Process ()
    ignition = do
      -- start the server
      sPing <- spawnChannelLocal server
      -- start the client
      spawnLocal $ client sPing
      liftIO $ threadDelay 100000 -- wait a while

    main :: IO ()
    main = do
      Right transport <- createTransport "127.0.0.1" "8080" defaultTCPParameters
      node <- newLocalNode transport initRemoteTable
      runProcess node ignition

This little package gives us a chance to look at three big pieces of Cloud Haskell: the `Serializable` typeclass, the `Process` monad, and channels.

#### Serializable

Actors send messages to each other. As programmers, we see the messages in nice high-level form (e.g. `Ping`), but somewhere along the way, these messages are going to have to be encoded to something we can ship around on a network. Cloud Haskell makes this encoding explicit, but reasonably convenient at the same time. Things can be messages if they implement the `Serializable` typeclass, which is done indirectly by implementing `Binary` and deriving `Typeable`. You won't be starting from scratch, as implementations are already provided for primitives and some commonly used data structures.
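To get a feel for what that recipe looks like for a message type of your own, here is a minimal sketch. The `Greeting` type and its constructors are made up purely for illustration; only the general Binary-plus-Typeable pattern comes from the article's `Ping` example:

    {-# LANGUAGE DeriveDataTypeable #-}

    import Data.Binary
    import Data.Typeable

    -- Serializable = Binary + Typeable, just as with Ping above.
    -- A tag byte distinguishes the constructors; the String payload
    -- reuses the Binary instance that already exists for String.
    data Greeting = Hello String | Goodbye
      deriving (Typeable)

    instance Binary Greeting where
      put (Hello name) = putWord8 0 >> put name
      put Goodbye      = putWord8 1
      get = do
        tag <- getWord8
        case tag of
          0 -> do { name <- get; return (Hello name) }
          _ -> return Goodbye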
Things which don't make sense as messages are deliberately left unserializable, for example `MVar` and `TVar`, which are only meaningful in the context of threads sharing memory. Our Cloud Haskell program is perfectly free to use these constructs within processes (or shared between processes on the same machine; a bit more on that below), just not to ship them around.

#### Process

We use “process” to mean “actor” in a similar fashion to Erlang; in other words, nothing nearly so heavy as an operating system process. One difference from Erlang, however, is that Cloud Haskell allows for both actor-style concurrency and the thread-based approach. The infrastructure gears you towards using the actor model when talking across machines, but on the same machine you could also conveniently do things the old way. Want to use STM to pass notes between processes? Fine, just spawn them locally via `spawnLocal` and give them a common `TVar`.

As for the `Process` monad, we see again the idea of a special monad for a special kind of sequencing. Here the idea is that things like sending/receiving messages or spawning other processes only make sense for processes, and so you can only do these things in a “process context”. `Process` implements `MonadIO`, though, so any input/output you'd like to do within a process is merely a `liftIO` away. Going the other way, running a process from IO is done with the `runProcess` function.

#### Channels

Cloud Haskell provides a notion of channels (somewhat similar to those we introduced in the last word of the month): typed, unidirectional pipelines that go from one process to another. Using them is optional (there are simpler ways to bop messages back and forth), but worth trying out for the promise of sending messages only to processes that will understand them. Below is a quick glance at channels in action:

    data SendPort a     -- Serializable
    data ReceivePort a  -- NOT Serializable

    newChan     :: Serializable a => Process (SendPort a, ReceivePort a)
    sendChan    :: Serializable a => SendPort a -> a -> Process ()
    receiveChan :: Serializable a => ReceivePort a -> Process a

A channel comes with a send and a receive port, both of which are parameterised on the same type variable. Creating a `Ping` channel thus gives a `ReceivePort Ping` out of which only `Ping`s will ever emerge, and a `SendPort Ping` into which we can only put `Ping`s. This looks a lot more attractive when you work with multiple channels. Replying to pings with pongs, for example, would require us to create a second channel with a send and receive port of its own, which means we now have four ports to juggle! Having the type distinctions makes things a bit clearer: `SendPort Ping` vs `ReceivePort Ping` vs `SendPort Pong` vs `ReceivePort Pong`.

Finally, it's worth noticing that `SendPort`s are themselves Serializable, meaning that they can be copied and shipped around to other processes, possibly on other computers. This allows a channel to accept data from more than one place, and also makes for idioms like including a reply-to `SendPort` in your messages. `ReceivePort`s, on the other hand, are (deliberately) left unserializable, which leaves them tied to a single computer.
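To make the reply-to idiom a little more concrete, here is a minimal sketch. The `Echo` message type and the `echoServer`/`echoClient` names are invented for illustration; only `newChan`, `sendChan` and `receiveChan` come from the API shown above:

    {-# LANGUAGE DeriveDataTypeable #-}

    import Data.Binary
    import Data.Typeable
    import Control.Distributed.Process

    -- A request that carries its own reply-to port.
    data Echo = Echo String (SendPort String)
      deriving (Typeable)

    instance Binary Echo where
      put (Echo s p) = put s >> put p
      get            = do { s <- get; p <- get; return (Echo s p) }

    -- The server never needs to know who is asking; it simply answers
    -- on whatever SendPort arrived inside the message.
    echoServer :: ReceivePort Echo -> Process ()
    echoServer rEcho = do
      Echo s replyTo <- receiveChan rEcho
      sendChan replyTo ("echo: " ++ s)
      echoServer rEcho

    -- The client makes a fresh channel for the reply and tucks its
    -- SendPort end into the request it sends.
    echoClient :: SendPort Echo -> Process ()
    echoClient sEcho = do
      (sReply, rReply) <- newChan
      sendChan sEcho (Echo "hello" sReply)
      answer <- receiveChan rReply
      liftIO $ putStrLn answer

Wiring these two up would look just like `ignition` above: `spawnChannelLocal echoServer` yields the `SendPort Echo` to hand to a locally spawned client.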
### Ping? What happened to Pong?

Our little example was more “hello wo” than “hello world”; we'd only managed to send a `Ping` without even thinking about sending `Pong`s back. Want to try your hand at Cloud Haskell? Here's a great opportunity!

1. **[Easy]** Start with a `cabal install distributed-process` and make sure you can run this example. Note that you'll need GHC 7.4.1 or later for this.

2. **[Less easy]** Next, add a new `Pong` message (as a separate data type), extending the server to send this message back, and the client to receive that reply. There are some puzzle pieces to work through here. How does the server know where to send its replies? Moreover, how do we keep the server nicely decoupled from the client? We want it to receive pings from any client, and send a reply back to the ping'er (and not just some hard-coded client). Hint: you can solve this without touching `ignition` or `main`. Remember that `SendPort` is `Serializable`!

3. **[Easy]** You now have a single ping/pong interaction. Can you make the game go back and forth indefinitely (or until the `threadDelay` ends)? Hint: have a look at `Control.Monad`; it's not essential, but it's a bit nicer.

### Conclusion

Stepping back from the technology a bit, we have introduced the notion of actors as a concurrency abstraction on a par with threads. While there's nothing that makes them specific to distributed programming, they do seem to fit the problem nicely and have been used to great effect before. Cloud Haskell is one attempt to apply this actor model, taking some of the ideas from Erlang and combining them with Haskell's purity and type system.

You might notice that in a word of the month about distributed programming, we've kept things on a single machine, alas! Indeed, we have not been able to do Cloud Haskell justice in this article, but we have hopefully laid some foundations by introducing some of the basic layers: `Serializable` messages, processes, and channels. To escape from one-machine-island, we would need to get to grips with two more concepts, nodes and closures.

Nodes can basically be thought of as separate machines (you could run multiple nodes on the same machine if you wanted to, say for development purposes). This makes for three layers: nodes (machines), which contain processes (actors), which can run any number of threads they want.
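As a small sketch of that layering, and of the fact that nodes are cheap enough to run several of on one machine, here is a variation on the example's `main` (no new API, just the same `createTransport`/`newLocalNode`/`runProcess` calls used twice):

    import Control.Distributed.Process
    import Control.Distributed.Process.Node
    import Network.Transport.TCP

    main :: IO ()
    main = do
      -- Two transports on different ports give us two separate nodes,
      -- even though both happen to live in the same OS process.
      Right t1 <- createTransport "127.0.0.1" "8080" defaultTCPParameters
      Right t2 <- createTransport "127.0.0.1" "8081" defaultTCPParameters
      node1 <- newLocalNode t1 initRemoteTable
      node2 <- newLocalNode t2 initRemoteTable
      -- runProcess blocks until the given process has finished.
      runProcess node1 $ liftIO (putStrLn "hello from node 1")
      runProcess node2 $ liftIO (putStrLn "hello from node 2")

Getting a process on one node to talk to a process on the other is precisely where the remote spawning and closures discussed below come in.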
We saw how processes can communicate by sending each other messages across channels; what we've left out is the crucial detail of what happens when the processes live on different nodes. The good news here is “nothing special”: it's still messages across channels. The bad news is a bit of infrastructural fiddliness in setting up the nodes in the first place, assigning them to roles, and spawning remote processes… for which we need to know about closures.

The basic story with closures is that we need to be able to send functions back and forth in order to do anything really useful with Cloud Haskell, and to send functions we need to say how they are `Serializable`. This would be easy enough — assume for now that all nodes are running the same code and just send “run function `foo`” style instructions — were it not for the fact that Haskellers do all sorts of crazy things with functions all the time (partially applying them, returning them from other functions…), crazy things that introduce free variables. Expressing the serializability of a function and its free variables was a source of furious head-scratching for a while, until somebody hit on the old Henry T. Ford idea: you can have any free variables you want so long as they are a `ByteString`.

Where to from here? If you're looking for more introductory stuff and have not already seen it, try Simon Peyton Jones's presentation of Cloud Haskell to the Scala community ([1h video][ch-skills]). Edsko has been hard at work on the [distributed-process Haddock][dp-haddock], so it's worth checking out when you're ready to roll up your sleeves and get hacking. It'd also be a very good idea to have a look at the [simplelocalnet backend][simple], which will help you get started with the nitty-gritty node management issues when you start yearning to go distributed.

That's the practical stuff, but don't forget to read the [Cloud Haskell paper][ch-pdf] either! The API has some slight differences (for example, `ProcessM` has since been renamed to `Process`), but it should transfer fairly straightforwardly to the new package. It's likely we'll need a wider spectrum of documentation to bring more Cloud Haskellers into the fold (early days, eh?). Hopefully this word of the month will help you get started, and maybe in turn write a blog post of your own?

Happy Distributed Haskell'ing!

[ch-github]: https://github.com/jepst/CloudHaskell
[ch-pdf]: http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/remote...
[ch-skills]: http://skillsmatter.com/podcast/home/haskell-cloud/js-4179
[chp]: http://www.cs.kent.ac.uk/projects/ofa/chp/
[dist-p]: https://github.com/haskell-distributed/distributed-process
[dp-announce]: https://groups.google.com/d/topic/parallel-haskell/dw5UPEg1ePI/discussion
[dp-haddock]: http://hackage.haskell.org/packages/archive/distributed-process/latest/doc/h...
[mh-actors]: http://sulzmann.blogspot.co.uk/2008/10/actors-with-multi-headed-receive.html
[phd-11]: http://www.well-typed.com/blog/67
[simple]: http://hackage.haskell.org/packages/archive/distributed-process-simplelocaln...
[survey]: http://goo.gl/bP2fn

-- Eric Kow http://erickow.com