Cloud Haskell Appetiser: a taste of distributed programming in Haskell

HTML version: http://www.well-typed.com/blog/68

# Cloud Haskell Appetiser (part 2 of Parallel Haskell Digest 11)

Hello Haskellers!

We mentioned in the [last digest][phd-11] that we'd have just a tiny bit more to say about Parallel Haskell. As promised, here is the completed word of the month on *actors* and their use in Cloud Haskell. It so happens — what a coincidence! — that Well-Typed's Edsko de Vries has recently published a beta version of the new [distributed-process][dp-announce] implementation on Hackage. We'd love it if you could give it a try and let us know about any trouble you ran into or ways we could improve things. To help push things along a bit, this word of the month will be using the new distributed-process implementation.

Also, have you had a chance to fill out the Parallel Haskell Digest Survey? It's collecting data for another couple of weeks. Anything you can tell us in the survey will inform future efforts in building the Haskell community, so if you've got a couple of minutes before Cloud Haskell time, head over to the [Parallel Haskell Digest Survey](http://goo.gl/bP2fn). Many thanks!

## Word of the month

The word of the month series has given us a chance to survey the arsenal of Haskell parallelism and concurrency constructs:

* some low-level foundations (sparks and threads),
* three ways to do parallelism (parallel arrays, strategies, dataflow),
* and some concurrency abstractions (locks, transactions, channels).

The Haskell approach has been to explicitly recognise the vastness of the parallelism/concurrency space, in other words, to provide a multitude of right tools for a multitude of right jobs. Better still, the tools we have are largely interoperable, should we find ourselves with jobs that don't neatly fit into a single category.

The Haskell of 2012 may be in a great place for parallelism and concurrency, but don't think this is the end of the story! What we've seen so far is only a snapshot of the technology as it hurtles through the twenty-tens (how quaint are we, Future Haskeller?). While we can't say what exactly the future will bring, we can look at one of the directions that Haskell might branch into in the coming decade. The series so far has focused on things you might do with a single computer, using parallelism to speed up your software, or using concurrency abstractions to preserve your sanity in the face of non-determinism. But now what if you have more than one computer?

### Actors

Our final word of the month is *actor*. Actors are not specific to distributed programming; they are really more of a low-level concurrency abstraction on a par with threads. And they certainly aren't new either. The actor model has been around since the 70s at least, and has been seriously used for distributed programming since the late 80s with Erlang. So what makes an actor an actor? Let's compare with threads:

+----------------------------------+----------------------------------+
| **Actor**                        | **Thread**                       |
+==================================+==================================+
| can create more actors           | can create more threads          |
+----------------------------------+----------------------------------+
| can have private local state     | can have private local state     |
+----------------------------------+----------------------------------+
| has NO shared state              | has limited shared state         |
| (isolated from other actors!)    |                                  |
+----------------------------------+----------------------------------+
| communicates with other actors   | communicates with other threads  |
| via asynchronous message passing | via shared variables             |
+----------------------------------+----------------------------------+
The essential difference between actors and threads is the isolation and message passing. There aren't any holes punched into lids here, but you can always shine a message from one jam jar to another, perhaps hoping they send you one of their own. The appeal of actors is thus a kind of simplicity, where avoiding shared state eliminates a class of concurrency bugs by definition, and where each actor can be reasoned about in isolation from its brethren.

This sort of thing may perhaps strike a chord with us functional programmers, and actually, there is quite a bit of actor-related work in Haskell: a handful of packages offering the actor as a concurrency primitive; Martin Sulzmann's [multi-headed twist][mh-actors] on the model; and [Communicating Haskell Processes][chp], exploring an actor-ish cousin known as CSP. Finally, there's [Cloud Haskell][ch-pdf], which, in explicit homage to Erlang, applies the actor model to distributed programming.

### Glimpse of Cloud Haskell

We'll be taking a quick look at Cloud Haskell in this word of the month, unfortunately with only the most fleeting of glimpses. If squirting money between bank accounts is the transactional hello world, playing ping pong must surely be its distributed counterpart. Before working up to that, we first start with half a hello. The following example creates three processes — “process” is the Erlang-inspired word for the actor here — one which receives `Ping` messages and just prints them to screen, one which sends a single `Ping` message, and finally one which fires up the first two processes:

    {-# LANGUAGE DeriveDataTypeable #-}
    module Main where

    import Control.Concurrent ( threadDelay )
    import Data.Binary
    import Data.Typeable
    import Control.Distributed.Process
    import Control.Distributed.Process.Node
    import Network.Transport.TCP

    -- Serializable (= Binary + Typeable)
    data Ping = Ping
      deriving (Typeable)

    instance Binary Ping where
      put Ping = putWord8 0
      get      = do { getWord8; return Ping }

    server :: ReceivePort Ping -> Process ()
    server rPing = do
      Ping <- receiveChan rPing
      liftIO $ putStrLn "Got a ping!"

    client :: SendPort Ping -> Process ()
    client sPing = sendChan sPing Ping

    ignition :: Process ()
    ignition = do
      -- start the server
      sPing <- spawnChannelLocal server
      -- start the client
      spawnLocal $ client sPing
      liftIO $ threadDelay 100000 -- wait a while

    main :: IO ()
    main = do
      Right transport <- createTransport "127.0.0.1" "8080" defaultTCPParameters
      node <- newLocalNode transport initRemoteTable
      runProcess node ignition

This little package gives us a chance to look at three big pieces of Cloud Haskell: the `Serializable` typeclass, the `Process` monad, and channels.

#### Serializable

Actors send messages to each other. As programmers, we see the messages in nice high-level form (e.g. `Ping`), but somewhere along the way, these messages are going to have to be encoded to something we can ship around on a network. Cloud Haskell makes this encoding explicit, but reasonably convenient at the same time. Things can be messages if they implement the `Serializable` typeclass, which is done indirectly by implementing `Binary` and deriving `Typeable`. You won't be starting from scratch, as implementations are already provided for primitives and some commonly used data structures.
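To get a feel for what that recipe looks like for a message type of your own, here is a minimal sketch. The `Greeting` type and its constructors are made up purely for illustration; only the general Binary-plus-Typeable pattern comes from the article's `Ping` example:

    {-# LANGUAGE DeriveDataTypeable #-}

    import Data.Binary
    import Data.Typeable

    -- Serializable = Binary + Typeable, just as with Ping above.
    -- A tag byte distinguishes the constructors; the String payload
    -- reuses the Binary instance that already exists for String.
    data Greeting = Hello String | Goodbye
      deriving (Typeable)

    instance Binary Greeting where
      put (Hello name) = putWord8 0 >> put name
      put Goodbye      = putWord8 1
      get = do
        tag <- getWord8
        case tag of
          0 -> do { name <- get; return (Hello name) }
          _ -> return Goodbye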
Things which don't make sense as messages are deliberately left unserializable, for example `MVar` and `TVar`, which are only meaningful in the context of threads sharing memory. Our Cloud Haskell program is perfectly free to use these constructs within processes (or shared between processes on the same machine; a bit more on that below), just not to ship them around.

#### Process

We use “process” to mean “actor” in a similar fashion to Erlang; in other words, nothing nearly so heavy as an operating system process. One difference from Erlang, however, is that Cloud Haskell allows for both actor-style concurrency and the thread-based approach. The infrastructure gears you towards using the actor model when talking across machines, but on the same machine you could also conveniently do things the old way. Want to use STM to pass notes between processes? Fine, just spawn them locally via `spawnLocal` and give them a common `TVar`.

As for the `Process` monad, we see again the idea of a special monad for a special kind of sequencing. Here the idea is that things like sending/receiving messages or spawning other processes only make sense for processes, and so you can only do these things in a “process context”. `Process` implements `MonadIO`, though, so any input/output you'd like to do within a process is merely a `liftIO` away. Going the other way, running a process from IO is done with the `runProcess` function.

#### Channels

Cloud Haskell provides a notion of channels (somewhat similar to those we introduced in the last word of the month): typed, unidirectional pipelines that go from one process to another. Using them is optional (there are simpler ways to bop messages back and forth), but worth trying out for the promise of sending messages only to processes that will understand them. Below is a quick glance at channels in action:

    data SendPort a     -- Serializable
    data ReceivePort a  -- NOT Serializable

    newChan     :: Serializable a => Process (SendPort a, ReceivePort a)
    sendChan    :: Serializable a => SendPort a -> a -> Process ()
    receiveChan :: Serializable a => ReceivePort a -> Process a

A channel comes with a send and a receive port, both of which are parameterised on the same type variable. Creating a `Ping` channel thus gives a `ReceivePort Ping` out of which only `Ping`s will ever emerge, and a `SendPort Ping` into which we can only put `Ping`s. This looks a lot more attractive when you work with multiple channels. Replying to pings with pongs, for example, would require us to create a second channel with a send and receive port of its own, which means we now have four ports to juggle! Having the type distinctions makes things a bit clearer: `SendPort Ping` vs `ReceivePort Ping` vs `SendPort Pong` vs `ReceivePort Pong`.

Finally, it's worth noticing that `SendPort`s are themselves Serializable, meaning that they can be copied and shipped around to other processes, possibly on other computers. This allows a channel to accept data from more than one place, and also makes for idioms like including a reply-to `SendPort` in your messages. `ReceivePort`s, on the other hand, are (deliberately) left unserializable, which leaves them tied to a single computer.
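To make the reply-to idiom a little more concrete, here is a minimal sketch. The `Echo` message type and the `echoServer`/`echoClient` names are invented for illustration; only `newChan`, `sendChan` and `receiveChan` come from the API shown above:

    {-# LANGUAGE DeriveDataTypeable #-}

    import Data.Binary
    import Data.Typeable
    import Control.Distributed.Process

    -- A request that carries its own reply-to port.
    data Echo = Echo String (SendPort String)
      deriving (Typeable)

    instance Binary Echo where
      put (Echo s p) = put s >> put p
      get            = do { s <- get; p <- get; return (Echo s p) }

    -- The server never needs to know who is asking; it simply answers
    -- on whatever SendPort arrived inside the message.
    echoServer :: ReceivePort Echo -> Process ()
    echoServer rEcho = do
      Echo s replyTo <- receiveChan rEcho
      sendChan replyTo ("echo: " ++ s)
      echoServer rEcho

    -- The client makes a fresh channel for the reply and tucks its
    -- SendPort end into the request it sends.
    echoClient :: SendPort Echo -> Process ()
    echoClient sEcho = do
      (sReply, rReply) <- newChan
      sendChan sEcho (Echo "hello" sReply)
      answer <- receiveChan rReply
      liftIO $ putStrLn answer

Wiring these two up would look just like `ignition` above: `spawnChannelLocal echoServer` yields the `SendPort Echo` to hand to a locally spawned client.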
### Ping? What happened to Pong?

Our little example was more “hello wo” than “hello world”; we'd only managed to send a `Ping` without even thinking about sending `Pong`s back. Want to try your hand at Cloud Haskell? Here's a great opportunity!

1. **[Easy]** Start with a `cabal install distributed-process` and make sure you can run this example. Note that you'll need GHC 7.4.1 or later for this.

2. **[Less easy]** Next, add a new `Pong` message (as a separate data type), extending the server to send this message back, and the client to receive that reply. There are some puzzle pieces to work through here. How does the server know where to send its replies? Moreover, how do we keep the server nicely decoupled from the client? We want it to receive pings from any client, and send a reply back to the ping'er (and not just some hard-coded client). Hint: you can solve this without touching `ignition` or `main`. Remember that `SendPort` is `Serializable`!

3. **[Easy]** You now have a single ping/pong interaction. Can you make the game go back and forth indefinitely (or until the `threadDelay` ends)? Hint: have a look at `Control.Monad`; it's not essential, but it's a bit nicer.

### Conclusion

Stepping back from the technology a bit, we have introduced the notion of actors as a concurrency abstraction on a par with threads. While there's nothing that makes them specific to distributed programming, they do seem to fit the problem nicely and have been used to great effect before. Cloud Haskell is one attempt to apply this actor model, taking some of the ideas from Erlang and combining them with Haskell's purity and type system.

You might notice that in a word of the month about distributed programming, we've kept things on a single machine, alas! Indeed, we have not been able to do Cloud Haskell justice in this article, but we have hopefully laid some foundations by introducing some of the basic layers: `Serializable` messages, processes, and channels. To escape from one-machine-island, we would need to get to grips with two more concepts, nodes and closures.

Nodes can basically be thought of as separate machines (you could run multiple nodes on the same machine if you wanted to, say for development purposes). This makes for three layers: nodes (machines), which contain processes (actors), which can run any number of threads they want.
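As a small sketch of that layering, and of the fact that nodes are cheap enough to run several of on one machine, here is a variation on the example's `main` (no new API, just the same `createTransport`/`newLocalNode`/`runProcess` calls used twice):

    import Control.Distributed.Process
    import Control.Distributed.Process.Node
    import Network.Transport.TCP

    main :: IO ()
    main = do
      -- Two transports on different ports give us two separate nodes,
      -- even though both happen to live in the same OS process.
      Right t1 <- createTransport "127.0.0.1" "8080" defaultTCPParameters
      Right t2 <- createTransport "127.0.0.1" "8081" defaultTCPParameters
      node1 <- newLocalNode t1 initRemoteTable
      node2 <- newLocalNode t2 initRemoteTable
      -- runProcess blocks until the given process has finished.
      runProcess node1 $ liftIO (putStrLn "hello from node 1")
      runProcess node2 $ liftIO (putStrLn "hello from node 2")

Getting a process on one node to talk to a process on the other is precisely where the remote spawning and closures discussed below come in.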
We saw how processes can communicate by sending each other messages across channels; what we've left out is the crucial detail of what happens when the processes live on different nodes. The good news here is “nothing special”: it's still messages across channels. The bad news is a bit of infrastructural fiddliness in setting up the nodes in the first place, assigning them to roles, and spawning remote processes… for which we need to know about closures.

The basic story with closures is that we need to be able to send functions back and forth in order to do anything really useful with Cloud Haskell, and to send functions we need to say how they are `Serializable`. This would be easy enough — assume for now that all nodes are running the same code and just send “run function `foo`” style instructions — were it not for the fact that Haskellers do all sorts of crazy things with functions all the time (partially applying them, returning them from other functions…), crazy things that introduce free variables. Expressing the serializability of a function and its free variables was a source of furious head-scratching for a while, until somebody hit on the old Henry T. Ford idea: you can have any free variables you want so long as they are a `ByteString`.

Where to from here? If you're looking for more introductory stuff and have not already seen it, try Simon Peyton Jones's presentation of Cloud Haskell to the Scala community ([1h video][ch-skills]). Edsko has been hard at work on the [distributed-process Haddock][dp-haddock], so it's worth checking out when you're ready to roll up your sleeves and get hacking. It'd also be a very good idea to have a look at the [simplelocalnet backend][simple], which will help you get started with the nitty-gritty node management issues when you start yearning to go distributed.

That's the practical stuff, but don't forget to read the [Cloud Haskell paper][ch-pdf] either! The API has some slight differences (for example, `ProcessM` has since been renamed to `Process`), but it should transfer fairly straightforwardly to the new package. It's likely we'll need a wider spectrum of documentation to bring more Cloud Haskellers into the fold (early days, eh?). Hopefully this word of the month will help you get started, and maybe in turn write a blog post of your own?

Happy Distributed Haskell'ing!

[ch-github]: https://github.com/jepst/CloudHaskell
[ch-pdf]: http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/remote...
[ch-skills]: http://skillsmatter.com/podcast/home/haskell-cloud/js-4179
[chp]: http://www.cs.kent.ac.uk/projects/ofa/chp/
[dist-p]: https://github.com/haskell-distributed/distributed-process
[dp-announce]: https://groups.google.com/d/topic/parallel-haskell/dw5UPEg1ePI/discussion
[dp-haddock]: http://hackage.haskell.org/packages/archive/distributed-process/latest/doc/h...
[mh-actors]: http://sulzmann.blogspot.co.uk/2008/10/actors-with-multi-headed-receive.html
[phd-11]: http://www.well-typed.com/blog/67
[simple]: http://hackage.haskell.org/packages/archive/distributed-process-simplelocaln...
[survey]: http://goo.gl/bP2fn

-- Eric Kow http://erickow.com