Cloud Haskell real usage example

Hello everyone. I'm taking my first steps in Cloud Haskell and ran into some unexpected behaviors. I used the code from "Raspberry Pi in a Haskell Cloud" [1] as a first example. I did try to switch the code to use Template Haskell, with no luck, so I stuck with the verbose style.

I changed some of the code: from ProcessId-based messaging to a typed channel to receive the Pong; using "startSlave" to start the worker nodes; and the master node now loops forever, sending pings to the worker nodes.

The unexpected behaviors:

- Dropping a worker node while the master is running makes the master node crash.
- The master node does not see worker nodes started after the master process.

To fix this, I tried to "findSlaves" at the start of the master process and send pings only to those, ignoring the list of NodeId enforced by the type signature of "startMaster". Now the master finds new slaves. The bad thing is that when I close one of the workers, the master process freezes. It simply stops doing anything. No more messages and no more pings to the other slaves. :(

My view of Cloud Haskell usage would be something like this: a master node sending work to slaves; slave instances going up or down based on demand. So the master node should be slave-failure-proof and should also find new slaves somehow.

Am I misunderstanding the big picture of Cloud Haskell or doing anything wrong in the following code?

Code (skipped imports and wiring stuff):

--
newtype Ping = Ping (SendPort Pong)
    deriving (Typeable, Binary, Show)

newtype Pong = Pong ProcessId
    deriving (Typeable, Binary, Show)

worker :: Ping -> Process ()
worker (Ping sPong) = do
    wId <- getSelfPid
    say "Got a Ping!"
    sendChan sPong (Pong wId)

master :: Backend -> [NodeId] -> Process ()
master backend _ = forever $ do
    workers <- findSlaves backend
    say $ "Slaves: " ++ show workers
    (sPong, rPong) <- newChan
    forM_ workers $ \w -> do
        say $ "Sending a Ping to " ++ (show w) ++ "..."
        spawn w (workerClosure (Ping sPong))
    say $ "Waiting for reply from " ++ (show (length workers)) ++ " worker(s)"
    replicateM_ (length workers) $ do
        (Pong wId) <- receiveChan rPong
        say $ "Got back a Pong from " ++ (show $ processNodeId wId) ++ "!"
    (liftIO . threadDelay) 2000000 -- Wait a bit before return

main = do
    prog <- getProgName
    args <- getArgs
    case args of
        ["master", host, port] -> do
            backend <- initializeBackend host port remoteTable
            startMaster backend (master backend)
        ["worker", host, port] -> do
            backend <- initializeBackend host port remoteTable
            startSlave backend
        _ -> putStrLn $ "usage: " ++ prog ++ " (master | worker) host port"
--

[1] http://alenribic.com/writings/post/raspberry-pi-in-a-haskell-cloud

On Tue, Aug 21, 2012 at 9:01 PM, Thiago Negri wrote:
My view of Cloud Haskell usage would be something like this: a master node sending work to slaves; slave instances going up or down based on demand. So the master node should be slave-failure-proof and should also find new slaves somehow.
Am I misunderstanding the big picture of Cloud Haskell or doing anything wrong in the following code?
(Disclaimer: I can't speak for Cloud Haskell's developers.)

AFAIK this is CH's goal. However, they're not quite there yet. Their network implementation is still rather naive, as you're seeing =).

Cheers,

-- Felipe.

On Wed, Aug 22, 2012 at 8:30 AM, Felipe Almeida Lessa <felipe.lessa@gmail.com> wrote:
On Tue, Aug 21, 2012 at 9:01 PM, Thiago Negri wrote:
My view of Cloud Haskell usage would be something like this: a master node sending work to slaves; slave instances going up or down based on demand. So the master node should be slave-failure-proof and should also find new slaves somehow.
Am I misunderstanding the big picture of Cloud Haskell or doing anything wrong in the following code?
(Disclaimer: I can't speak for Cloud Haskell's developers.)
AFAIK this is CH's goal. However, they're not quite there yet. Their network implementation is still rather naive, as you're seeing =).
I believe this behavior is due to the use of a channel; you just need to implement some kind of timeout.
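For example, the master's reply loop could use receiveChanTimeout instead of receiveChan, so a vanished worker only costs one timeout instead of blocking the loop forever. A minimal sketch, assuming receiveChanTimeout from Control.Distributed.Process is available in the version in use (the 5-second value is arbitrary, and collectPongs is just an illustrative name):

-- Sketch: collect up to n Pongs, but never block forever on a dead worker.
collectPongs :: Int -> ReceivePort Pong -> Process ()
collectPongs n rPong = replicateM_ n $ do
    mPong <- receiveChanTimeout 5000000 rPong  -- timeout in microseconds
    case mPong of
      Just (Pong wId) ->
        say $ "Got back a Pong from " ++ show (processNodeId wId) ++ "!"
      Nothing ->
        say "Timed out waiting for a Pong; a worker may have gone away."

The master would then call collectPongs (length workers) rPong in place of its replicateM_ block.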
Cheers,
-- Felipe.

Hi Thiago,
Let me address your questions one by one.
On Wed, Aug 22, 2012 at 1:01 AM, Thiago Negri wrote:
Hello everyone. I'm taking my first steps in Cloud Haskell and ran into some unexpected behaviors.
I used the code from "Raspberry Pi in a Haskell Cloud" [1] as a first example. I did try to switch the code to use Template Haskell, with no luck, so I stuck with the verbose style.
I have pasted a version of your code that uses Template Haskell at http://hpaste.org/73520. Where did you get stuck?
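The Template Haskell version mostly replaces the hand-written closure wiring. Roughly, it looks like the sketch below (a sketch only, details may differ from the paste; myRemoteTable is just an illustrative name), with the TemplateHaskell extension enabled and Control.Distributed.Process.Closure imported:

-- Note the quote before 'worker in both remotable and mkClosure.
remotable ['worker]

myRemoteTable :: RemoteTable
myRemoteTable = __remoteTable initRemoteTable

-- In the master loop the spawn becomes:
--     spawn w ($(mkClosure 'worker) (Ping sPong))
-- and main passes myRemoteTable to initializeBackend instead of the
-- hand-wired remoteTable.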
I changed some of the code: from ProcessId-based messaging to a typed channel to receive the Pong; using "startSlave" to start the worker nodes; and the master node now loops forever, sending pings to the worker nodes.
The unexpected behaviors: - Dropping a worker node while the master is running makes the master node crash.
There are two things going on here:

1. A bug in the SimpleLocalnet backend meant that if you dropped a worker node, findSlaves might not return. I have fixed this and uploaded it to Hackage as version 0.2.0.5.

2. But even with this fix, you will still need to take into account that workers may disappear once they have been reported by findSlaves. spawn will actually throw an exception if the specified node is unreachable (it is debatable whether this is the right behaviour -- see below); one way of guarding against this is sketched below.
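For example, one way to keep the master loop alive when a slave has vanished is to wrap the spawn in a handler. A minimal sketch, assuming the Process-monad catch exported by Control.Distributed.Process and SomeException from Control.Exception (pingWorker is just an illustrative helper name):

-- Sketch: don't let an unreachable slave kill the master loop.
pingWorker :: SendPort Pong -> NodeId -> Process ()
pingWorker sPong w =
    pingIt `catch` \e ->
        say $ "Could not ping " ++ show w ++ ": " ++ show (e :: SomeException)
  where
    pingIt = do
        say $ "Sending a Ping to " ++ show w ++ "..."
        _ <- spawn w (workerClosure (Ping sPong))
        return ()

In the master loop this would replace the body of the forM_ (i.e. forM_ workers (pingWorker sPong)); if you count expected Pongs, remember to expect one fewer reply for each spawn that failed, or combine this with a receive timeout.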
- The master node does not see worker nodes started after the master process.
Yes, startMaster is merely a convenience function. I have modified the documentation to specify more clearly what startMaster does:

-- | 'startMaster' finds all slaves /currently/ available on the local network,
-- redirects all log messages to itself, and then calls the specified process,
-- passing the list of slave nodes.
--
-- Terminates when the specified process terminates. If you want to terminate
-- the slaves when the master terminates, you should manually call
-- 'terminateAllSlaves'.
--
-- If you start more slave nodes after having started the master node, you can
-- discover them with later calls to 'findSlaves', but be aware that you will
-- need to call 'redirectLogHere' to redirect their logs to the master node.
--
-- Note that you can use the functionality of "SimpleLocalnet" directly (through
-- 'Backend'), instead of using 'startMaster'/'startSlave', if the master/slave
-- distinction does not suit your application.

Note that with these modifications there is still something slightly unfortunate: if you delete a worker, and then restart it *at the same port*, the master will not see it. There is a very good reason for this: Cloud Haskell guarantees reliable ordered message passing, and we want a clear semantics for this (unlike, say, in Erlang, where you might send messages M1, M2 and M3 from P to Q, and Q might receive M1 and M3 but not M2, under certain circumstances). We (developers of Cloud Haskell, Simon Peyton-Jones and some others) are still debating what the best approach is here; in the meantime, if you restart a worker node, just give it a different port number.

Let me know if you have any other questions, and feel free to open an issue at https://github.com/haskell-distributed/distributed-process/issues?state=open if you think you found a bug.

Edsko

| I have pasted a version of your code that uses Template Haskell at
| http://hpaste.org/73520. Where did you get stuck?

Your version worked like a charm. I'm quite new to Haskell, so I was trying desperately to get TH working: I forgot to quote "worker" at mkClosure.

| 1. A bug in the SimpleLocalnet backend meant that if you dropped a
| worker node findSlaves might not return. I have fixed this and
| uploaded it to Hackage as version 0.2.0.5.

Updated to version 0.2.0.5 and it's working now. :-)

| 2. But even with this fix, you will still need to take into account
| that workers may disappear once they have been reported by findSlaves.
| spawn will actually throw an exception if the specified node is
| unreachable (it is debatable whether this is the right behaviour --
| see below).

Added exception catching, thanks.

| Note that with these modifications there is still something slightly
| unfortunate: if you delete a worker, and then restart it *at the same
| port*, the master will not see it. There is a very good reason for
| this: Cloud Haskell guarantees reliable ordered message passing, and
| we want a clear semantics for this (unlike, say, in Erlang, where you
| might send messages M1, M2 and M3 from P to Q, and Q might receive M1,
| M3 but not M2, under certain circumstances). We (developers of Cloud
| Haskell, Simon Peyton-Jones and some others) are still debating over
| what the best approach is here; in the meantime, if you restart a
| worker node, just give a different port number.

I trust you will make a good decision on this.

By the way, my new code with TH and exception catching: http://hpaste.org/73548

Thanks,
Thiago.
participants (4)

- Edsko de Vries
- Felipe Almeida Lessa
- Thiago Negri
- yi huang