monadic MapReduce

Hi. I have a function that do some IO (take a file path, read the file, parse, and return some data), and I would like to parallelize it, so that multiple files can be parsed in parallel. I would like to use the simple mapReduce function, from Real Word Haskell: mapReduce :: Strategy b -- evaluation strategy for mapping -> (a -> b) -- map function -> Strategy c -- evaluation strategy for reduction -> ([b] -> c) -- reduce function -> [a] -- list to map over -> c mapReduce mapStrat mapFunc reduceStrat reduceFunc input = mapResult `pseq` reduceResult where mapResult = parMap mapStrat mapFunc input reduceResult = reduceFunc mapResult `using` reduceStrat Is this possible? Thanks Manlio Perillo I

On Sun, Mar 01, 2009 at 07:25:56PM +0100, Manlio Perillo wrote:
Hi.
I have a function that do some IO (take a file path, read the file, parse, and return some data), and I would like to parallelize it, so that multiple files can be parsed in parallel.
I would like to use the simple mapReduce function, from Real Word Haskell:
mapReduce :: Strategy b -- evaluation strategy for mapping -> (a -> b) -- map function -> Strategy c -- evaluation strategy for reduction -> ([b] -> c) -- reduce function -> [a] -- list to map over -> c
mapReduce mapStrat mapFunc reduceStrat reduceFunc input = mapResult `pseq` reduceResult where mapResult = parMap mapStrat mapFunc input reduceResult = reduceFunc mapResult `using` reduceStrat
Is this possible?
Thanks Manlio Perillo
Would this work? Read in each file into a string (or byteString) using a lazy function and then call mapReduce with the strings instead of file paths. import qualified Data.Bytestring.Lazy.Char8 as L do let handles = map (openFile ) files strings <- mapM L.hGetContents handles let result = mapReduce ... The actual work of reading in the file should happen on-demand inside the parsing function called by mapReduce. I would like to know if this gives you the speedup you expect. Regards, Anish

Anish Muttreja ha scritto:
On Sun, Mar 01, 2009 at 07:25:56PM +0100, Manlio Perillo wrote:
Hi.
I have a function that do some IO (take a file path, read the file, parse, and return some data), and I would like to parallelize it, so that multiple files can be parsed in parallel.
I would like to use the simple mapReduce function, from Real Word Haskell:
mapReduce :: Strategy b -- evaluation strategy for mapping -> (a -> b) -- map function -> Strategy c -- evaluation strategy for reduction -> ([b] -> c) -- reduce function -> [a] -- list to map over -> c
mapReduce mapStrat mapFunc reduceStrat reduceFunc input = mapResult `pseq` reduceResult where mapResult = parMap mapStrat mapFunc input reduceResult = reduceFunc mapResult `using` reduceStrat
Is this possible?
Thanks Manlio Perillo
Would this work?
I suspect that it will not work..
Read in each file into a string (or byteString) using a lazy function and then call mapReduce with the strings instead of file paths.
import qualified Data.Bytestring.Lazy.Char8 as L do let handles = map (openFile ) files strings <- mapM L.hGetContents handles let result = mapReduce ...
The actual work of reading in the file should happen on-demand inside the parsing function called by mapReduce.
By doing this I will probably lose any control about file resources usage. Thanks Manlio

On Mon, Mar 02, 2009 at 04:10:41PM +0100, Manlio Perillo wrote:
Anish Muttreja ha scritto:
On Sun, Mar 01, 2009 at 07:25:56PM +0100, Manlio Perillo wrote:
Hi.
I have a function that do some IO (take a file path, read the file, parse, and return some data), and I would like to parallelize it, so that multiple files can be parsed in parallel.
I would like to use the simple mapReduce function, from Real Word Haskell:
mapReduce :: Strategy b -- evaluation strategy for mapping -> (a -> b) -- map function -> Strategy c -- evaluation strategy for reduction -> ([b] -> c) -- reduce function -> [a] -- list to map over -> c
mapReduce mapStrat mapFunc reduceStrat reduceFunc input = mapResult `pseq` reduceResult where mapResult = parMap mapStrat mapFunc input reduceResult = reduceFunc mapResult `using` reduceStrat
Is this possible?
Thanks Manlio Perillo
Would this work?
I suspect that it will not work..
Read in each file into a string (or byteString) using a lazy function and then call mapReduce with the strings instead of file paths.
import qualified Data.Bytestring.Lazy.Char8 as L do let handles = map (openFile ) files strings <- mapM L.hGetContents handles let result = mapReduce ...
The actual work of reading in the file should happen on-demand inside the parsing function called by mapReduce.
By doing this I will probably lose any control about file resources usage.
OK. How about this. Is there a reason why I can't replace the variables b and c in the type signature of mapReduce with with (IO b') and (IO c'). b and c can be any types. mapReduce :: Strategy (IO b') -- evaluation strategy for mapping -> (a -> IO b') -- map function -> Strategy (IO c') -- evaluation strategy for reduction -> ([IO b'] -> (IO c')) -- reduce function -> [a] -- list to map over -> (IO c') Just remember to wrap all values back in the IO monad. Anish
Thanks Manlio

How about this. Is there a reason why I can't replace the variables b and c in the type signature of mapReduce with with (IO b') and (IO c'). b and c can be any types.
mapReduce :: Strategy (IO b') -- evaluation strategy for mapping -> (a -> IO b') -- map function -> Strategy (IO c') -- evaluation strategy for reduction -> ([IO b'] -> (IO c')) -- reduce function -> [a] -- list to map over -> (IO c')
Just remember to wrap all values back in the IO monad.
Remember, the idea of map-reduce is to provide a very restrictive programming interface so that you have a lot of flexibility in your execution strategy. If you start loosening the interface you will still be able to execute the program, but you may not be able to perform all the great optimizations you want to perform. For example, if you are using IO actions that are stateful, what are the semantics? Can one map action affect other map actions? Does this imply an ordering of the map functions? Does this imply they all run on the same machine or at least have state communicated between the machines on which they run? The austere interface precludes any of these issues, and therein lies the beauty.
Anish
Btw. I prefer the sawzall formulation over the map-reduce formulation. A sawzall program just specifies how to map some data to a monoid and the system is free to mappend the monoid values in whatever order it wants (by applying associativity). Tim Newsham http://www.thenewsh.com/~newsham/

On Tue, Mar 03, 2009 at 07:27:35AM -1000, Tim Newsham wrote:
How about this. Is there a reason why I can't replace the variables b and c in the type signature of mapReduce with with (IO b') and (IO c'). b and c can be any types.
mapReduce :: Strategy (IO b') -- evaluation strategy for mapping -> (a -> IO b') -- map function -> Strategy (IO c') -- evaluation strategy for reduction -> ([IO b'] -> (IO c')) -- reduce function -> [a] -- list to map over -> (IO c')
Just remember to wrap all values back in the IO monad.
Remember, the idea of map-reduce is to provide a very restrictive programming interface so that you have a lot of flexibility in your execution strategy. If you start loosening the interface you will still be able to execute the program, but you may not be able to perform all the great optimizations you want to perform. For example, if you are using IO actions that are stateful, what are the semantics? Can one map action affect other map actions? Does this imply an ordering of the map functions? Does this imply they all run on the same machine or at least have state communicated between the machines on which they run?
Yes, IO actions in mapReduce will introduce all these questions. I thought Manlio's example was a case where the IO actions consist only of reading an entire file into a string and no ordering or dependence between map actions is implied. I have a similar application where this, unsafe as it is, might be useful. I need to read in a large number of CSV files, with numeric columns, and group columns with high correlation together. I have yet to try my own suggestion though. I have tried the first approach I suggested, read in the contents into strings and work with the normal map-reduce. That does keep the handles open till map-reduce is done.
The austere interface precludes any of these issues, and therein lies the beauty.
Anish
Btw. I prefer the sawzall formulation over the map-reduce formulation. A sawzall program just specifies how to map some data to a monoid and the system is free to mappend the monoid values in whatever order it wants (by applying associativity).
Thanks for the tip, I wasn't aware of this. At least in the map-reduce formulation in this thread, the fold is completely specified by the user and not optimized further. So the sawazall formulation seems promising, because sometimes the fold action is more expensive than the map action. And while we are talking about different formulations, the code here http://article.gmane.org/gmane.comp.lang.haskell.cafe/41944 splits the list of operands into chunks and does the fold action in parallel on each chunk. This actually works well for me, though I find the semantics a little hard to remember. Anish
Tim Newsham http://www.thenewsh.com/~newsham/

On Mon, Mar 2, 2009 at 6:57 PM, Anish Muttreja
How about this. Is there a reason why I can't replace the variables b and c in the type signature of mapReduce with with (IO b') and (IO c'). b and c can be any types.
mapReduce :: Strategy (IO b') -- evaluation strategy for mapping -> (a -> IO b') -- map function -> Strategy (IO c') -- evaluation strategy for reduction -> ([IO b'] -> (IO c')) -- reduce function -> [a] -- list to map over -> (IO c')
Just remember to wrap all values back in the IO monad.
This is possible, but it probably won't do what you want. The input to
the reduce function will be a list of "IO a" values, not the results
of performing the IO.
--
Dave Menendez

Anish Muttreja ha scritto:
[...]
How about this. Is there a reason why I can't replace the variables b and c in the type signature of mapReduce with with (IO b') and (IO c'). b and c can be any types.
mapReduce :: Strategy (IO b') -- evaluation strategy for mapping -> (a -> IO b') -- map function -> Strategy (IO c') -- evaluation strategy for reduction -> ([IO b'] -> (IO c')) -- reduce function -> [a] -- list to map over -> (IO c')
Just remember to wrap all values back in the IO monad.
The other day I found, with google, a definizion of mapReduce, that make use of forkIO to execute piece of IO actions on separate threads. I can't find it anymore...
Anish
Manlio
participants (4)
-
Anish Muttreja
-
David Menendez
-
Manlio Perillo
-
Tim Newsham