Comments on Map/Reduce Code

5 Jul 2012

      Hello there,

I'm new to Haskell and want to learn it a bit while experimenting with
Hadoop's Map/Reduce programming model. So, I wanted to implement the
standard ‘word counter’ problem in Haskell. The problem is as follows:

We have several texts with words separated by white-space. We want to
count the occurrences of all words in all the texts (where ‘but’ and
‘but,’ can be seen as two different words). This is done in two
phases. In the Map phase a program gets a part of the text from stdout
and has to produce a "KEY\tVALUE" pair (that is, the key separated with
a tab to the value), which has to be passed to stdin. In our case we
simply produce "WordX\t1" for every word WordX. This list is sorted by
the key and later on fed as stdin to the reducer (the second phase). The
reducer now has to sum up all the occurrences we trivially counted in
the Map phase and put it as "WordX\tNumber" to stdout.

So, here is an example:
vince@roku:~/tmp echo "foo foo bar bar foo bar zoo bar foo" | runhaskell mapper.hs 
foo	1                                                                         
foo	1
bar	1
bar	1
foo	1
bar	1
zoo	1
bar	1
foo	1
vince@roku:~/tmp echo "foo foo bar bar foo bar zoo bar foo" | runhaskell mapper.hs | sort | runhaskell reducer.hs
bar	4
foo	4
zoo	1

And here is the code I've come up with:

vince@roku:~/tmp cat mapper.hs
import qualified Data.ByteString.Lazy.Char8 as C

postFix :: C.ByteString
postFix = C.pack "\t1"

formatter :: C.ByteString -> C.ByteString
formatter x = C.append x postFix

main :: IO ()
main = do
  contents <- fmap C.words C.getContents
  C.putStr . C.unlines $ map formatter contents

vince@roku:~/tmp cat reducer.hs
import qualified Data.ByteString.Lazy.Char8 as C
import qualified Data.List as L

tuppleize :: String -> (String, Int)
tuppleize line = (\xs -> (head xs, read (last xs))) $ words line

group :: Eq a => [(a, b)] -> [[(a, b)]]
group = L.groupBy (\x y -> fst x == fst y)

summation :: Num b => [(a, b)] -> (a, b)
summation (x:[]) = x
summation (x:xs) = (fst x, (snd x) + (snd (summation xs)))

formatter :: (String, Int) -> String
formatter = (\w -> (fst w ++ "\t" ++ show (snd w)))

main = do
  contents <- C.getContents
  putStr . unlines $ map formatter $ map summation $ group $ map tuppleize $ lines $ C.unpack contents

As already said, I'm a Haskell beginner. Could you provide some comments
on the code?

Thanks in advance,

       Thomas.

Thomas Bach

Brent Yorgey

Thomas Bach

Brent Yorgey

Thomas Bach

Brent Yorgey

tags

participants (2)