
Hello there, I'm new to Haskell and want to learn it a bit while experimenting with Hadoop's Map/Reduce programming model. So, I wanted to implement the standard ‘word counter’ problem in Haskell. The problem is as follows: We have several texts with words separated by white-space. We want to count the occurrences of all words in all the texts (where ‘but’ and ‘but,’ can be seen as two different words). This is done in two phases. In the Map phase a program gets a part of the text from stdout and has to produce a "KEY\tVALUE" pair (that is, the key separated with a tab to the value), which has to be passed to stdin. In our case we simply produce "WordX\t1" for every word WordX. This list is sorted by the key and later on fed as stdin to the reducer (the second phase). The reducer now has to sum up all the occurrences we trivially counted in the Map phase and put it as "WordX\tNumber" to stdout. So, here is an example: vince@roku:~/tmp echo "foo foo bar bar foo bar zoo bar foo" | runhaskell mapper.hs foo 1 foo 1 bar 1 bar 1 foo 1 bar 1 zoo 1 bar 1 foo 1 vince@roku:~/tmp echo "foo foo bar bar foo bar zoo bar foo" | runhaskell mapper.hs | sort | runhaskell reducer.hs bar 4 foo 4 zoo 1 And here is the code I've come up with: vince@roku:~/tmp cat mapper.hs import qualified Data.ByteString.Lazy.Char8 as C postFix :: C.ByteString postFix = C.pack "\t1" formatter :: C.ByteString -> C.ByteString formatter x = C.append x postFix main :: IO () main = do contents <- fmap C.words C.getContents C.putStr . C.unlines $ map formatter contents vince@roku:~/tmp cat reducer.hs import qualified Data.ByteString.Lazy.Char8 as C import qualified Data.List as L tuppleize :: String -> (String, Int) tuppleize line = (\xs -> (head xs, read (last xs))) $ words line group :: Eq a => [(a, b)] -> [[(a, b)]] group = L.groupBy (\x y -> fst x == fst y) summation :: Num b => [(a, b)] -> (a, b) summation (x:[]) = x summation (x:xs) = (fst x, (snd x) + (snd (summation xs))) formatter :: (String, Int) -> String formatter = (\w -> (fst w ++ "\t" ++ show (snd w))) main = do contents <- C.getContents putStr . unlines $ map formatter $ map summation $ group $ map tuppleize $ lines $ C.unpack contents As already said, I'm a Haskell beginner. Could you provide some comments on the code? Thanks in advance, Thomas.

On Thu, Jul 05, 2012 at 03:50:32PM +0200, Thomas Bach wrote:
Hello there,
Hi Thomas, Looks pretty good. I've interspersed a few comments below.
And here is the code I've come up with:
vince@roku:~/tmp cat mapper.hs import qualified Data.ByteString.Lazy.Char8 as C
postFix :: C.ByteString postFix = C.pack "\t1"
formatter :: C.ByteString -> C.ByteString formatter x = C.append x postFix
main :: IO () main = do contents <- fmap C.words C.getContents C.putStr . C.unlines $ map formatter contents
The above looks fine, except that generally the recommendation is to use the text package [1] for dealing with text, whereas ByteString is for binary data that you wish to manipulate as a sequence of bytes. You can get away with the above only when the text consists entirely of ASCII characters. [1] http://hackage.haskell.org/package/text
vince@roku:~/tmp cat reducer.hs import qualified Data.ByteString.Lazy.Char8 as C import qualified Data.List as L
tuppleize :: String -> (String, Int) tuppleize line = (\xs -> (head xs, read (last xs))) $ words line
What happens when the line is empty?
group :: Eq a => [(a, b)] -> [[(a, b)]] group = L.groupBy (\x y -> fst x == fst y)
The above lambda can also be written as ((==) `on` fst). 'on' can be imported from Data.Function.
summation :: Num b => [(a, b)] -> (a, b) summation (x:[]) = x summation (x:xs) = (fst x, (snd x) + (snd (summation xs)))
Instead of using (fst x) and (snd x), you should pattern-match on x, like summation ((x,y):xs) = (x, y + ...)
formatter :: (String, Int) -> String formatter = (\w -> (fst w ++ "\t" ++ show (snd w)))
The same goes here. I would also put the w argument on the left-hand side of the =, like formatter (s,i) = s ++ "\t" ++ show i
main = do contents <- C.getContents putStr . unlines $ map formatter $ map summation $ group $ map tuppleize $ lines $ C.unpack contents
Instead of using a chain of ($), it's generally considered better style to use a chain of (.) with a single $ at the end, like putStr . unlines . map formatter ... lines . C.unpack $ contents -Brent

Hi Brent, thanks for your suggestions. I got further suggestions as a private e-mail and finally came up with the following solution – suggestions welcome: ==================== mapper.hs ==================== import qualified Data.Text.Lazy as TL import qualified Data.Text.Lazy.IO as TLIO postFix :: TL.Text postFix = TL.pack "\t1" formatter :: TL.Text -> TL.Text formatter x = TL.append x postFix main = TLIO.interact pipeline where words = TL.words format = map formatter pipeline = TL.unlines . format . words =================================================== ==================== reducer.hs =================== import qualified Data.Text.Lazy.IO as TLIO import qualified Data.Text.Lazy as TL import qualified Data.List as L import Data.Function (on) tuppleize :: TL.Text -> (TL.Text, Int) tuppleize line = (word, num) where words = TL.words line word = head words num = read . TL.unpack $ last words group :: Eq a => [(a, b)] -> [[(a, b)]] group = L.groupBy ((==) `on` fst) summation :: Num b => [(a, b)] -> (a, b) summation ((x, y):xs) = (x, y + sum (map snd xs)) formatter :: (TL.Text, Int) -> TL.Text formatter (w, i) = TL.append w (TL.pack ('\t':show i)) main = TLIO.interact pipeline where tuples = TL.lines group_tuples = group . map tuppleize sum_tuples = map (formatter . summation) pipeline = TL.unlines . sum_tuples . group_tuples . tuples =================================================== The reducer still throws an error when piping in an empty newline. But, I'm not sure, what a proper solution for this could be. On Tue, Jul 10, 2012 at 10:19:48AM -0400, Brent Yorgey wrote:
On Thu, Jul 05, 2012 at 03:50:32PM +0200, Thomas Bach wrote:
Instead of using a chain of ($), it's generally considered better style to use a chain of (.) with a single $ at the end, like
I often find myself trying different combinations of ($), (.) or putting things in brackets, when the compiler throws an error on me. Without really knowing why things sometimes work out and sometimes don't. I find this part especially confusing! Regards, Thomas

On Fri, Jul 13, 2012 at 06:02:02PM +0200, Thomas Bach wrote:
Hi Brent,
thanks for your suggestions. I got further suggestions as a private e-mail and finally came up with the following solution – suggestions welcome:
Looks good!
The reducer still throws an error when piping in an empty newline. But, I'm not sure, what a proper solution for this could be.
The problem is your 'summation' function: summation :: Num b => [(a, b)] -> (a, b) In fact, it is impossible to implement something with this type which works for all inputs. If you get the empty list as input, there is no way to make up a value of type 'a' in the output tuple. Now, if you happen to know a more concrete type for 'a' then you can do it by using some sort of default value. For example, in your case it looks like 'a' will actually be TL.Text. So you could write summation :: Num b => [(TL.Text, b)] -> (TL.Text, b) by doing something like returning the empty string along with 0 in the case that the input list is empty. In this particular case it is probably not that big of a deal. It is just good practice to try to always write *total* functions, that is, functions which give a sensible result (i.e. do not crash) for every possible input. This means doing things like (1) make sure you have covered every possible case when pattern-matching; (2) do not use functions which can crash like 'head' or 'tail' (use pattern-matching instead).
On Tue, Jul 10, 2012 at 10:19:48AM -0400, Brent Yorgey wrote:
On Thu, Jul 05, 2012 at 03:50:32PM +0200, Thomas Bach wrote:
Instead of using a chain of ($), it's generally considered better style to use a chain of (.) with a single $ at the end, like
I often find myself trying different combinations of ($), (.) or putting things in brackets, when the compiler throws an error on me. Without really knowing why things sometimes work out and sometimes don't. I find this part especially confusing!
This can be confusing at first. The solution is (1) make sure you understand how parsing/operator precedence work, and (2) think hard about the types of operators like ($) and (.), and what any type errors are telling you. You really will not learn much by just randomly trying to insert ($), (.) or parentheses. -Brent

On Sat, Jul 14, 2012 at 08:35:44PM -0400, Brent Yorgey wrote:
On Fri, Jul 13, 2012 at 06:02:02PM +0200, Thomas Bach wrote:
The reducer still throws an error when piping in an empty newline. But, I'm not sure, what a proper solution for this could be.
The problem is your 'summation' function:
summation :: Num b => [(a, b)] -> (a, b)
In fact, it is impossible to implement something with this type which works for all inputs. If you get the empty list as input, there is no way to make up a value of type 'a' in the output tuple.
It really is not that big of a deal. As I'd assume that Hadoop guarantees that at least one line will be passed to the reducer. But, just out of curiosity: wouldn't this be a case where monads are applied? Say, `Maybe'? So that the type becomes summation :: Num b => [(a, b)] -> Maybe (a, b) Or is the semantic behind monads a different one? Regards, Thomas

On Tue, Jul 17, 2012 at 03:01:39PM +0200, Thomas Bach wrote:
On Sat, Jul 14, 2012 at 08:35:44PM -0400, Brent Yorgey wrote:
On Fri, Jul 13, 2012 at 06:02:02PM +0200, Thomas Bach wrote:
The reducer still throws an error when piping in an empty newline. But, I'm not sure, what a proper solution for this could be.
The problem is your 'summation' function:
summation :: Num b => [(a, b)] -> (a, b)
In fact, it is impossible to implement something with this type which works for all inputs. If you get the empty list as input, there is no way to make up a value of type 'a' in the output tuple.
It really is not that big of a deal. As I'd assume that Hadoop guarantees that at least one line will be passed to the reducer. But, just out of curiosity: wouldn't this be a case where monads are applied? Say, `Maybe'? So that the type becomes
summation :: Num b => [(a, b)] -> Maybe (a, b)
Yes, wrapping the return type in Maybe could be a good idea indeed. You say "wouldn't this be a case where monads are applied", but you are jumping too far ahead: using Maybe in this way is a good idea no matter whether you happen to know that Maybe is an instance of Monad or not. Likewise, you could work with a function of this type without using the Monad interface at all. Using the Monad interface might simplify some of the code but it is by no means required. -Brent
participants (2)
-
Brent Yorgey
-
Thomas Bach