How would you improve this program?

Hi all, I'm new to Haskell and I'd like you to take a look at one of my programs and tell me how you would improve it (in terms of efficiency, style, and so on!). The source code is here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs

The program is an implementation of this problem: http://www.scs.stanford.edu/11au-cs240h/labs/lab1.html (basically, counting how many times a word appears in a text). (I'm not a Stanford student, so by helping me out you won't help me cheat on my exam, don't worry!)

I've implemented 3 versions of the algorithm:

1. a Haskell version using the standard "sort": read all the words from stdin, sort them and group them.
2. a Haskell version using a map: read all the words from stdin, stick each word in a Data.Map, incrementing a counter if the word is already present in the map.
3. a Python version using defaultdict.

I timed the different versions and the results are here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png. The Python version is the quickest (I stripped out the fancy formatting before benchmarking, so IO is not responsible for the time difference). Any comments on the graph, too?

Thanks a lot!
L.

This is a stylistic issue and it is certainly subjective, but I personally
prefer using "where" a lot, as opposed to let ... in and lambdas. You use
lambdas inline with maps a lot. That is perfectly valid, and your lambdas are
pretty simple, but I find it easier to read if you come up with a descriptive
function name and put it in a where clause rather than a lambda. That way,
someone else (or perhaps you at a later date) can easily scan through the
function definitions and read what is being done in more or less plain
English.
Likewise, my issue with let ... in is, most of the time, just that it seems
backwards to me. I prefer to read the broad strokes of the code first and the
specifics later, which is why where is more attractive to me.
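To illustrate the point with a hypothetical snippet (in the spirit of your code, not taken from it), a histogram line could be rendered with an inline lambda, \(w, c) -> w ++ ": " ++ replicate (min width c) '#', or with a named where-binding:

```haskell
-- A hypothetical helper: 'report' renders one line per (word, count)
-- pair, capping the bar of hashes at 'width' characters.
report :: Int -> [(String, Int)] -> [String]
report width tokens = map formatToken tokens
  where
    -- The descriptive name documents the intent that an inline
    -- lambda would leave implicit.
    formatToken (w, c) = w ++ ": " ++ replicate (min width c) '#'
```

A reader scanning the where clause sees what each piece does without having to parse a lambda mid-expression.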
It otherwise looks like pretty clean code to me. Certainly cleaner than the
stuff I started out with.
Just my two cents.
-- Michael Xavier http://www.michaelxavier.net LinkedIn http://www.linkedin.com/pub/michael-xavier/13/b02/a26

On Sun, Oct 9, 2011 at 10:11 PM, Lorenzo Bolla
I've implemented 3 versions of the algorithm:
a Haskell version using the standard "sort": read all the words from stdin, sort them and group them. a Haskell version using map: read all the words from stdin, stick each word in a Data.Map incrementing a counter if the word is already present in the map.
You seem to be using fromListWith, which is lazy: in other words, you're accumulating enormous (1 + 1 + ... + 1) thunks in your map, especially with the higher word-count tests. It might be that GHC is optimizing that for you, but I strongly doubt it. You should probably replace it with a stricter version:

fromListWith' f xs = foldl' ins empty xs
  where ins m (k, x) = insertWith' f k x m
I hope that helps, otherwise it seems pretty good (but I only gave it a passing glance). -- Jedaï
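A complete version of that strict rebuild, as a sketch of mine rather than code from the thread; it assumes the modern containers API, in which Data.Map.Strict.insertWith plays the role of the old insertWith':

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Build the counts with a strict left fold; insertWith in
-- Data.Map.Strict forces the combined value, so no (1+1+...+1)
-- thunks accumulate in the map.
countWords :: [String] -> M.Map String Int
countWords = foldl' ins M.empty
  where ins m w = M.insertWith (+) w 1 m

main :: IO ()
main = interact $ \s ->
  unlines [ w ++ " " ++ show c | (w, c) <- M.toAscList (countWords (words s)) ]
```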

On Sunday 09 October 2011, 22:11:35, Lorenzo Bolla wrote:
Hi all, I'm new to Haskell and I'd like you to take a look at one of my programs and tell me how you would improve it (in terms of efficiency, style, and so on!).
The source code is here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs The program is an implementation of this problem: http://www.scs.stanford.edu/11au-cs240h/labs/lab1.html (basically, counting how many times a word appear in a text.) (I'm not a Stanford student, so by helping me out you won't help me to cheat my exam, don't worry!)
A few problems:

1) Your input cleaning is not correct; running it over the source file, I get (among others)

--  #################################################
->  ##################################
#############################
$   ########################

Now, they say they aren't going to be tricky, so most of this isn't necessary to take care of, but dashes/hyphens -- e.g. for parenthetical remarks -- should probably be dealt with---also the style with one em-dash not surrounded by spaces. Also quotes, "Bob said: 'I will go there!'", Alice told the listeners.

2) If you have a looong but rare word in the text, you will get too many spaces between the words and the markers; that's not good.

3) You calculate the number of '#'s using the total line length and then subtract the length of (word + spaces), skewing your output. You should calculate (spaceForHashes = lineWidth - maxLength - 1) and then use that to determine the number of hashes. [Although it should be more complicated really: maxLength should only include those words that are actually output.]

Style is okay, although like Michael Xavier, I'm more of a where-guy.
I've implemented 3 versions of the algorithm:
1. a Haskell version using the standard "sort": read all the words from stdin, sort them and group them.
Not much you can do there.
2. a Haskell version using map: read all the words from stdin, stick each word in a Data.Map incrementing a counter if the word is already present in the map.
You build the map via fromListWith (+). That's bad if you have input with large counts. fromListWith (+) gives you thunks of the form ((...((1+1)+1)...)+1) as map entries. That costs a lot of space and time. Better build the map using insertWith':

mp = foldl' ins empty words
  where ins m w = insertWith' (+) w 1 m
3. a Python version using defaultdict.
I timed the different versions and the results are here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png. The python version is the quickest (I stripped out the fancy formatting before benchmarking, so IO is not responsible for the time difference). Any comments on the graph, too?
sort is O(n*log n), map and python are O(n*log d), where n is the number of words and d the number of distinct words. In the range of the graph both look quite linear. The number of distinct words is much smaller than the number of words, so that's a smaller 'constant' (logarithmic) factor. Python seems to have a smaller factor than map, I'm not sure whether that's because Python can mutate the dictionary in-place while Haskell's immutable Map requires some copying on each insert or whether it's due to the thunks and the excessive space usage caused by them. Would be interesting to see the graph for the more efficient map-building.
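One way such a comparison could be produced: a hedged sketch using the criterion library, which the thread itself doesn't use, with "input.txt" as a placeholder corpus name:

```haskell
import Criterion.Main (bench, defaultMain, whnf)
import Data.List (foldl', group, sort)
import qualified Data.Map.Strict as M

-- Strategy 1: sort and group, O(n*log n) in the number of words n.
bySort :: [String] -> [(String, Int)]
bySort = map (\ws -> (head ws, length ws)) . group . sort

-- Strategy 2: strict map building, O(n*log d) for d distinct words.
byMap :: [String] -> [(String, Int)]
byMap = M.toAscList . foldl' (\m w -> M.insertWith (+) w 1 m) M.empty

main :: IO ()
main = do
  ws <- fmap words (readFile "input.txt")  -- placeholder file name
  defaultMain
    [ bench "sort" (whnf (length . bySort) ws)
    , bench "map"  (whnf (length . byMap)  ws)
    ]
```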

Hi,

On Sun, Oct 9, 2011 at 11:08 PM, Daniel Fischer <daniel.is.fischer@googlemail.com> wrote:
A few problems: 1) your input cleaning is not correct, running it over the source file, I get (among others)
Thanks, I'll fix it asap.
2) If you have a looong but rare word in the text, you will get too many spaces between the words and the markers, that's not good.
Thanks.
3) You calculate the number of '#'s using the total line length and then subtract the length of (word + spaces), skewing your output. You should calculate (spaceForHashes = lineWidth - maxLength - 1) and then use that to determine the number of hashes. [Although it should be more complicated really: maxLength should only include those words that are actually output.]
Fixed, thanks. Style is okay, although like Michael Xavier, I'm more of a where-guy.
Apart from style, is there any reason why one should prefer "where" over "let ... in"?
You build the map via fromListWith (+). That's bad if you have input with large counts. fromListWith (+) gives you thunks of the form (...(1+1)+1)...+1) as map-entries. That costs a lot of space and time. Better build the map using insertWith',
mp = foldl' ins empty words
  where ins m w = insertWith' (+) w 1 m
Please find the "strict" version here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs. Plot of times is now: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png Python is still slightly faster, but I'm getting there.
3. a Python version using defaultdict. sort is O(n*log n), map and python are O(n*log d), where n is the number of words and d the number of distinct words.
I believe Python's defaultdict has (optimistically) O(1) lookup and insert, because it is implemented as a hashmap.

Thanks everybody for your tips. They are most welcome.
L.

On Monday 10 October 2011, 10:13:45, Lorenzo Bolla wrote:
Apart from style, is there any reason why one should prefer "where" over "let ... in"?
Undoubtedly, some people have come up with theoretical reasons why one is generally better than the other. I haven't seen a convincing one yet, though. There are some cases where one is clearly better than the other. Bindings scoping over guards,

foo x y z
  | a == 0    = ...
  | b <= a    = ...
  | a <= b+c  = ...
  | otherwise = ...
  where
    a = notTooExpensive x y
    b = soSoExpensive z y
    c = reallyExpensive x y z

are much easier to write and read with where. On the other hand, (let binds in expr) is an expression, so it's easily usable in contexts requiring an expression, such as lambda abstractions and do-blocks:

map (\x -> let a = h x in if b then foo a x else bar a x) list

do a <- b
   c <- d
   let e = f a c
   bar e c
   baz foo a e

where a where-clause would be awkward at best (though the map example isn't particularly good; there

map quux list
  where
    quux x = if b then foo a x else bar a x
      where a = h x

isn't bad and becomes easier to understand if the lambda becomes complicated enough). So there are places where one is (clearly) more convenient/better than the other, but in other places it's purely up to personal preference.
You build the map via fromListWith (+). That's bad if you have input with large counts. fromListWith (+) gives you thunks of the form (...(1+1)+1)...+1) as map-entries. That costs a lot of space and time. Better build the map using insertWith',
mp = foldl' ins empty words
  where ins m w = insertWith' (+) w 1 m
Please find the "strict" version here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs.
One further time-saver (in printTokens):

maxCount = maximum $ map count tokens

You need not traverse the entire list of tokens for that:

maxCount = case sortedTokens of
             (Token _ c : _) -> c
Plot of times is now: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png
Python is still slightly faster, but I'm getting there.
3. a Python version using defaultdict.
sort is O(n*log n), map and python are O(n*log d), where n is the number of words and d the number of distinct words.
I believe Python's defaultdict has (optimistically) O(1) lookup and insert, because it is implemented as a hashmap.
Ah, I tend to forget about hashmaps. Although the O(1) isn't quite true [calculating the hash isn't bounded time, and the lookup in the bucket need not be either], for a good hashtable it's true enough that it scales better than a tree-based map.
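For completeness, Haskell has hash-based maps too, via the unordered-containers package; this is a sketch of mine, not something suggested in the thread:

```haskell
import Data.List (foldl')
import qualified Data.HashMap.Strict as HM

-- Same strict word count, but over a hashmap: lookup and insert
-- cost O(hash) rather than O(log d) tree descents.
countWordsHM :: [String] -> HM.HashMap String Int
countWordsHM = foldl' (\m w -> HM.insertWith (+) w 1 m) HM.empty
```

Note that Data.HashMap.Strict.insertWith, like its Data.Map.Strict counterpart, evaluates the combined value, so the thunk problem discussed above does not reappear.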

You have your main function, which calls printTokens, which in turn calls
printToken. Both of the print* functions are IO (). I see that as
bad style.
It would be cleaner to build the string to be printed in
some nice pure functions and then print that string in main directly.
That way you isolate the IO actions in the main function only.
I'm thinking something like this:
tokenToString :: Int -> Int -> Token -> String
tokenToString maxLength maxCount (Token w c) =
  ...

tokenListToString :: [Token] -> String
tokenListToString tokens = join "\n" result  -- join comes from Data.List.Utils
  where
    result = map (tokenToString maxLength maxCount) sortedTokens
    ...

main = do
  words <- getContents
  let output = tokenListToString $ countTokens words
  putStr output
This way your function types actually mean something, instead of your just
having IO () functions which could do whatever they like.
ovidiu

On Tue, Oct 11, 2011 at 4:28 PM, Brent Yorgey
On Tue, Oct 11, 2011 at 02:04:23PM +0300, Ovidiu Deac wrote:
join "\n" result -- from Data.List.Utils
By the way, 'join "\n" result' is better written 'unlines result' (and 'join' is better written 'intercalate'). (Otherwise I completely agree with your email.)
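The two aren't quite interchangeable, which is worth knowing when swapping them in; a minimal sketch, assuming GHC's standard Prelude and Data.List:

```haskell
import Data.List (intercalate)

-- unlines terminates every line with '\n'; intercalate "\n" only
-- separates the lines, leaving no trailing newline.
main :: IO ()
main = do
  print (unlines ["a", "b"])           -- "a\nb\n"
  print (intercalate "\n" ["a", "b"])  -- "a\nb"
```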
Fixed. https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs Thanks for the suggestions. L.
participants (6)
- Brent Yorgey
- Chaddaï Fouché
- Daniel Fischer
- Lorenzo Bolla
- Michael Xavier
- Ovidiu Deac