
On Monday 10 October 2011, 10:13:45, Lorenzo Bolla wrote:
Apart from style, is there any reason why one should prefer "where" over "let..in"?
Undoubtedly, some people have come up with theoretical reasons why one is generally better than the other. I haven't seen a convincing one yet, though.

There are some cases where one is clearly better than the other. Bindings scoping over guards,

    foo x y z
      | a == 0    = ...
      | b <= a    = ...
      | a <= b+c  = ...
      | otherwise = ...
        where
          a = notTooExpensive x y
          b = soSoExpensive z y
          c = reallyExpensive x y z

are much easier to write and read with where. On the other hand, "let binds in expr" is an expression, so it's easily usable in contexts requiring an expression, such as lambda abstractions and do-blocks:

    map (\x -> let a = h x in if b then foo a x else bar a x) list

    do a <- b
       c <- d
       let e = f a c
       bar e c
       baz foo a e

where a where would be awkward at best (though the map example isn't particularly good; there

    map quux list
      where
        quux x = if b then foo a x else bar a x
          where
            a = h x

isn't bad and becomes easier to understand once the lambda becomes complicated enough).

So there are places where one is (clearly) more convenient/better than the other, but in other places it's purely up to personal preference.
You build the map via fromListWith (+). That's bad if you have input with large counts: fromListWith (+) gives you thunks of the form (..((1+1)+1)..)+1 as map entries. That costs a lot of space and time. Better build the map using insertWith':
    mp = foldl' ins empty words
      where
        ins m w = insertWith' (+) w 1 m
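For context, here is a minimal, self-contained sketch of the two build strategies. Note that insertWith' comes from the old lazy Data.Map API; in current containers it has been removed, and Data.Map.Strict.insertWith plays the same role. The sample sentence and the names lazyCounts/strictCounts are illustrative, not from the lab code.

    import Data.List (foldl')
    import qualified Data.Map as M          -- lazy map: fromListWith (+) builds chains of (+1) thunks
    import qualified Data.Map.Strict as MS  -- strict map: values are forced on insert

    -- Lazy build: each duplicate key chains another (+1) thunk onto the value.
    lazyCounts :: [String] -> M.Map String Int
    lazyCounts ws = M.fromListWith (+) [(w, 1) | w <- ws]

    -- Strict build: insertWith forces the combined value, so counts stay evaluated.
    strictCounts :: [String] -> MS.Map String Int
    strictCounts = foldl' ins MS.empty
      where
        ins m w = MS.insertWith (+) w 1 m

    main :: IO ()
    main = print (MS.toList (strictCounts (words "a b a c a b")))
    -- [("a",3),("b",2),("c",1)]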
Please find the "strict" version here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs.
One further time-saver (in printTokens):

    maxCount = maximum $ map count tokens

You need not traverse the entire list of tokens for that; since sortedTokens is sorted by descending count, its head already carries the maximum:

    maxCount = case sortedTokens of
                 (Token _ c : _) -> c
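A runnable micro-example of that shortcut, under the assumption that the lab's Token holds a word and its count and that sortedTokens is sorted by descending count (the Token definition and sample data below are made up for illustration):

    import Data.List (sortBy)
    import Data.Ord (comparing)

    data Token = Token String Int

    count :: Token -> Int
    count (Token _ c) = c

    maxCountOf :: [Token] -> Int
    maxCountOf tokens = case sortedTokens of
                          (Token _ c : _) -> c   -- head of the descending sort
                          []              -> 0   -- guard against empty input
      where
        sortedTokens = sortBy (flip (comparing count)) tokens

    main :: IO ()
    main = print (maxCountOf [Token "a" 3, Token "b" 7, Token "c" 1])  -- prints 7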
Plot of times is now: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png
Python is still slightly faster, but I'm getting there.
3. a Python version using defaultdict.
sort is O(n*log n); the Map-based version and the Python version are O(n*log d), where n is the number of words and d the number of distinct words.
I believe Python's defaultdict has (optimistically) O(1) lookup and insert, because it is implemented as a hash map.
Ah, I tend to forget about hashmaps. Although the O(1) isn't quite true [calculating the hash isn't bounded time, and the lookup in the bucket need not be either], for a good hashtable it's true enough that it scales better than a tree-based map.
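For completeness, the hash-based analogue on the Haskell side would be something like Data.HashMap.Strict from the unordered-containers package (not used in the lab; this is just a sketch): it trades the Ord-based O(log d) tree lookups for amortized O(1)-ish hashing.

    import Data.List (foldl')
    import qualified Data.HashMap.Strict as HM  -- from unordered-containers

    -- Same strict fold as before, but over a hash map keyed on the word.
    hashCounts :: [String] -> HM.HashMap String Int
    hashCounts = foldl' ins HM.empty
      where
        ins m w = HM.insertWith (+) w 1 m

    main :: IO ()
    main = print (HM.toList (hashCounts (words "to be or not to be")))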