
Hi,

On Sun, Oct 9, 2011 at 11:08 PM, Daniel Fischer <daniel.is.fischer@googlemail.com> wrote:
> A few problems:
>
> 1) Your input cleaning is not correct; running it over the source file, I get (among others)
Thanks, I'll fix it ASAP.
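For reference, a minimal sketch of the kind of cleaning I have in mind (lowercase everything and keep letters only; whether that matches the lab spec exactly is my assumption):

    import Data.Char (isAlpha, toLower)

    -- Split on whitespace, lowercase, strip non-letters, and drop
    -- tokens that end up empty (e.g. standalone punctuation).
    cleanWords :: String -> [String]
    cleanWords = filter (not . null) . map (filter isAlpha . map toLower) . words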
> 2) If you have a looong but rare word in the text, you will get too many spaces between the words and the markers; that's not good.
Thanks.
> 3) You calculate the number of '#'s using the total line length and then subtract the length of (word + spaces), skewing your output. You should calculate (spaceForHashes = lineWidth - maxLength - 1) and then use that to determine the number of hashes. [Although it should really be more complicated: maxLength should only include those words that are actually output.]
Fixed, thanks.
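Concretely, a minimal sketch of the corrected calculation (lineWidth, maxLength and spaceForHashes follow Daniel's naming; scaling against the most frequent word's count is my assumption about the spec):

    -- Number of '#'s for one word: use only the space left after the
    -- widest word column and one separating blank, as suggested above.
    barLength :: Int -> Int -> Int -> Int -> Int
    barLength lineWidth maxLength maxCount count =
        (count * spaceForHashes) `div` maxCount
      where
        spaceForHashes = lineWidth - maxLength - 1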
> Style is okay, although, like Michael Xavier, I'm more of a where-guy.

Apart from style, is there any reason why one should prefer "where" over "let .. in"?
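For concreteness, the two styles side by side (a trivial example of my own):

    -- "where": bindings come after the expression that uses them
    mean :: [Double] -> Double
    mean xs = total / len
      where
        total = sum xs
        len   = fromIntegral (length xs)

    -- "let .. in": bindings come before the expression
    mean' :: [Double] -> Double
    mean' xs =
        let total = sum xs
            len   = fromIntegral (length xs)
        in  total / len

One practical difference I'm aware of: "where" bindings scope over all the guards of an equation, while "let .. in" is just an expression.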
> You build the map via fromListWith (+). That's bad if you have input with large counts: fromListWith (+) gives you thunks of the form (...((1+1)+1)...+1) as map entries. That costs a lot of space and time. Better to build the map using insertWith':
> mp = foldl' ins empty words
>   where
>     ins m w = insertWith' (+) w 1 m
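In current containers, insertWith' lives on as the value-strict insertWith of Data.Map.Strict; a self-contained sketch of the same idea (the name wordCounts is mine):

    import Data.List (foldl')
    import qualified Data.Map.Strict as M

    -- Strict insertion: Data.Map.Strict.insertWith forces the combined
    -- value, so no chain of (+) thunks builds up inside the map.
    wordCounts :: [String] -> M.Map String Int
    wordCounts = foldl' ins M.empty
      where
        ins m w = M.insertWith (+) w 1 m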
Please find the "strict" version here: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/lab1.hs. The plot of timings is now at: https://github.com/lbolla/stanford-cs240h/blob/master/lab1/times.png. Python is still slightly faster, but I'm getting there.
> 3. a Python version using defaultdict. sort is O(n*log n); map and Python are O(n*log d), where n is the number of words and d the number of distinct words.
I believe Python's defaultdict has (optimistically) O(1) lookup and insert, because it is implemented as a hash map.

Thanks everybody for your tips. They are most welcome.

L.
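P.S. For an apples-to-apples comparison on the Haskell side, a hash-based map gives the same expected O(1) operations; a minimal sketch using Data.HashMap.Strict from the unordered-containers package (assuming it is available; the name wordCountsHM is mine):

    import Data.List (foldl')
    import qualified Data.HashMap.Strict as HM

    -- Hash-based word counts: expected O(1) lookup and insert per word,
    -- the same asymptotics as Python's defaultdict(int).
    wordCountsHM :: [String] -> HM.HashMap String Int
    wordCountsHM = foldl' (\m w -> HM.insertWith (+) w 1 m) HM.empty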