Re: [Haskell-cafe] Re: Haskell version of Norvig's Python Spelling Corrector

24 Apr 2007

      Albert Y. C. Lai wrote:
...
I try using WordSet = [String] (plus corresponding change in code) and 
get great speedup, actually way more than 3x. There was also a memory 
growth phenomenon using Set String, and replacement by [String] stops 
that too, now it's constant space (constant = 20M). It is possible to 
attribute part of the speedup to excellent rewrite rules in GHC 
regarding lists; however, I cannot explain the memory growth when using 
Set.
Now I see. The higher memory usage of the Set version is not growth or 
leak, just brutal. For each input word, a ton of derived words are 
generated, and an optimal one is chosen (by a criterion that's 
essentially a mapping from words to scores). There are two ways to do this.

You could put the derived words into a lazy list and then traverse it, 
and therefore they're generated, examined, and thrown away on demand. At 
most two words are ever going to be stored (the best one so far and the 
next candidate), maybe plus a couple of cons cells if the compiler does 
not optimize.

Or you could put all derived words into a set first (to avoid 
duplicates, but note that you still have to generate a ton, you only 
save by storing just half a ton), then traverse it. Storing a ton, or 
even just half a ton, of these words is going to clog up memory.
The perceived growth is only because some input words cause fewer 
derived words and some other input words encountered later cause more 
derived words. So, plotting memory over time, once in a while we have 
more derived words than before, so the heap grows for that; they are 
still thrown away correctly in due course. If you input the same input 
word over and over gain, space usage is still constant.