
Albert Y. C. Lai wrote:
I try using WordSet = [String] (plus corresponding change in code) and get great speedup, actually way more than 3x. There was also a memory growth phenomenon using Set String, and replacement by [String] stops that too, now it's constant space (constant = 20M). It is possible to attribute part of the speedup to excellent rewrite rules in GHC regarding lists; however, I cannot explain the memory growth when using Set.
Now I see. The higher memory usage of the Set version is not growth or leak, just brutal. For each input word, a ton of derived words are generated, and an optimal one is chosen (by a criterion that's essentially a mapping from words to scores). There are two ways to do this. You could put the derived words into a lazy list and then traverse it, and therefore they're generated, examined, and thrown away on demand. At most two words are ever going to be stored (the best one so far and the next candidate), maybe plus a couple of cons cells if the compiler does not optimize. Or you could put all derived words into a set first (to avoid duplicates, but note that you still have to generate a ton, you only save by storing just half a ton), then traverse it. Storing a ton, or even just half a ton, of these words is going to clog up memory. The perceived growth is only because some input words cause fewer derived words and some other input words encountered later cause more derived words. So, plotting memory over time, once in a while we have more derived words than before, so the heap grows for that; they are still thrown away correctly in due course. If you input the same input word over and over gain, space usage is still constant.