
I tried using WordSet = [String] (plus the corresponding changes in the code) and got a great speedup, actually way more than 3x. There was also a memory growth phenomenon when using Set String, and replacing it by [String] stops that too; now it runs in constant space (constant = 20M). Part of the speedup can probably be attributed to GHC's excellent rewrite rules for lists; however, I cannot explain the memory growth when using Set.

Regarding the local WordFreq map under "train", I am shocked that ghc -O is smart enough to notice it and perform proper sharing, so that only one copy is ever created. Nonetheless, I still decided to factor "train" into two functions, one that builds the WordFreq and one that queries it, which eases blame analysis when necessary.

On the interact line, I use "tokens" to break up the input; since it is already written (for the trainer), I may as well reuse it.

When reading holmes.txt, be aware that it is in UTF-8, while GHC still assumes ISO-8859-1. This will affect the results.

I have not checked the correctness of edits1.

I am monochrom. My modification is attached.
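
For readers without the attachment, here is a minimal sketch of the split of "train" described above. It is not the attached code. WordSet, WordFreq and tokens are the names used in the post; buildFreq, lookupFreq and known are hypothetical names picked for illustration, and the body of tokens is only a guess at what the trainer does. The explicit hSetEncoding call assumes a GHC whose System.IO provides it, which may be newer than the one discussed here.

```haskell
module Main where

import Data.Char (isAlpha, toLower)
import Data.List (foldl')
import qualified Data.Map.Strict as Map
import System.IO (IOMode (ReadMode), hGetContents, hSetEncoding, openFile, utf8)

-- Candidate words as a plain list instead of Set String.
type WordSet = [String]

-- Word -> number of occurrences in the training text.
type WordFreq = Map.Map String Int

-- Assumed tokenizer: lowercase everything, treat non-letters as separators.
tokens :: String -> [String]
tokens = words . map (\c -> if isAlpha c then toLower c else ' ')

-- First half of "train" (hypothetical name): build the frequency map once.
buildFreq :: [String] -> WordFreq
buildFreq = foldl' (\m w -> Map.insertWith (+) w 1 m) Map.empty

-- Second half of "train" (hypothetical name): query an already built map.
lookupFreq :: WordFreq -> String -> Int
lookupFreq freq w = Map.findWithDefault 0 w freq

-- Keep only the candidates that occur in the training text (hypothetical name).
known :: WordFreq -> WordSet -> WordSet
known freq = filter (`Map.member` freq)

main :: IO ()
main = do
  h <- openFile "holmes.txt" ReadMode
  hSetEncoding h utf8            -- state the file's encoding explicitly
  corpus <- hGetContents h
  let freq = buildFreq (tokens corpus)
  -- Reuse tokens on stdin; print each input word with its training count.
  interact (unlines . map (\w -> w ++ " " ++ show (lookupFreq freq w)) . tokens)
```

With the map built in one place and passed explicitly to the functions that query it, the sharing no longer depends on what the optimizer does with a local definition, and each half can be profiled separately.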