Fwd: Implementing a spellchecker - problem with Data.HashTable performance

Hello fellow Haskellers,

It's my first post to this list, so first of all: greetings to everyone!

I've wanted to pick up Haskell for some time and I ran into a perfect (or so it seemed) occasion at my Natural Language Processing course. At first I tried using Python, but it turned out that some extra performance wouldn't hurt, and I hoped Haskell would give me a boost without forcing me to sacrifice high-level expressiveness at the same time.

The first lesson was to use ByteStrings instead of ordinary Strings for IO. Then I needed to build a dictionary of the words found in the source text. That's the problematic part: I've tried a few structures, namely Data.Map and Data.HashTable from the base package and Data.HashMap from unordered-containers, and in all cases I couldn't reach acceptable performance. Given that, I'd like to ask for advice and comments on the following examples.

The first example [1] just builds a HashTable of ~1,500,000 ByteStrings; the program takes about 11 seconds to execute. The second example [2] does roughly the same, but takes the HashTable data from a file (a dictionary of valid Polish word forms). Reading the data itself is not a problem: I've measured the execution time of a program which just reads the file (and prints its last line to actually force the read and thus "handle" Haskell's laziness), and it takes no more than a second. Coupling reading and building together, however, takes unbearably long. I didn't wait for the program to finish and terminated it, but it had already been running for ~10 minutes.

The most disappointing fact is that a roughly equivalent Python test (and rather effortless to write, in contrast to the Haskell example - though I'm just learning), which just reads everything into a dict, takes about 10 seconds to execute (reading and updating the dict, of course). Writing the spellchecker in Python seemed pretty straightforward to me, but it's certainly not the only language I know; I've had some Java (who hasn't nowadays), some system-level C, and I work daily in Erlang. I understand, of course, that the most convenient tool is generally the one you know best, but my experience writing the Haskell equivalent reached both extremes: effortless expression of the logic, but frustration because of the (to me) inexplicable performance behaviour.

The question is: why is it so slow? (or AM I RILY DAT DUMB? ;)

The links to the examples:
[1] https://gist.github.com/2411699 TestHashTableArtificial
[2] https://gist.github.com/2411699 TestReadIntoHashTable

I'm looking forward to somebody shedding some light on this. Thanks in advance!

Best regards,
Radek Szymczyszyn
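[The gists themselves aren't quoted in this thread. For context, here is a rough sketch of what the read-and-build test could look like with the legacy Data.HashTable API; the hash function (borrowed from the hashable package), the size hint value and the file name are assumptions for illustration, not the actual gist code.]

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashTable as H
    import qualified Data.Hashable as Hash
    import Data.Int (Int32)
    import Control.Monad (forM_)

    -- Hypothetical hash for ByteString keys: the legacy API wants
    -- key -> Int32, so we reuse the hashable package's hash here.
    hashBS :: B.ByteString -> Int32
    hashBS = fromIntegral . Hash.hash

    main :: IO ()
    main = do
      contents <- B.readFile "formy.utf8"
      let ws = B.lines contents
      -- preallocate with a size hint roughly matching the dictionary size
      ht <- H.newHint (==) hashBS 1500000
      forM_ ws $ \w -> H.update ht w ()
      -- look up the last word just to show the table is populated
      r <- H.lookup ht (last ws)
      print r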

Hi!

Did you try with -O2? I created a file "formy.utf8" filled with 1,400,000 lines of the word "żyźniejszymi", and here is my output:

    $ ghc --make -O2 TestReadIntoHashTable.hs -o test -rtsopts
    $ time ./test +RTS -K100m -RTS
    1400000

    real    0m3.265s
    user    0m3.097s
    sys     0m0.143s

Best,
Karol Samborski

I seem to recall that Data.HashTable is quite slow and not recommended for use. Try using Data.HashMap.Lazy/Strict from the unordered-containers package instead.

-Brent
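[Following Brent's suggestion, a minimal sketch of building the dictionary with Data.HashMap.Strict. The file name and the () payload are assumptions; the Hashable instance for ByteString comes from the hashable package that unordered-containers depends on.]

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict as HM

    main :: IO ()
    main = do
      contents <- B.readFile "formy.utf8"
      -- one entry per word; for a plain dictionary the value can just be ()
      let dict = HM.fromList [ (w, ()) | w <- B.lines contents ]
      -- print the size to force the map to be built
      print (HM.size dict)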

While I think Misters Samborski and Yorgey have fixed the performance problem, I'm still curious about the apparent synergy effect:
First example [1] just builds a HashTable of ~1 500 000 ByteStrings.
The program takes about 11 seconds to execute.
Reading data itself is not a problem, as I've measured the execution time of a program which just reads the file (and prints its last line to actually force the read, hence "handle" Haskell's laziness) and it takes no more than a second.
Coupled together reading and building, however, takes unbearably long. I didn't wait till the program stopped and just terminated it, but it had been running for ~10 minutes already.
???

On 20 April 2012 at 15:45, umptious wrote:
While I think Misters Samborski and Yorgey have fixed the performance problem, I'm still curious about the apparent synergy effect:
Hi, I tested an example program for exactly that case in my previous message. With the -O2 flag it took about 3 seconds.

Best,
Karol Samborski

Thanks for your suggestions. Alas, they don't solve the problem.

As I was at work without the original data file, I repeated the test suggested by Karol Samborski with a file of 1,400,000 repetitions of "żyźniejszymi". It took about 3.5s, so I thought my problem had been solved. However, repeating it with -O2 only makes a difference of ~2-3s, and I don't believe the laptop I used at home is *that much slower* than my Mac at work that running without optimization would make such a great difference.

Now I've rerun the test with the original data file (still at work, so the comparison with 3.5s is fair) at 17:26 and it's still running -- so the problem lies in the data set being hashed. I don't know why, but it seems:
- either it makes a difference whether one specific word or many different words are hashed,
- or whether just one slot or many slots of the HashTable are being updated (but as I'm using newHint, the space should be preallocated).

Either way, I would be grateful if Karol or somebody else could rerun the test with the original data. It's available at: http://ernie.icslab.agh.edu.pl/~lavrin/formy.utf8.gz

Thanks for your time!

Regards,
Radek Szymczyszyn

Maybe you could also time with datasets of increasing size (up to 1M) and see if the execution time grows like O(n^2), in which case I'd say it's a hashing problem...
-- Lorenzo Bolla http://lbolla.info
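[A possible way to run Lorenzo's experiment, assuming the test program is first adapted to take the dictionary path as a command-line argument (the version discussed in the thread appears to hard-code it):]

    $ head -n  350000 formy.utf8 > formy_q.utf8   # roughly a quarter of the data
    $ head -n  700000 formy.utf8 > formy_h.utf8   # roughly half of the data
    $ time ./test formy_q.utf8 +RTS -K100m -RTS
    $ time ./test formy_h.utf8 +RTS -K100m -RTS
    $ time ./test formy.utf8   +RTS -K100m -RTS

If the runtime roughly quadruples whenever the input doubles, the growth is quadratic and the hashing is the prime suspect.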

On 20 April 2012 at 17:58, Lorenzo Bolla wrote:
Maybe you could also time with datasets of increasing size (up to 1M), and see if the execution time grows like O(n^2), in which case I'd say it's a hashing problem...
Testing with 1/4 and 1/2 of the original data is definitely an excellent thing to do.

Also, in the original post it sounded as if, **with the same data set**:

1 - creating the hash with the data already in memory is fast
2 - reading the data is fast
3 - doing both is very, very slow

..but was this true? Or was the in-memory data set different? If it was different, then HOW was it different?

Other thought: you printed the last line to force the reading of the set in the file-read test, ***but would this have forced the entire set to be converted into ByteStrings***?

Even more than that - I'm no Haskell expert, but I can claim to have debugged a fair amount of weird code. Such enormously long read times are more typical of IO or, possibly, GC weirdness than of any possible flaw in a hashing algorithm. In fact, it's almost impossible to see how an algorithmic flaw could generate such a vast runtime without provoking GC weirdness, and even then it would be pretty amazing.

So I'd try reading in the strings, converting them to ByteStrings, and then doing something simple to them that doesn't involve hashing - like xoring every byte in a ByteString and then the result of every string with every other: some operation that definitely forces every ByteString to be present in ByteString form (a sketch of this is below). You might open a system monitor and watch memory consumption, too.
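[A sketch of that kind of forcing test, assuming the same formy.utf8 file; this is a hypothetical program, not one of the original gists.]

    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Char8 as BC
    import Data.Bits (xor)
    import Data.List (foldl')
    import Data.Word (Word8)

    main :: IO ()
    main = do
      contents <- BC.readFile "formy.utf8"
      let ws = BC.lines contents
          -- xor every byte of every word into a single accumulator;
          -- printing it forces every ByteString to be materialised,
          -- with no hashing involved
          acc = foldl' (\a w -> BS.foldl' xor a w) (0 :: Word8) ws
      print acc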

Thank you all for your thoughts and suggestions. I've been tracking them as they appeared but, being busy with university assignments, couldn't try them out yet.

In the meantime, however, Karol Samborski investigated the issue further and found the cause of the poor performance -- somehow using HashTable.newHint doesn't play well with HashTable.update. Simply changing newHint to hint gives me an execution time of about 12s (on my rather slowish AMD Fusion 1.6GHz laptop) with a dataset of ~1.35 million words.

It's evidently a bug, as newHint would hardly be expected to slow things down. I'm yet to discover how to properly report it.

Thanks again to all who responded, and especially to Karol for solving the problem.

Best regards,
Radek Szymczyszyn

Hi,

Your post is very interesting. I'm also a beginner, so I can't offer you many suggestions. I only saw this book the other day and think it might be helpful for you:

Computational Semantics with Functional Programming
http://www.amazon.com/Computational-Semantics-Functional-Programming-Eijck/d...

This book uses Haskell as the programming language.

Best regards

On Sun, Apr 22, 2012 at 4:58 PM, Radosław Szymczyszyn wrote:
Thank you all for thoughts and suggestions. I've been tracking them as they appeared, but being busy with university assignments, couldn't try them out yet.
In the meantime, however, Karol Samborski investigated the issue further and found the cause of poor performance -- somehow using HashTable.newHint doesn't play well with HashTable.update. Simply changing newHint to hint gives me execution time of about 12s (on my rather slowish AMD Fusion 1.6GHz laptop) with a dataset of ~1.35mln words.
While it's good that you have found a bug that explains the very poor performance you got, I must reiterate Brent Yorgey's advice to use Data.HashMap from unordered-containers or some other performance-oriented implementation. Data.HashTable is a legacy data structure that was mainly introduced so that Haskell's core libraries would have a hashtable at all; it ran into some GHC performance problems (only reduced in very recent releases, I believe) and was never written to be very fast in the first place... Avoiding it is generally a very good idea! Using another structure (HashMap or some trie implementation) would probably get you much better performance without hassle.

I must admit that having Data.HashTable in the base library without a warning as to its unsuitability is very misleading to a beginner coming from Perl or Python.

-- Jedaï
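[For completeness, a sketch of what the spell-checking lookup could look like on top of unordered-containers. Every name here (buildDict, isKnown, the file path, the sample words) is an assumption for illustration, not code from the thread.]

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashMap.Strict as HM

    type Dict = HM.HashMap B.ByteString ()

    -- build the dictionary from a file with one word per line
    buildDict :: FilePath -> IO Dict
    buildDict path = do
      contents <- B.readFile path
      return (HM.fromList [ (w, ()) | w <- B.lines contents ])

    -- a word is "known" if it appears in the dictionary
    isKnown :: Dict -> B.ByteString -> Bool
    isKnown dict w = HM.member w dict

    main :: IO ()
    main = do
      dict <- buildDict "formy.utf8"
      mapM_ (print . isKnown dict . B.pack) ["kot", "xyzzy"]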
participants (7)
- Brent Yorgey
- Chaddaï Fouché
- Karol Samborski
- Lorenzo Bolla
- Osager Prairie
- Radosław Szymczyszyn
- umptious