On 20 April 2012 17:58, Lorenzo Bolla <lbolla@gmail.com> wrote:
Maybe you could also time with datasets of increasing size (up to 1M),
and see if the execution time grows like O(n^2), in which case I'd say
it's a hashing problem...

Testing with 1/4 and 1/2 the original data is definitely an excellent thing to do.
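For example, something along these lines would time just the set construction over growing prefixes of the input (a rough sketch of my own - the file name, the sizes, and the use of Data.HashSet are all stand-ins, not taken from your code):

    import qualified Data.ByteString.Char8 as B
    import qualified Data.HashSet as HashSet
    import Data.Time.Clock (getCurrentTime, diffUTCTime)

    main :: IO ()
    main = do
      contents <- B.readFile "data.txt"            -- assumed file name
      let ls = B.lines contents
      print (length ls)                            -- force the line splitting up front
      mapM_ (timeBuild ls) [250000, 500000, 1000000]

    -- Build the set from the first n lines and report the wall-clock time.
    timeBuild :: [B.ByteString] -> Int -> IO ()
    timeBuild ls n = do
      start <- getCurrentTime
      let s = HashSet.fromList (take n ls)
      HashSet.size s `seq` return ()               -- force the whole set
      end <- getCurrentTime
      putStrLn (show n ++ " lines: " ++ show (diffUTCTime end start))

If the last timing is a lot more than twice the middle one, that points at quadratic behaviour (bad hashing or excessive collisions); if they scale roughly linearly, look elsewhere.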

Also, in the original post it sounded as if **with the same data set**

1 - creating the hash with the data already in memory is fast

2 - reading the data is fast

3 - doing both is very, very slow

...but was this true? Or was the in-memory data set different? If it was, then HOW was it different?

Other thought: you printed the last line to force the reading of the set in the file read test, ***but would this have forced the entire set to be converted into ByteStrings***?

Even more than that: I'm no Haskell expert, but I can claim to have debugged a fair amount of weird code. Such enormously long read times are more typical of IO or, possibly, GC weirdness than of any plausible flaw in a hashing algorithm. In fact, it's almost impossible to see how an algorithmic flaw could generate such a vast runtime without provoking GC weirdness along the way, and even then it would be pretty amazing.
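To make the laziness worry concrete, here's the sort of thing I mean (a toy example of my own, not your benchmark code): printing the last line forces the spine of the list, but not necessarily the conversion of the earlier lines.

    import qualified Data.ByteString.Char8 as B

    main :: IO ()
    main = do
      raw <- readFile "data.txt"          -- lazy String IO; file name assumed
      let bs = map B.pack (lines raw)     -- the conversions are built as thunks
      B.putStrLn (last bs)                -- forces the list spine, but only the
                                          --   *last* pack actually runs here
      print (sum (map B.length bs))       -- touching every element forces every
                                          --   conversion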

So I'd try reading in the strings, converting them to ByteStrings, and then doing something simple with them that doesn't involve hashing - say, xoring every byte within each ByteString and then xoring those per-string results together: some operation that definitely forces every ByteString to be fully materialised.
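Something along these lines is what I have in mind - only a sketch (made-up file name, untested against your data), but it touches every byte of every line without hashing anything:

    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString as BS
    import Data.Bits (xor)
    import Data.List (foldl')
    import Data.Word (Word8)

    -- Fold each line down to one byte, then combine those bytes, so every
    -- byte of every line has to be materialised.
    xorAll :: [B.ByteString] -> Word8
    xorAll = foldl' (\acc b -> acc `xor` BS.foldl' xor 0 b) 0

    main :: IO ()
    main = do
      contents <- B.readFile "data.txt"   -- assumed file name
      let ls = B.lines contents
      print (xorAll ls)                   -- forces every byte of every line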

You might open a system monitor and watch memory consumption, too.