Re: [Haskell-cafe] hamming distance allocation

Subject: Re: [Haskell-cafe] hamming distance allocation
Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
Hello all:
I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call the functions: (exampl holds the list of strings of size w) filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
I have two hamming functions: -- hamming distance for variable length strings hamming :: String -> String -> Int hamming x y = hamming' x y 0 where hamming' [] _ !c = c hamming' _ [] !c = c hamming' (x:xs) (y:ys) !c | x == y = hamming' xs ys c | otherwise = hamming' xs ys (c + 1)
-- function posted in this mailing list hamming2 :: String -> String -> Int hamming2 xs ys = length (filter not (zipWith (==) xs ys))
I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS
The costlier function is the hamming distance COST CENTRE MODULE %time %alloc
hamming Distances 66.6 41.9
It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%.
Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ?
I could understand that there are allocations in "hamming2" because we are creating pairs, but in the case of "hamming" there should be no allocation.
Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too.
Is it really necessary to use Strings? I think a packed type, e.g. Vector or ByteString, would be much more efficient here. Of course this is only likely to be a benefit if you can move away from String entirely. I suspect that "hamming2" would perform better then. John

Am Montag 19 April 2010 14:37:33 schrieb John Lato:
Is it really necessary to use Strings? I think a packed type, e.g. Vector or ByteString, would be much more efficient here.
Not very much if the strings are fairly short (and the list isn't too long, so there's not a big difference in cache-friendliness). If eight-bit characters aren't enough, packing the strings into UArray Int Char gives performance quite close to ByteStrings.
Of course this is only likely to be a benefit if you can move away from String entirely.
I suspect that "hamming2" would perform better then.
John

The strings will not be longer than 30 characters.
I am doing sets of 2000 (total of 2000^2 distance computations)
I am expecting that all the operations will be lazyly performed but at some
point I get a memory error.
Most of the memory is being allocated for the hamming distance and I am
still unable to find the source of my memory leak.
Regards,
Arnoldo
On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer
Am Montag 19 April 2010 14:37:33 schrieb John Lato:
Is it really necessary to use Strings? I think a packed type, e.g. Vector or ByteString, would be much more efficient here.
Not very much if the strings are fairly short (and the list isn't too long, so there's not a big difference in cache-friendliness). If eight-bit characters aren't enough, packing the strings into UArray Int Char gives performance quite close to ByteStrings.
Of course this is only likely to be a benefit if you can move away from String entirely.
I suspect that "hamming2" would perform better then.
John
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Am Montag 19 April 2010 17:17:11 schrieb Arnoldo Muller:
The strings will not be longer than 30 characters.
For 20 -30 character strings, using ByteStrings should be better, in my tests about 40% faster, allocation figures slightly lower, resident memory much lower and bytes copied during GC very much lower. For a sample of english text (many short words, few long), ByteStrings were about 25% faster, allocation figures very slightly lower, resident memory much lower, bytes copied much lower (relative difference not as large as for longer strings).
I am doing sets of 2000 (total of 2000^2 distance computations)
That shouldn't give memory problems either way.
I am expecting that all the operations will be lazyly performed but at some point I get a memory error.
My guess is a bad consumption pattern.
Most of the memory is being allocated for the hamming distance and I am still unable to find the source of my memory leak.
Allocation as such is not a problem, resident memory is the important thing. Try heap profiling to see what holds on to memory (+RTS -hc would be a good first run).
Regards,
Arnoldo
On Mon, Apr 19, 2010 at 3:47 PM, Daniel Fischer
Am Montag 19 April 2010 14:37:33 schrieb John Lato:
Is it really necessary to use Strings? I think a packed type, e.g. Vector or ByteString, would be much more efficient here.
Not very much if the strings are fairly short (and the list isn't too long, so there's not a big difference in cache-friendliness). If eight-bit characters aren't enough, packing the strings into UArray Int Char gives performance quite close to ByteStrings.
Of course this is only likely to be a benefit if you can move away from String entirely.
I suspect that "hamming2" would perform better then.
John

Hello John:
Well I could use a packed type. The only letters that will be found in the
string are "ATCG" so yeah I don't need unicode and those things.
Will try out with vector or ByteString. Thanks! :)
On Mon, Apr 19, 2010 at 2:37 PM, John Lato
Subject: Re: [Haskell-cafe] hamming distance allocation
Am Montag 19 April 2010 01:03:14 schrieb Arnoldo Muller:
Hello all:
I want to generate some hamming distance statistics about a set of strings. As explained in another e-mail in this list, I used the following code to call the functions: (exampl holds the list of strings of size w) filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
I have two hamming functions: -- hamming distance for variable length strings hamming :: String -> String -> Int hamming x y = hamming' x y 0 where hamming' [] _ !c = c hamming' _ [] !c = c hamming' (x:xs) (y:ys) !c | x == y = hamming' xs ys c | otherwise = hamming' xs ys (c + 1)
-- function posted in this mailing list hamming2 :: String -> String -> Int hamming2 xs ys = length (filter not (zipWith (==) xs ys))
I am executing these functions millions of times and the bottleneck of my program is in them as explained by running in profiling mode with +RTS -K400M -p -RTS
The costlier function is the hamming distance COST CENTRE MODULE %time %alloc
hamming Distances 66.6 41.9
It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as far as 52%.
Allocations are cheap, so that's not necessarily a problem. More important is, what's the maximum residency and how much is copied during GC? Are you compiling with -O2 ?
I could understand that there are allocations in "hamming2" because we are creating pairs, but in the case of "hamming" there should be no allocation.
Why not? I don't know how GHC counts allocations, but everytime you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too.
Is it really necessary to use Strings? I think a packed type, e.g. Vector or ByteString, would be much more efficient here. Of course this is only likely to be a benefit if you can move away from String entirely.
I suspect that "hamming2" would perform better then.
John _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (3)
-
Arnoldo Muller
-
Daniel Fischer
-
John Lato