
Hello all:

I want to generate some Hamming distance statistics about a set of strings. As explained in another e-mail to this list, I used the following code to call the functions (exampl holds the list of strings of size w):

filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]

I have two hamming functions:

-- hamming distance for variable length strings
hamming :: String -> String -> Int
hamming x y = hamming' x y 0
  where
    hamming' [] _ !c = c
    hamming' _ [] !c = c
    hamming' (x:xs) (y:ys) !c
      | x == y    = hamming' xs ys c
      | otherwise = hamming' xs ys (c + 1)

-- function posted in this mailing list
hamming2 :: String -> String -> Int
hamming2 xs ys = length (filter not (zipWith (==) xs ys))

I am executing these functions millions of times, and the bottleneck of my program is in them, as shown by running in profiling mode with +RTS -K400M -p -RTS.

The costliest function is the hamming distance:

COST CENTRE  MODULE     %time  %alloc
hamming      Distances   66.6    41.9

It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as high as 52%. I could understand that there are allocations in "hamming2" because we are creating pairs, but in the case of "hamming" there should be no allocation. How can I execute my hamming functions without allocating memory?

Best regards,

Arnoldo Muller

On Monday 19 April 2010 01:03:14, Arnoldo Muller wrote:
Hello all:
I want to generate some Hamming distance statistics about a set of strings. As explained in another e-mail to this list, I used the following code to call the functions (exampl holds the list of strings of size w):

filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]

I have two hamming functions:

-- hamming distance for variable length strings
hamming :: String -> String -> Int
hamming x y = hamming' x y 0
  where
    hamming' [] _ !c = c
    hamming' _ [] !c = c
    hamming' (x:xs) (y:ys) !c
      | x == y    = hamming' xs ys c
      | otherwise = hamming' xs ys (c + 1)

-- function posted in this mailing list
hamming2 :: String -> String -> Int
hamming2 xs ys = length (filter not (zipWith (==) xs ys))
I am executing these functions millions of times, and the bottleneck of my program is in them, as shown by running in profiling mode with +RTS -K400M -p -RTS.

The costliest function is the hamming distance:

COST CENTRE  MODULE     %time  %alloc
hamming      Distances   66.6    41.9

It says that it is performing 41% of the allocations. In the case of hamming2 the allocations go as high as 52%.
Allocations are cheap, so that's not necessarily a problem. More important is: what's the maximum residency, and how much is copied during GC? Are you compiling with -O2?
I could understand that there are allocations in "hamming2" because we are creating pairs, but in the case of "hamming" there should be no allocation.
Why not? I don't know exactly how GHC counts allocations, but every time you go from (x:xs) to xs, you need a new pointer to the tail. If that counts as allocation, hamming must allocate a lot, too.
How can I execute my hamming functions without allocating memory?
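One way to get most of the way there (a sketch, not from the thread: it swaps String for strict ByteStrings from the bytestring package, so the inner loop allocates no list cells):

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as B

-- Index-based loop over strict ByteStrings: no cons cells are
-- allocated while counting, only the final Int result.
hammingBS :: B.ByteString -> B.ByteString -> Int
hammingBS x y = go 0 0
  where
    n = min (B.length x) (B.length y)
    go !i !c
      | i >= n                     = c
      | B.index x i == B.index y i = go (i + 1) c
      | otherwise                  = go (i + 1) (c + 1)

Existing Strings can be converted with B.pack, or the input read directly with B.readFile.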
Best regards,
Arnoldo Muller

Hello Daniel:
My %GC time is 75.0% (81.4% elapsed), and I am compiling with -O2.
Thank you for clarifying about the pointers.
My memory use slowly grows and eventually explodes. I would expect the list comprehension to be evaluated lazily, so that at any given time I am only executing one hamming distance. The result of each hamming distance is stored in a small statistics datatype I built (it only stores the sums, the sums of squares, and the counts). This datatype is updated using a foldr.
I have no idea where the leak is. What do you look for in a .prof file to find a leak (hamming distance has the largest %time and %alloc)? Where in my .prof file would you start looking?
Best Regards,
Arnoldo Muller

Hello all:
I found my leak after adding some bang patterns in a different part of the program. The program was retaining all the combinations generated by the list comprehension, and therefore performance dropped very badly.
BTW, hamming is 2 times faster than hamming2.
Thank you as always!
Arnoldo

On Monday 19 April 2010 17:53:27, Arnoldo Muller wrote:
Hello Daniel:
My %GC time is 75.0% (81.4% elapsed), and I am compiling with -O2.
Very bad. Can I see the code?
Thank you for clarifying about the pointers.
Not to forget the Ints for counting.
My memory use slowly grows and eventually explodes. I would expect the list comprehension to be evaluated lazily, so that at any given time I am only executing one hamming distance. The result of each hamming distance is stored in a small statistics datatype I built (it only stores the sums, the sums of squares, and the counts). This datatype is updated using a foldr.
That might very well be the problem: if you update it with a foldr, you must construct the entire list of 2000^2 hamming thunks before the work can begin. It's probably better to use foldl' (and make the type strict) so the work can start immediately.
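Spelled out, that suggestion might look like the following (a sketch, not Arnoldo's actual code: the Stats fields follow his description of sums, sums of squares, and counts, and hamming is the function from the start of the thread):

import Data.List (foldl')

-- Strict fields: forcing the constructor forces the sums, so no
-- chain of (+) thunks can build up in the accumulator.
data Stats = Stats { sSum :: !Int, sSumSq :: !Int, sCount :: !Int }

addSample :: Stats -> Int -> Stats
addSample (Stats s sq n) d = Stats (s + d) (sq + d * d) (n + 1)

-- foldl' forces the accumulator at every step and consumes the
-- lazily produced distances one at a time, in constant space.
statsOf :: [String] -> Stats
statsOf exampl =
    foldl' addSample (Stats 0 0 0)
        [d | xs <- exampl, ys <- exampl, let d = hamming xs ys, d /= 0]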
I have no idea where the leak is. What do you look for in a .prof file to find a leak (hamming distance has the largest %time and %alloc)
For finding leaks, heap profiling (-h*) gives more info than -p. The .prof file says more about where you spend your time than about what hangs on to memory.
? Where in my .prof file would you start looking?
- use hamming instead of hamming2
- convertIntToDouble looks suspicious
- calculating a few million Hamming distances takes some time, but what about getMyStats, should that really take 25%?
filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
filter (/= 0) [hamming xs ys | xs <- exampl, ys <- exampl]

And of course, you can trivially avoid half of the work.
Best Regards,
Arnoldo Muller

Daniel, thank you for all your advice.
An additional bang pattern (!) in convertIntToDouble fixed the issue! Using a foldl' also did the trick.
Now the program runs as it should with a constant amount of memory and in a
very small amount of time.
I believe these problems are one of the major sources of frustration for
Haskell newbies. Things that work easily in <X> language suddenly
become problems in Haskell. When you overcome these issues, you feel
happy again that you chose Haskell as the main programming language of your
research project.
Is there any guide that explains more about this "bad consumption pattern"?
Are there any general rules for avoiding these issues? It helped me to
re-read the chapter on profiling in the Real World Haskell book to sort of
understand the problem. Is there a more detailed treatment of the problem
than in RWH?
Regards,
Arnoldo

Arnoldo Muller wrote:
I believe these problems are one of the major sources of frustration for Haskell newbies. Things that work easily in <X> language suddenly become problems in Haskell. When you overcome these issues, you feel happy again that you chose Haskell as the main programming language of your research project.
Well, the difference between <X> and Haskell is pretty much unavoidable. If you care about space and time usage, then there is no way around learning about lazy evaluation and Haskell's execution model.
Is there any guide that explains more about this "bad consumption pattern"? Are there any general rules for avoiding these issues? It helped me to re-read the chapter on profiling in the Real World Haskell book to sort of understand the problem. Is there a more detailed treatment of the problem than in RWH?
Two of the most commonly occurring patterns are:

1) foldl' vs foldl

2) average xs = sum xs / length xs

   vs

   average = uncurry (/) . foldl' (\(!s,!n) x -> (s+x,n+1)) (0,0)

Other than that, most Haskell books offer a clear exposition of the reduction model. For instance, there is

Graham Hutton. Programming in Haskell, chapter 12.
Richard Bird. Introduction to Functional Programming using Haskell, 2nd edition, chapter 7.

The wikibook contains some preliminary material, too:

http://en.wikibooks.org/wiki/Haskell/Graph_reduction

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
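For concreteness, the second pattern expanded into runnable form (a sketch; the names averageLazy and averageStrict are not from the text):

{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')

-- Naive version: traverses the list twice, and without optimization
-- the lazy left fold inside sum piles up unevaluated (+) thunks.
averageLazy :: [Double] -> Double
averageLazy xs = sum xs / fromIntegral (length xs)

-- One traversal; the bangs force both accumulators at every step,
-- so the fold runs in constant space.
averageStrict :: [Double] -> Double
averageStrict = uncurry (/) . foldl' (\(!s, !n) x -> (s + x, n + 1)) (0, 0)

main :: IO ()
main = print (averageStrict [1 .. 1e7])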

Arnoldo Muller wrote:
I want to generate some hamming distance statistics about a set of strings.
filter (\x -> x /= 0) $ map (uncurry hammingX) [(xs, ys) | xs <- exampl, ys <- exampl]
Another way to look at it is that you shouldn't optimize hamming itself, but rather make sure that it's called less often! For instance, your expression can be replaced by

filter (/=0) [hammingX x y | (x:xs) <- tails exampl, y <- xs]

which cuts the total running time in half. It's still quadratic in the length of exampl. I'm sure there are faster algorithms out there that can bring it down to O(n log n) if you want.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

On Monday 19 April 2010 14:13:53, Heinrich Apfelmus wrote:
Another way to look at it is that you shouldn't optimize hamming itself, but rather make sure that it's called less often!
For instance, your expression can be replaced by
filter (/=0) [hammingX x y | (x:xs) <- tails exampl, y <- xs]
which cuts the total running time in half. It's still quadratic in the length of exampl. I'm sure there are faster algorithms out there that
If it's likely that there are many repetitions, collect the Strings in a Set/Map (depending on whether you're interested in the count) and call hamming only on the distinct pairs.
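A sketch of that idea (an illustration, not Daniel's code; it assumes Data.Map from containers and the hamming function from earlier in the thread): tally duplicates first, then compute each distance once per distinct unordered pair and weight it by the multiplicities.

import Data.List (tails)
import qualified Data.Map as M

-- One (distance, multiplicity) entry per distinct unordered pair;
-- repeated strings only cost a counter increment, not a hamming call.
weightedDistances :: [String] -> [(Int, Int)]
weightedDistances exampl =
    [ (hamming x y, cx * cy)
    | (x, cx) : rest <- tails (M.toList counts)
    , (y, cy) <- rest
    ]
  where
    counts = M.fromListWith (+) [(s, 1) | s <- exampl]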
can bring it down to O(n log n) if you want.
I don't think so. You can't calculate the Hamming distance of x and z from the distances between x and y and between y and z, so you'd have to consider all pairs of distinct strings, wouldn't you? (For instance, with x = 00 and y = 01, both z = 11 and z = 00 give d(x,y) = d(y,z) = 1, yet d(x,z) is 2 in one case and 0 in the other.)
Regards, Heinrich Apfelmus

Daniel Fischer wrote:
Heinrich Apfelmus:
For instance, your expression can be replaced by
filter (/=0) [hammingX x y | (x:xs) <- tails exampl, y <- xs]
which cuts the total running time in half. It's still quadratic in the length of exampl. I'm sure there are faster algorithms out there that can bring it down to O(n log n) if you want.
I don't think so. You can't calculate the Hamming distance of x and z from the distances between x and y and y and z, so you'd have to consider all pairs of distinct strings, wouldn't you?
And there I was sure about something once, only to see that it's actually really doubtful... ;)

The thing about the Hamming distance is that it's not a black box, so you can't get a lower bound by counting the minimum number of calls to hamming that have to be made. You are essentially arguing that the different Hamming distances are independent, which they are not. Not to mention that there are also "black-box" restrictions like the triangle inequality

  d(x,z) <= d(x,y) + d(y,z)

but that one is probably harmless. In short, the situation is similar to how the O(n*log n) sorting bound does not apply to radix sort.

Still, you are right to question my O(n*log n) claim; so far, my attempts at finding such an algorithm myself have failed.

More precisely, the goal is to make a histogram of the possible Hamming distances. We need at least O(n*w) time to do that, where n is the number of strings and w their maximum length; after all, we have to "touch" every character. For simplicity, assume that the characters are just one bit each. Furthermore, we can assume that w <= log n; otherwise there are lots of duplicate strings, which can be grouped together. In this notation, the simple algorithm takes O(n^2*w) time.

I did find a straightforward divide-and-conquer algorithm to tally small Hamming distances, but it's not good enough for the whole histogram. Here's the specification:

  countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
  countHamming d xs ys = length [() | x <- xs, y <- ys, hamming x y == d]

In other words, countHamming d xs ys counts the number of pairings (x,y) whose Hamming distance is exactly d.

Now, the idea is that this can be made faster for small d. For instance, for d == 0, we are essentially just counting the pairs of equal elements of xs and ys. By requiring that xs and ys be sorted, this can be done in linear time:

  countHamming 0 xs ys = ... a variation on merge xs ys

And for slightly larger d, we can partition the lists by their first bits and continue recursively:

  countHamming _ [] [] = 0
  countHamming d xs ys =
        countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
      + countHamming d x0 y0 + countHamming d x1 y1
    where
      (x0,x1) = partitionByHead xs
      (y0,y1) = partitionByHead ys

      partitionByHead xs = (behead True xs, behead False xs)
      behead b xs = [bs | (b':bs) <- xs, b == b']

To estimate the running time, we set n = length (xs ++ ys) and let

  T(d,n) = running time of countHamming d xs ys

We started with T(0,n) = O(n) and want to discuss the recursive case. The idea is that each list is divided in half, so we get

  T(d,n) = 2*T(d-1,n/2) + 2*T(d,n/2)

From this, we can calculate

  T(1,n) = 2*T(0,n/2) + 2*T(1,n/2)
         = O(n) + 2*T(1,n/2)        -- remember quicksort!
         = O(n*log n)

  T(2,n) = O(n*log n) + 2*T(2,n/2)
         = O(n*(log n)^2)

and so on, yielding T(d,n) = O(n*(log n)^d).

Alas, this can be used to search a dictionary while accounting for spelling errors, but it's no good for calculating a full histogram, because it becomes prohibitively expensive when d ~ w/2 ~ (log n)/2.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
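For the curious, one self-contained way to complete the sketch (the merge-based d == 0 case, the extra empty-list base cases, and the test data are additions not present in Heinrich's message; both input lists must be sorted, strings are assumed to have equal length, and hammingB is the naive reference in the style of hamming2):

import Data.List (sort)

-- Naive Hamming distance on bit strings, used as the reference.
hammingB :: [Bool] -> [Bool] -> Int
hammingB xs ys = length (filter not (zipWith (==) xs ys))

-- Count pairs (x, y) with x == y, both lists sorted: the merge-like
-- walk Heinrich alludes to for the d == 0 base case.
countEqual :: Ord a => [a] -> [a] -> Int
countEqual xs@(x:_) ys@(y:_) =
    case compare x y of
      LT -> countEqual (dropWhile (== x) xs) ys
      GT -> countEqual xs (dropWhile (== y) ys)
      EQ -> let (ex, xs') = span (== x) xs
                (ey, ys') = span (== y) ys
            in  length ex * length ey + countEqual xs' ys'
countEqual _ _ = 0

countHamming :: Int -> [[Bool]] -> [[Bool]] -> Int
countHamming _ [] _  = 0
countHamming _ _  [] = 0
countHamming 0 xs ys = countEqual xs ys
countHamming d xs ys =
      countHamming (d-1) x0 y1 + countHamming (d-1) x1 y0
    + countHamming d x0 y0 + countHamming d x1 y1
  where
    (x0, x1) = partitionByHead xs
    (y0, y1) = partitionByHead ys
    partitionByHead zs = (behead True zs, behead False zs)
    behead b zs = [bs | (b':bs) <- zs, b == b']

main :: IO ()
main = do
    let strs = sort [ [False,True,True], [True,True,False]
                    , [False,True,False], [False,True,True] ]
    -- Compare against the naive all-pairs count for each distance.
    print [ ( d
            , countHamming d strs strs
            , length [() | x <- strs, y <- strs, hammingB x y == d] )
          | d <- [0..3] ]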
participants (3):
- Arnoldo Muller
- Daniel Fischer
- Heinrich Apfelmus