Help optimising a Haskell program

Hi,

I have a Haskell program I'm trying to optimise and could use some assistance. It's for rank aggregation - taking a bunch of partial rankings of some items from users and turning them into an overall ranking (aka "That thing that Hammer Principle does").

The code is here: https://github.com/DRMacIver/hs-rank-aggregation. The interesting code lives in the Algorithms.RankAggregation module. The sample data I'm running it on is samples/electornot (voting data from http://electornot.org.uk/). Here's a recent profile: https://gist.github.com/1ee9356f45330e9f8caa

It's doing pretty well on this data, in that it's taking 8 seconds for 180k records (down from about 80 seconds earlier today) on my computer, but I'd like it to be faster - ideally under a second. Admittedly this is an arbitrary figure and I will just move the goal posts if I achieve it, but it would still be nice. :-)

I suspect it may be hitting the limits of what can be improved without actually changing the algorithm (which I'm open to doing, but not if it makes the results significantly worse), but the last three or four times I've thought that, it was shortly followed by an optimisation which halved the runtime or better.

In particular, the most likely avenue for improvement is dealing with the fact that it allocates nearly 5GB of data (which appears to be mostly transient - watching the process as it runs, the memory usage never seems to rise above 150MB, which is large but unproblematic). The biggest culprit for this is divideAndConquer, which is probably generating a silly number of intermediate lists, but it's not obvious how to fix that - maybe by making the tree structure in the recursion more explicit? I don't know. Suggestions very welcome.

Regards,
David
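P.S. For concreteness, the shape of recursion I'm talking about is roughly the following. This is a simplified sketch, not the actual divideAndConquer from Algorithms.RankAggregation (the names and types here are illustrative only); the point is just that every node of the implicit tree allocates a fresh merged list.

-- Simplified sketch only; the real divideAndConquer differs in detail.
-- Each recursive call allocates a new list when the halves are merged,
-- which is where the large transient allocation comes from.
divideAndConquer :: ([a] -> ([a], [a]))   -- split into two sub-problems
                 -> ([a] -> [a] -> [a])   -- merge the solved halves
                 -> [a] -> [a]
divideAndConquer split merge = go
  where
    go []  = []
    go [x] = [x]
    go xs  = let (ls, rs) = split xs
             in merge (go ls) (go rs)   -- one intermediate list per node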

On Tue, Mar 22, 2011 at 00:59, David MacIver wrote:
It's for rank aggregation - taking a bunch of partial rankings of some items from users and turning them into an overall ranking (aka "That thing that Hammer Principle does").
Two questions immediately beg themselves:

* Can we go parallel? :P

* What does +RTS -s -RTS say? Specifically, what is the current productivity? Do we get an improvement with +RTS -A2m -H128m -RTS? (Force the heap to be somewhat up there from day one; perhaps try -H256m.)

-- J.

On 22 March 2011 02:00, Jesper Louis Andersen wrote:
On Tue, Mar 22, 2011 at 00:59, David MacIver wrote:
It's for rank aggregation - taking a bunch of partial rankings of some items from users and turning them into an overall ranking (aka "That thing that Hammer Principle does").
Two questions immediately beg themselves:
* Can we go parallel? :P
Maybe. A lot of this is inherently sequential. Some bits are parallelisable, but my initial attempts at exploiting that made very little performance difference. I'd rather exhaust what I can from single-core performance first.
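The obvious thing to try looks something like this (a sketch using Control.Parallel.Strategies; solvePart is a stand-in name for whatever independent per-chunk work exists, not a function from the actual module):

import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)

-- Sketch: evaluate each independent piece of work in a parallel spark.
-- solvePart is a placeholder, not a function from the real code.
solveAll :: NFData b => (a -> b) -> [a] -> [b]
solveAll solvePart jobs = parMap rdeepseq solvePart jobs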
* What does +RTS -s -RTS say? Specifically, what is the current productivity?
./rank +RTS -s

3,466,696,368 bytes allocated in the heap
212,888,240 bytes copied during GC
51,949,568 bytes maximum residency (10 sample(s))
5,477,016 bytes maximum slop
105 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 6546 collections, 0 parallel, 0.93s, 0.93s elapsed
Generation 1: 10 collections, 0 parallel, 0.32s, 0.32s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    7.11s  (  7.12s elapsed)
GC    time    1.25s  (  1.25s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    8.37s  (  8.37s elapsed)

%GC time      15.0%  (15.0% elapsed)

Alloc rate    487,319,292 bytes per MUT second

Productivity  85.0% of total user, 85.0% of total elapsed

So if I'm reading this right, my hypothesis that allocation was most of the cost seems to be wrong? I don't know how much of that MUT time is allocation, but I'd expect it to be < GC time.
Do we get an improvement with +RTS -A2m -H128m -RTS? (Force the heap to be somewhat up there from day one; perhaps try -H256m.)
This seems to consistently give about a 0.4s improvement, which isn't nothing but isn't a particularly interesting chunk of 8s (actually it's 8.4s -> 8s). Setting it to 256M doesn't make any difference.

On Tue, Mar 22, 2011 at 1:11 AM, David MacIver wrote:
On 22 March 2011 02:00, Jesper Louis Andersen wrote:
On Tue, Mar 22, 2011 at 00:59, David MacIver wrote:
It's for rank aggregation - taking a bunch of partial rankings of some items from users and turning them into an overall ranking (aka "That thing that Hammer Principle does").
Two questions immediately beg themselves:
* Can we go parallel? :P
Maybe. A lot of this is inherently sequential. Some bits are parallelisable, but my initial attempts at exploiting that made very little performance difference. I'd rather exhaust what I can from single-core performance first.
* What does +RTS -s -RTS say? Specifically, what is the current productivity?
./rank +RTS -s

3,466,696,368 bytes allocated in the heap
212,888,240 bytes copied during GC
51,949,568 bytes maximum residency (10 sample(s))
5,477,016 bytes maximum slop
105 MB total memory in use (0 MB lost due to fragmentation)

Generation 0: 6546 collections, 0 parallel, 0.93s, 0.93s elapsed
Generation 1: 10 collections, 0 parallel, 0.32s, 0.32s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    7.11s  (  7.12s elapsed)
GC    time    1.25s  (  1.25s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    8.37s  (  8.37s elapsed)
%GC time 15.0% (15.0% elapsed)
Alloc rate 487,319,292 bytes per MUT second
Productivity 85.0% of total user, 85.0% of total elapsed
So if I'm reading this right, my hypothesis that allocation was most of the cost seems to be wrong? I don't know how much of that MUT time is allocation, but I'd expect it to be < GC time.
Do we get an improvement with +RTS -A2m -H128m -RTS? (Force the heap to be somewhat up there from day one; perhaps try -H256m.)
This seems to consistently give about a 0.4s improvement, which isn't nothing but isn't a particularly interesting chunk of 8s (actually it's 8.4s -> 8s). Setting it to 256M doesn't make any difference.
You should use criterion to make sure that your assessments of the performance difference are grounded in statistically robust reasoning :) Progression may also help.

GHC 7 has a new RTS option, +RTS -H -RTS, that gives you a good high default for these memory numbers. You might give it a go, but I doubt it's going to help much here.

One thing I noticed was that you calculate the length of elts more than once inside aggregate. It's probably not going to account for a big boost, but I'd recommend doing it only once. Something like:

let lenelts = length elts

Then use lenelts everywhere.

Also, it seems like you use Data.Set to sort and uniquify the input and then throw away the Set; that's potentially costly.

There are probably better types for aggregate than [[a]] -> [a]. You use Arrays, but it's possible that a different vector library, such as Data.Vector, gives better performance here. Or perhaps you want that instead of lists everywhere.

For example, you build an index and a reverse index. It seems like with an array or vector you could eliminate at least one index and have O(1) lookup time instead of Data.Map's O(log n) lookup time.

I hope that helps,
Jason

On 22 March 2011 15:49, Jason Dagit wrote:
This seems to consistently give about a 0.4s improvement, which isn't nothing but isn't a particularly interesting chunk of 8s (actually it's 8.4s -> 8s). Setting it to 256M doesn't make any difference.
You should use criterion to make sure that your assessments of the performance difference are grounded in statistically robust reasoning :) Progression may also help.
Thanks. These look useful.
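For the record, a minimal criterion harness for this would look something like the following. The import of aggregate and the throwaway loadSample parser are my assumptions here, not the program's real entry points:

import Criterion.Main (defaultMain, bench, nf)
import Algorithms.RankAggregation (aggregate)  -- assumed export

-- Placeholder parser: treats each line of the sample as one ranking of
-- word tokens; the real program's input handling may well differ.
loadSample :: FilePath -> IO [[String]]
loadSample = fmap (map words . lines) . readFile

main :: IO ()
main = do
  rankings <- loadSample "samples/electornot"
  defaultMain
    [ bench "aggregate/electornot" (nf aggregate rankings) ]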
GHC 7 has a new RTS option, +RTS -H -RTS, that gives you a good high default for these memory numbers. You might give it a go, but I doubt it's going to help much here.
It didn't seem to, no.
One thing I noticed was that you calculate the length of elts more than once inside aggregate. It's probably not going to account for a big boost, but I'd recommend doing it only once. Something like: let lenelts = length elts Then use lenelts everywhere.
This doesn't seem to make much difference - elts is only a thousand or so items.
Also, it seems like you use Data.Set to sort and uniquify the input and then throw away the Set; that's potentially costly.
This is actually the fastest way I've found to do it! It replaced an earlier implementation that looked like it should have been more optimised but wasn't. If you have any better suggestions, I'm definitely open to hearing them.
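For concreteness, the Data.Set trick amounts to something like this (a sketch of the idea rather than the exact code; the list-only variant is shown only as a familiar point of comparison, not necessarily the earlier implementation it replaced):

import qualified Data.Set as Set
import Data.List (group, sort)

-- Sort and de-duplicate in one go; toAscList returns the elements in
-- ascending order with duplicates removed.
sortUniq :: Ord a => [a] -> [a]
sortUniq = Set.toAscList . Set.fromList

-- The more obvious list-only spelling, shown only for comparison.
sortUniqList :: Ord a => [a] -> [a]
sortUniqList = map head . group . sort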
There are probably better types for aggregate than [[a]] -> [a].
What did you have in mind? That's largely the format the data comes in as, so other than changing from lists to some other ordered container I'm not sure what to change.
You use Arrays, but it's possible that a different vector library such as Data.Vector gives better performance here.
For the case I'm using arrays I'm specifically using unboxed double arrays with integer indices, mainly because I do a very large number of lookups.
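Roughly this kind of thing, for illustration (the names are made up, not from the module; the Data.Vector.Unboxed version is what I'd compare against):

import Data.Array.Unboxed (UArray, listArray, (!))
import qualified Data.Vector.Unboxed as VU

-- Unboxed Doubles indexed by Int, as in the lookup-heavy part of the
-- code; illustrative names only.
type Scores = UArray Int Double

mkScores :: [Double] -> Scores
mkScores xs = listArray (0, length xs - 1) xs

lookupScore :: Scores -> Int -> Double
lookupScore = (!)

-- The Data.Vector.Unboxed equivalent, also unboxed with O(1) indexing.
mkScoresV :: [Double] -> VU.Vector Double
mkScoresV = VU.fromList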
Or, perhaps you want that instead of lists everywhere.
Possibly. I looked into it briefly and it's not that easy to do. In particular, there are important bits in the code which depend on being able to pull the list apart head/tail-wise. It might be possible for some of this to use vectors internally, or I might be able to replace those with something else. I'm not entirely sure.
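For reference, the head/tail-style consumption can be mimicked over Data.Vector, since slicing is O(1) and shares the underlying array; whether that actually helps here is an open question:

import qualified Data.Vector as V

-- Sketch: consuming a Vector "head/tail-wise". V.tail is O(1) and does
-- no copying, but each slice keeps the original array alive, so this
-- is not automatically a win over a plain list.
walk :: (a -> b -> b) -> b -> V.Vector a -> b
walk f z v
  | V.null v  = z
  | otherwise = f (V.head v) (walk f z (V.tail v))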
For example, you build an index and a reverse index. It seems like with an array or vector you could eliminate at least one index and have O(1) lookup time instead of Data.Map's O(log n) lookup time.
Yeah, in particular the reverse index really should be a vector. But that particular code is only used once, right at the end, and takes a tiny amount of time, so I don't think it's actually that useful to do.
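If it ever did become worth doing, the change is small - something along these lines, assuming the forward index assigns each item a distinct index in the range 0..n-1 (which I'd have to check):

import qualified Data.Map as Map
import qualified Data.Vector as V
import Data.List (sortBy)
import Data.Ord (comparing)

-- Sketch: turn a forward index (item -> Int, indices 0..n-1) into a
-- reverse index with O(1) lookup instead of Data.Map's O(log n).
reverseIndex :: Map.Map a Int -> V.Vector a
reverseIndex m = V.fromList [ x | (x, _) <- sortBy (comparing snd) (Map.toList m) ]

-- usage: reverseIndex idx V.! i  gives back the item numbered i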
I hope that helps,
It does. Thanks! David

You use a lot of lists (linked lists). Are they all used to represent streams, or are they actually manifest at runtime? If it's the latter, switch to a better data structure, like Vector.
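For example, a toy illustration of the distinction (not taken from the code in question):

import qualified Data.Vector as V

-- Consumed once, front to back: the list is effectively a stream and
-- [a] is a reasonable representation.
totalOnce :: [Double] -> Double
totalOnce = sum

-- Built once and then indexed repeatedly: a manifest structure, where
-- a Vector gives O(1) indexing instead of O(n) (!!) per lookup.
lookupMany :: [Double] -> [Int] -> [Double]
lookupMany xs is = map (v V.!) is
  where
    v = V.fromList xs   -- converted once, shared across all lookups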
participants (4):
- David MacIver
- Jason Dagit
- Jesper Louis Andersen
- Johan Tibell