hstats median algorithm

newer
Cabal install fails due to recent...

David Feuer

1 Sep 2012 1 Sep '12

7:26 p.m.

The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded?

Show replies by date

Ivan Lazar Miljenovic

1 Sep 1 Sep

10:25 p.m.

On 2 September 2012 05:26, David Feuer wrote:

...

The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded?

http://hackage.haskell.org/packages/archive/statistics/latest/doc/html/Stati... ? hstats appears not to have been updated since 2008 and doesn't build with GHC 7.2 (and presumably 7.4). Maybe try contacting the maintainer?

...

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

In my experience, doing much better than the naive algorithm for median is surprisingly hard, and involves a choice from a range of trade-offs. Did you have a particular better algorithm in mind? If you did, you could write it, and contact the package author with a patch. You also may be able to find something of use in Edward Kmett's order-statistics package: http://hackage.haskell.org/package/order-statistics Cheers, Gershom On 9/1/12 3:26 PM, David Feuer wrote:

...

The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded?

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

KC

6:14 a.m.

Is there any special structure in your data that could be exploited? On Sat, Sep 1, 2012 at 6:17 PM, Gershom Bazerman wrote:

...

In my experience, doing much better than the naive algorithm for median is surprisingly hard, and involves a choice from a range of trade-offs. Did you have a particular better algorithm in mind?

If you did, you could write it, and contact the package author with a patch.

You also may be able to find something of use in Edward Kmett's order-statistics package: http://hackage.haskell.org/package/order-statistics

Cheers, Gershom

On 9/1/12 3:26 PM, David Feuer wrote:

...
The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded?

-- -- Regards, KC

David Feuer

11:02 p.m.

I was thinking it should offer a randomized version (taking a generator), since randomized median algorithms provide the best expected performance. It could also offer a deterministic version using some variant of median-of-medians, intended for long lists. I guess it probably should retain the naive version for short lists. Some benchmarking would suggest a good cutoff. Has anyone come up with a better practical deterministic O(n) algorithm since median-of-medians? I saw a paper by Dorit Dor on reducing the number of comparisons to a bit under 3n, which also showed a lower bound of a bit over 2n, but the algorithm she gives strikes me as far too complex to be practical. On Sep 1, 2012 9:17 PM, "Gershom Bazerman" wrote:

...

In my experience, doing much better than the naive algorithm for median is surprisingly hard, and involves a choice from a range of trade-offs. Did you have a particular better algorithm in mind?

If you did, you could write it, and contact the package author with a patch.

You also may be able to find something of use in Edward Kmett's order-statistics package: http://hackage.haskell.org/** package/order-statisticshttp://hackage.haskell.org/package/order-statistics

Cheers, Gershom

On 9/1/12 3:26 PM, David Feuer wrote:

...
The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded?

______________________________**_________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/**mailman/listinfo/haskell-cafe http://www.haskell.org/mailman/listinfo/haskell-cafe

timothyhobbs＠seznam.cz

3 Sep 3 Sep

1 a.m.

It really depends on how you are reading in the data and what you plan to do with it besides taking the median. Obviously, if you read in your data as an ordered list things can be done O(n) without any trouble. In another case, if you already know the range, you can make a hash table and start at the middle bucket and move outwards. That will be O(n) + O (bucketsize log(bucketsize)) given that the middle bucket is non empty and I'm not horribly mistaken. Tim ---------- Původní zpráva ---------- Od: David Feuer Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm " I was thinking it should offer a randomized version (taking a generator), since randomized median algorithms provide the best expected performance. It could also offer a deterministic version using some variant of median-of- medians, intended for long lists. I guess it probably should retain the naive version for short lists. Some benchmarking would suggest a good cutoff. Has anyone come up with a better practical deterministic O(n) algorithm since median-of-medians? I saw a paper by Dorit Dor on reducing the number of comparisons to a bit under 3n, which also showed a lower bound of a bit over 2n, but the algorithm she gives strikes me as far too complex to be practical. On Sep 1, 2012 9:17 PM, "Gershom Bazerman" wrote: " In my experience, doing much better than the naive algorithm for median is surprisingly hard, and involves a choice from a range of trade-offs. Did you have a particular better algorithm in mind? If you did, you could write it, and contact the package author with a patch. You also may be able to find something of use in Edward Kmett's order- statistics package: http://hackage.haskell.org/package/order-statistics (http://hackage.haskell.org/package/order-statistics) Cheers, Gershom On 9/1/12 3:26 PM, David Feuer wrote: " The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded? _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org(mailto:Haskell-Cafe@haskell.org) http://www.haskell.org/mailman/listinfo/haskell-cafe (http://www.haskell.org/mailman/listinfo/haskell-cafe) " " "

timothyhobbs＠seznam.cz

1:07 a.m.

Sorry, I am horribly mistaken. Hash table is O(n)+O(numbuckets)+O (middlebucketsize log(middlebucketsize))... It's too late for this stuff... Tim ---------- Původní zpráva ---------- Od: timothyhobbs@seznam.cz Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm " It really depends on how you are reading in the data and what you plan to do with it besides taking the median. Obviously, if you read in your data as an ordered list things can be done O(n) without any trouble. In another case, if you already know the range, you can make a hash table and start at the middle bucket and move outwards. That will be O(n) + O (bucketsize log(bucketsize)) given that the middle bucket is non empty and I'm not horribly mistaken. Tim ---------- Původní zpráva ---------- Od: David Feuer Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm " I was thinking it should offer a randomized version (taking a generator), since randomized median algorithms provide the best expected performance. It could also offer a deterministic version using some variant of median-of- medians, intended for long lists. I guess it probably should retain the naive version for short lists. Some benchmarking would suggest a good cutoff. Has anyone come up with a better practical deterministic O(n) algorithm since median-of-medians? I saw a paper by Dorit Dor on reducing the number of comparisons to a bit under 3n, which also showed a lower bound of a bit over 2n, but the algorithm she gives strikes me as far too complex to be practical. On Sep 1, 2012 9:17 PM, "Gershom Bazerman" wrote: " In my experience, doing much better than the naive algorithm for median is surprisingly hard, and involves a choice from a range of trade-offs. Did you have a particular better algorithm in mind? If you did, you could write it, and contact the package author with a patch. You also may be able to find something of use in Edward Kmett's order- statistics package: http://hackage.haskell.org/package/order-statistics (http://hackage.haskell.org/package/order-statistics) Cheers, Gershom On 9/1/12 3:26 PM, David Feuer wrote: " The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded? _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org(mailto:Haskell-Cafe@haskell.org) http://www.haskell.org/mailman/listinfo/haskell-cafe (http://www.haskell.org/mailman/listinfo/haskell-cafe) " " " "

timothyhobbs＠seznam.cz

12:57 p.m.

So I've been playing with the median problem today. Not sure why, but it stuck in my head.

...

import Data.List import Control.Monad.ST import Data.STRef import Control.Monad

I've been using the hashing algorithm that I described last night, but it's quite slow. I must be missing something obvious. It really shouldn't be slow at all! So I have a median function:

...

median

First I assume you know the range of your data.

...

(min,max)

Then I ask you to figure out how many buckets you want to create, and which of those buckets you actually want to fill. You should only fill the buckets near the middle, based on an educated guess of the distribution of your data, otherwise, you will end up wasting memory. I you guessed the middle correctly, you will get back (Just median) otherwise, you will get back Nothing.

...

numBuckets guessedMiddle

Then you are to pass it your list.

...

list =

...

let

First I seperate out the values into a list of buckets, each one holding values which are near to eachother. I can figure out the length of the list at the same time, since I have to go through the whole thing anyways.

...

(myBuckets,length) = buckets (min,max) numBuckets guessedMiddle list

The buckets are set up, so that the first bucket has the lowest values, and the last bucket has the highest values. Each bucket, is a tuple, of it's length and it's contents. We fold across the list of buckets, accumulating the "index so far", until we find the bucket in which the median must reside.

...

Right (medianBucket,stubLen) = foldr (\thisBucket@(thisBucketLen,_) eitheriOrMedianBucket -> case eitheriOrMedianBucket of Left i -> if i + thisBucketLen > (length `div` 2) then Right (thisBucket, thisBucketLen-((length `div` 2) - i)) else Left (i + thisBucketLen) _ -> eitheriOrMedianBucket) (Left 0) myBuckets in

We then sort the bucket in which the median must reside, and we then find the median the normal way. This should be faster, since we only had to sort one bucket, rather than the entire list. It is not, so there must be something horrible going on.

...

case snd medianBucket of Just medianBucket' -> Just (sort medianBucket' `genericIndex` stubLen) Nothing -> Nothing

Here is my actual function for seperating the values out into buckets.

...

buckets (min,max) numBuckets (guessedMiddleStart,guessedMiddleEnd) list = runST $ do lengthRef <- newSTRef 0

First we create a list of empty buckets. Since it would be a waste of memory to actually fill the buckets near the edges of our distribution(where we are not likely to find our median), our buckets contain Maybe lists, and the buckets which are outside of our guessed bucket range will be filled with Nothing.

...

buckets' <- mapM (\n-> newSTRef (0, if n >= guessedMiddleStart && n <= guessedMiddleEnd then Just [] else Nothing)) [0..numBuckets]

Then we go through the buckets, figuring out which bucket to put a given value into. We calculate the length at the same time.

...

forM_ list $ \number -> do let

Figure out which bucket to put this into.

...

bucket = whichBucket (min,max) numBuckets number

Increment length.

...

modifySTRef lengthRef (+1)

Put the value into the appropriate bucket.

...

modifySTRef (buckets' `genericIndex` bucket) --Obvious optimization, use an array and not a list. (\(oldLen,oldListMaybe)-> case oldListMaybe of Just oldList -> (oldLen+1,Just (number:oldList)) Nothing -> (oldLen + 1, Nothing))

...

filledBuckets <- mapM readSTRef buckets' length <- readSTRef lengthRef return (filledBuckets,length)

...

whichBucket (min,max) numBuckets number = (number - min) `div` ((max - min) `div` numBuckets)

So I created a little test scenario.

...

someListNumBuckets = 100

...

someListGuessedMiddle = (45,55)

...

someListLength = genericLength someList

And found that this:

...

realMedian = sort someList `genericIndex` (someListLength `div` 2)

Is actually faster ^_^ :O *Main> realMedian 500 (7.18 secs, 2031543472 bytes) Than this:

...

someListMedian = median (someListMin,someListMax) someListNumBuckets someListGuessedMiddle someList

*Main> someListMedian Just 500 (37.77 secs, 15376209200 bytes)

...

someListMin :: Integer someListMin = 0

...

someListMax :: Integer someListMax = 1000

...

someList :: [Integer] someList = concatMap (\n->intersperse n [someListMin..someListMax]) [someListMax,someListMax-1..someListMin]

---------- Původní zpráva ---------- Od: timothyhobbs@seznam.cz Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm " It really depends on how you are reading in the data and what you plan to do with it besides taking the median. Obviously, if you read in your data as an ordered list things can be done O(n) without any trouble. In another case, if you already know the range, you can make a hash table and start at the middle bucket and move outwards. That will be O(n) + O (bucketsize log(bucketsize)) given that the middle bucket is non empty and I'm not horribly mistaken. Tim ---------- Původní zpráva ---------- Od: David Feuer Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm " I was thinking it should offer a randomized version (taking a generator), since randomized median algorithms provide the best expected performance. It could also offer a deterministic version using some variant of median-of- medians, intended for long lists. I guess it probably should retain the naive version for short lists. Some benchmarking would suggest a good cutoff. Has anyone come up with a better practical deterministic O(n) algorithm since median-of-medians? I saw a paper by Dorit Dor on reducing the number of comparisons to a bit under 3n, which also showed a lower bound of a bit over 2n, but the algorithm she gives strikes me as far too complex to be practical. On Sep 1, 2012 9:17 PM, "Gershom Bazerman" wrote: " In my experience, doing much better than the naive algorithm for median is surprisingly hard, and involves a choice from a range of trade-offs. Did you have a particular better algorithm in mind? If you did, you could write it, and contact the package author with a patch. You also may be able to find something of use in Edward Kmett's order- statistics package: http://hackage.haskell.org/package/order-statistics (http://hackage.haskell.org/package/order-statistics) Cheers, Gershom On 9/1/12 3:26 PM, David Feuer wrote: " The median function in the hstats package uses a naive O(n log n) algorithm. Is there another package providing an O(n) option? If not, what would it take to get the package upgraded? _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org(mailto:Haskell-Cafe@haskell.org) http://www.haskell.org/mailman/listinfo/haskell-cafe (http://www.haskell.org/mailman/listinfo/haskell-cafe) " " " "

Felipe Almeida Lessa

2:18 p.m.

Some comments wrt. performance: On Mon, Sep 3, 2012 at 9:57 AM, wrote:

...

...
Right (medianBucket,stubLen) = foldr (\thisBucket@(thisBucketLen,_) eitheriOrMedianBucket -> case eitheriOrMedianBucket of Left i -> if i + thisBucketLen > (length `div` 2) then Right (thisBucket, thisBucketLen-((length `div` 2) - i)) else Left (i + thisBucketLen) _ -> eitheriOrMedianBucket) (Left 0) myBuckets

Use a let to store length `div` 2, GHC probably won't do this for you automatically. Ditto for i + thisBucketLen.

...

...
buckets' <- mapM (\n-> newSTRef (0, if n >= guessedMiddleStart && n <= guessedMiddleEnd then Just [] else Nothing)) [0..numBuckets]

You should really use a mutable array here instead of a list of STRefs (I recommend vector's STVector [1]). [1] http://hackage.haskell.org/packages/archive/vector/0.9.1/doc/html/Data-Vecto...

...

Increment length.

...
modifySTRef lengthRef (+1)

This will create a big thunk for the length, you should use oldLength <- readSTRef lengthRef writeSTRef lengthRef $! oldLength + 1 (I'm not sure if a strict modifySTRef exists somewhere...)

...

Put the value into the appropriate bucket.

...
modifySTRef (buckets' `genericIndex` bucket) --Obvious optimization, use an array and not a list. (\(oldLen,oldListMaybe)-> case oldListMaybe of Just oldList -> (oldLen+1,Just (number:oldList)) Nothing -> (oldLen + 1, Nothing))

Ditto for oldLen here. Also, you can simplify this lambda a lot: import Control.Applicative ((<$>)) \(oldLen, oldVal) -> let newLen = oldLen + 1 newVal = (number:) <$> oldVal in newLen `seq` newVal `seq` (newLen, newVal) HTH! Cheers, -- Felipe.

Felipe Almeida Lessa

2:19 p.m.

On Mon, Sep 3, 2012 at 11:18 AM, Felipe Almeida Lessa wrote:

...

Ditto for oldLen here. Also, you can simplify this lambda a lot:

import Control.Applicative ((<$>))

\(oldLen, oldVal) -> let newLen = oldLen + 1 newVal = (number:) <$> oldVal in newLen `seq` newVal `seq` (newLen, newVal)

Or, using BangPatterns, \(oldLen, oldVal) -> let !newLen = oldLen + 1 !newVal = (number:) <$> oldVal in (newLen, newVal) Cheers, -- Felipe.

timothyhobbs＠seznam.cz

7 p.m.

Thanks for the advice. After taking most of it it is faster. But it is still many times slower than it ought to be! This algorithm should be much faster than simply sorting the list, and yet it is more than twice as slow! One note, you said: """"

...

Increment length.

...
modifySTRef lengthRef (+1)

This will create a big thunk for the length, you should use oldLength <- readSTRef lengthRef writeSTRef lengthRef $! oldLength + 1 (I'm not sure if a strict modifySTRef exists somewhere...)"""" Actually, replacing modifySTRef with that code is just a hair slower... Not sure why. I'm attaching the super optimized version. Along with the profiler output. I just can't understand what is slow here :( Timothy ---------- Původní zpráva ---------- Od: Felipe Almeida Lessa Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm "On Mon, Sep 3, 2012 at 11:18 AM, Felipe Almeida Lessa wrote:

...

Ditto for oldLen here. Also, you can simplify this lambda a lot:

import Control.Applicative ((<$>))

\(oldLen, oldVal) -> let newLen = oldLen + 1 newVal = (number:) <$> oldVal in newLen `seq` newVal `seq` (newLen, newVal)

Or, using BangPatterns, \(oldLen, oldVal) -> let !newLen = oldLen + 1 !newVal = (number:) <$> oldVal in (newLen, newVal) Cheers, -- Felipe."

timothyhobbs＠seznam.cz

7:55 p.m.

Aha!!! Now it's working. Just had to compile with -O2 :D :D Now I'm over twice as fast for a list of 2 million! With better length based analysis of how many buckets should be used, this number can be improved. You can feel free to use my code however you like. I've attached the final version. That was fun :D Timothy ---------- Původní zpráva ---------- Od: timothyhobbs@seznam.cz Datum: 3. 9. 2012 Předmět: Re: [Haskell-cafe] hstats median algorithm " Thanks for the advice. After taking most of it it is faster. But it is still many times slower than it ought to be! This algorithm should be much faster than simply sorting the list, and yet it is more than twice as slow! One note, you said: """"

...

Increment length.

...
modifySTRef lengthRef (+1)

...

Ditto for oldLen here. Also, you can simplify this lambda a lot:

import Control.Applicative ((<$>))

\(oldLen, oldVal) -> let newLen = oldLen + 1 newVal = (number:) <$> oldVal in newLen `seq` newVal `seq` (newLen, newVal)

Or, using BangPatterns, \(oldLen, oldVal) -> let !newLen = oldLen + 1 !newVal = (number:) <$> oldVal in (newLen, newVal) Cheers, -- Felipe." "

4702

Age (days ago)

4704

Last active (days ago)

List overview

Download

11 comments

6 participants

participants (6)

David Feuer
Felipe Almeida Lessa
Gershom Bazerman
Ivan Lazar Miljenovic
KC
timothyhobbs＠seznam.cz