
So I've been playing with the median problem today. Not sure why, but it stuck in my head.
import Data.List import Control.Monad.ST import Data.STRef import Control.Monad
I've been using the hashing algorithm that I described last night, but it's quite slow. I must be missing something obvious. It really shouldn't be slow at all! So I have a median function:
median
First I assume you know the range of your data.
(min,max)
Then I ask you to figure out how many buckets you want to create, and which of those buckets you actually want to fill. You should only fill the buckets near the middle, based on an educated guess of the distribution of your data, otherwise, you will end up wasting memory. I you guessed the middle correctly, you will get back (Just median) otherwise, you will get back Nothing.
numBuckets guessedMiddle
Then you are to pass it your list.
list =
let
First I seperate out the values into a list of buckets, each one holding values which are near to eachother. I can figure out the length of the list at the same time, since I have to go through the whole thing anyways.
(myBuckets,length) = buckets (min,max) numBuckets guessedMiddle list
The buckets are set up, so that the first bucket has the lowest values, and the last bucket has the highest values. Each bucket, is a tuple, of it's length and it's contents. We fold across the list of buckets, accumulating the "index so far", until we find the bucket in which the median must reside.
Right (medianBucket,stubLen) = foldr (\thisBucket@(thisBucketLen,_) eitheriOrMedianBucket -> case eitheriOrMedianBucket of Left i -> if i + thisBucketLen > (length `div` 2) then Right (thisBucket, thisBucketLen-((length `div` 2) - i)) else Left (i + thisBucketLen) _ -> eitheriOrMedianBucket) (Left 0) myBuckets in
We then sort the bucket in which the median must reside, and we then find the median the normal way. This should be faster, since we only had to sort one bucket, rather than the entire list. It is not, so there must be something horrible going on.
case snd medianBucket of Just medianBucket' -> Just (sort medianBucket' `genericIndex` stubLen) Nothing -> Nothing
Here is my actual function for seperating the values out into buckets.
buckets (min,max) numBuckets (guessedMiddleStart,guessedMiddleEnd) list = runST $ do lengthRef <- newSTRef 0
First we create a list of empty buckets. Since it would be a waste of memory to actually fill the buckets near the edges of our distribution(where we are not likely to find our median), our buckets contain Maybe lists, and the buckets which are outside of our guessed bucket range will be filled with Nothing.
buckets' <- mapM (\n-> newSTRef (0, if n >= guessedMiddleStart && n <= guessedMiddleEnd then Just [] else Nothing)) [0..numBuckets]
Then we go through the buckets, figuring out which bucket to put a given value into. We calculate the length at the same time.
forM_ list $ \number -> do let
Figure out which bucket to put this into.
bucket = whichBucket (min,max) numBuckets number
Increment length.
modifySTRef lengthRef (+1)
Put the value into the appropriate bucket.
modifySTRef (buckets' `genericIndex` bucket) --Obvious optimization, use an array and not a list. (\(oldLen,oldListMaybe)-> case oldListMaybe of Just oldList -> (oldLen+1,Just (number:oldList)) Nothing -> (oldLen + 1, Nothing))
filledBuckets <- mapM readSTRef buckets' length <- readSTRef lengthRef return (filledBuckets,length)
whichBucket (min,max) numBuckets number = (number - min) `div` ((max - min) `div` numBuckets)
So I created a little test scenario.
someListNumBuckets = 100
someListGuessedMiddle = (45,55)
someListLength = genericLength someList
And found that this:
realMedian = sort someList `genericIndex` (someListLength `div` 2)
Is actually faster ^_^ :O *Main> realMedian 500 (7.18 secs, 2031543472 bytes) Than this:
someListMedian = median (someListMin,someListMax) someListNumBuckets someListGuessedMiddle someList
*Main> someListMedian Just 500 (37.77 secs, 15376209200 bytes)
someListMin :: Integer someListMin = 0
someListMax :: Integer someListMax = 1000
someList :: [Integer] someList = concatMap (\n->intersperse n [someListMin..someListMax]) [someListMax,someListMax-1..someListMin]
---------- Původní zpráva ----------
Od: timothyhobbs@seznam.cz
Datum: 3. 9. 2012
Předmět: Re: [Haskell-cafe] hstats median algorithm
"
It really depends on how you are reading in the data and what you plan to do
with it besides taking the median. Obviously, if you read in your data as
an ordered list things can be done O(n) without any trouble.
In another case, if you already know the range, you can make a hash table
and start at the middle bucket and move outwards. That will be O(n) + O
(bucketsize log(bucketsize)) given that the middle bucket is non empty and
I'm not horribly mistaken.
Tim
---------- Původní zpráva ----------
Od: David Feuer