
Hello! I'm tryig to write efficient code for creating histograms. I have following requirements for it: 1. O(1) element insertion 2. No reallocations. Thus in place updates are needed. accumArray won't go because I need to fill a lot of histograms (hundrends) from vely long list of data (possibly millions of elements) and it will traverse input data for each histogram. It seems that I need to use mutable array and/or ST monad or something else. Sadly both of them are tricky and difficult to understand. So good examples or any other ideas greatly appreciated. -- Alexey Khudyakov

Alexey Khudyakov wrote:
Hello!
I'm tryig to write efficient code for creating histograms. I have following requirements for it:
1. O(1) element insertion 2. No reallocations. Thus in place updates are needed.
accumArray won't go because I need to fill a lot of histograms (hundrends) from vely long list of data (possibly millions of elements) and it will traverse input data for each histogram.
That's just not true, for GHC's implementation of accumArray at least, which goes via the ST monad. It creates a mutable array, fills it, traversing the input list exactly once, and finally freezes the array and returns it. This is just what you suggest below. If you still run into performance problems, try out unboxed arrays. If that isn't enough, unsafeAccumArray from Data.Base may help. I'd try both before using the ST monad directly.
It seems that I need to use mutable array and/or ST monad or something else. Sadly both of them are tricky and difficult to understand. So good examples or any other ideas greatly appreciated.
http://www.haskell.org/haskellwiki/Shootout/Nsieve_Bits perhaps. There must be better examples out there. I can think of two common problems with mutable arrays and ST: 1) You need to specify a type signature for the array being created, because the compiler can't guess the MArray instance that you want. For example, from the shootout entry: arr <- newArray (0,m) False :: IO (IOUArray Int Bool) In ST, this is slightly trickier, because the phantom 's' type parameter has to be mirrord in the ST*Array type constructor. You can use scoped type variables, which allow you to write {-# LANGUAGE ScopedTypeVariables #-} import Control.Monad.ST import Data.Array.ST foo :: forall s . ST s () foo = do arr <- newArray (0,42) False :: ST s (STUArray s Int Bool) ... Alternatively you can define helper functions to specify just the part of the type signature that you care about. stuArray :: ST s (STUArray s i e) -> ST s (STUArray s i e) stuArray = id foo :: ST s () foo = do arr <- stuArray $ newArray (0,42 :: Int) False ... 2) runST $ foo bar doesn't work. You have to write runST (foo bar) But in the end it's just imperative array programming with a rather verbose syntax -- you can do only one array access per statement, and 'readArray' and 'writeArray' are rather long names. HTH, Bertram

On Mon, 2008-11-10 at 18:05 +0100, Bertram Felgenhauer wrote:
Alexey Khudyakov wrote:
Hello!
I'm tryig to write efficient code for creating histograms. I have following requirements for it:
1. O(1) element insertion 2. No reallocations. Thus in place updates are needed.
accumArray won't go because I need to fill a lot of histograms (hundrends) from vely long list of data (possibly millions of elements) and it will traverse input data for each histogram.
That's just not true, for GHC's implementation of accumArray at least, which goes via the ST monad.
Perhaps I misunderstood but I think Alexey means that he wants to accumulate several different histograms (ie different arrays) but to only make one pass over the input data. The form of accumArray does not make that possible (unless one managed to pack the different histograms into different parts of the same array). If a fold using a pure persistent map type really is too slow then it should still be possible to do an ST implementation in a reasonably elegant way using a foldr on the input list, accumulating an ST action. Duncan

Duncan Coutts
Perhaps I misunderstood but I think Alexey means that he wants to accumulate several different histograms (ie different arrays) but to only make one pass over the input data.
This is precicely my problem, too.
The form of accumArray does not make that possible (unless one managed to pack the different histograms into different parts of the same array).
This would have been a good match for my case. Of course, that is a bit of a hack, and it would be much nicer and safer to do an accumArray with tuple elements. But I need to use an UArray in order to avoid the array filling up with unevaluated thunks, and - correct me if I'm wrong - there's no way to do a 'seq' before accumArray's insertion for complex types. So the stack blows up. /me wishes for s/Data.Array.Unboxed/Data.Array.Strict/g and instances for arbitrary types. Anyway, I thought of using offsets in the same UArray too late, and chose to finish my ST-based implementation instead.
If a fold using a pure persistent map type really is too slow then it should still be possible to do an ST implementation in a reasonably elegant way using a foldr on the input list, accumulating an ST action.
Or must mapM_ an insert operation (in ST) over the input data? -k -- If I haven't seen further, it is by standing in the footprints of giants

Alexey Khudyakov wrote:
Hello!
I'm tryig to write efficient code for creating histograms. I have following requirements for it:
1. O(1) element insertion 2. No reallocations. Thus in place updates are needed.
accumArray won't go because I need to fill a lot of histograms (hundrends) from vely long list of data (possibly millions of elements) and it will traverse input data for each histogram.
Sorry, Duncan is right. I misread here. My first idea would still be to use accumArray though, or rather, accum, processing the input data in chunks of an appropriate size (which depends on the histogram sizes.) But actually, the ST code isn't bad (I hope this isn't homework): ------------------------------------------------------------------------ import Control.Monad.ST import Control.Monad import Data.Array.ST import Data.Array.Unboxed stuArray :: ST s (STUArray s i e) -> ST s (STUArray s i e) stuArray = id -- Create histograms. -- -- Each histogram is described by a pair (f, (l, u)), where 'f' maps -- a data entry to an Int index, and l and u are lower and upper bounds -- of the indices, respectively. -- mkHistograms :: [(a -> Int, (Int, Int))] -> [a] -> [UArray Int Int] mkHistograms hs ds = runST collect where -- Why is the type signature on 'collect' required here? collect :: ST s [UArray Int Int] collect = do -- create histogram arrays of appropriate sizes histograms <- forM hs $ \(_, range) -> do stuArray $ newArray range 0 -- iterate through the data forM_ ds $ \d -> do -- iterate through the histograms forM_ (zip hs histograms) $ \((f, _), h) -> do -- update appropriate entry writeArray h (f d) . succ =<< readArray h (f d) -- finally, freeze the histograms and return them -- (using unsafeFreeze is ok because we're done modifying the -- arrays) mapM unsafeFreeze histograms test = mkHistograms [((`mod` 3), (0,2)), ((`mod` 5), (0,4))] [0..10] -- test returns -- [array (0,2) [(0,4),(1,4),(2,3)], -- array (0,4) [(0,3),(1,2),(2,2),(3,2),(4,2)]] ------------------------------------------------------------------------ Bertram P.S. Ryan is right, too - I'm not sure where I got confused there. runST $ foo didn't work in ghc 6.6; I knew that it works in ghc 6.8.3, but I thought this was changed again.
participants (5)
-
Alexey Khudyakov
-
Bertram Felgenhauer
-
Duncan Coutts
-
Ketil Malde
-
Ryan Ingram