Efficient functional idiom for histogram

How would I efficiently write a function in Haskell to count occurrences of unique elements in a (potentially very large) list? For example, given the list [1,2,3,4,5,3,4,2,4] I would like the output [[1,1], [2,2], [3,2], [4,3], [5,1]] (or some equivalent representation). Clearly, this won't be possible for infinite input, but I would like it to be as efficient as possible for (lazy) input lists containing many millions of elements. So an implementation based on group . sort is not going to work. In an imperative language like Python, I'd use a dictionary as an accumulator - something like for el in input: accums[i] = accums.get(i, 0) + 1 That's about as efficient as you can hope for (assuming an efficient dictionary implementation). How would I code something equivalent in Haskell? Thanks, Paul.

Paul Moore
How would I efficiently write a function in Haskell to count occurrences of unique elements in a (potentially very large) list? For example, given the list [1,2,3,4,5,3,4,2,4] I would like the output [[1,1], [2,2], [3,2], [4,3], [5,1]] (or some equivalent representation).
import qualified Data.Map as Map
import Data.Map (Map)
histogram :: Ord a => [a] -> [(a,Int)]
histogram = Map.assocs . foldl f Map.empty
where
f m k = Map.insertWith (+) k 1 m
G.
--
Gregory Collins

Gregory Collins
Paul Moore
writes: How would I efficiently write a function in Haskell to count occurrences of unique elements in a (potentially very large) list? For example, given the list [1,2,3,4,5,3,4,2,4] I would like the output [[1,1], [2,2], [3,2], [4,3], [5,1]] (or some equivalent representation).
import qualified Data.Map as Map import Data.Map (Map)
histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m
Here "foldl" should be "foldl'" (d'oh!)
G
--
Gregory Collins

2009/7/31 Gregory Collins
Paul Moore
writes: How would I efficiently write a function in Haskell to count occurrences of unique elements in a (potentially very large) list? For example, given the list [1,2,3,4,5,3,4,2,4] I would like the output [[1,1], [2,2], [3,2], [4,3], [5,1]] (or some equivalent representation).
import qualified Data.Map as Map import Data.Map (Map)
histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m
Right. I see how that works, and can work out how to think about this sort of thing from your example. Thanks very much. BTW, I did know that Haskell had an efficient map implementation, I just wasn't sure how to use it "functionally" - I probably should have searched a bit harder for examples before posting. Thanks for the help in any case. Paul.

2009/8/1 Paul Moore
BTW, I did know that Haskell had an efficient map implementation, I just wasn't sure how to use it "functionally" - I probably should have searched a bit harder for examples before posting. Thanks for the help in any case.
Know that Data.Map uses size balanced trees and is not e.g. a hash map. -- Deniz Dogan

2009/7/31 Paul Moore
2009/7/31 Gregory Collins
: Paul Moore
writes: How would I efficiently write a function in Haskell to count occurrences of unique elements in a (potentially very large) list? For example, given the list [1,2,3,4,5,3,4,2,4] I would like the output [[1,1], [2,2], [3,2], [4,3], [5,1]] (or some equivalent representation).
import qualified Data.Map as Map import Data.Map (Map)
histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m
Right. I see how that works, and can work out how to think about this sort of thing from your example. Thanks very much.
BTW, I did know that Haskell had an efficient map implementation, I just wasn't sure how to use it "functionally" - I probably should have searched a bit harder for examples before posting. Thanks for the help in any case.
Hmm, I'm obviously still mucking up the performance somehow. My full program (still a toy, but a step on the way to what I'm aiming at) is as follows. It's rolling 3 6-sided dice 100000 times, and printing a summary of the results. import System.Random import qualified Data.Map as Map import Data.Map (Map) import Data.List dice :: Int -> Int -> IO Int dice 0 n = return 0 dice m n = do total <- dice (m - 1) n roll <- randomRIO (1, n) return (total + roll) simulate count m n = do mapM (dice m) (replicate count n) histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m simulation = do lst <- simulate 100000 3 6 return (histogram lst) main = do s <- simulation putStrLn (show s) When compiled, this takes over twice as long as a naively implemented Python program. What am I doing wrong here? I'd have expected compiled Haskell to be faster than interpreted Python, so obviously my approach is wrong. I'm expecting the answer to be that I've got unnecessary laziness - which is fine, but ultimately my interest is in ease of expression and performance combined, so I'm looking for beginner-level improvements rather than subtle advanced techniques like unboxing. Thanks, Paul. PS I know my code is probably fairly clumsy - I'd appreciate style suggestions, but my main interest here is whether a beginner, with a broad programming background, a basic understanding of Haskell, and access to Google, put together a clear, efficient, program (ie, the case where my usual scripting language is too slow and I want to knock something up quickly in a high-level, high-performance language).

On Sat, Aug 1, 2009 at 4:44 PM, Paul Moore
PS I know my code is probably fairly clumsy - I'd appreciate style suggestions, but my main interest here is whether a beginner, with a broad programming background, a basic understanding of Haskell, and access to Google, put together a clear, efficient, program (ie, the case where my usual scripting language is too slow and I want to knock something up quickly in a high-level, high-performance language).
Here is one way to rewrite your program. It improved the speed somewhat for me. I timed both programs on my computer. I suppose one could try using an array for calculating the histogram as well, but I only tried the simples thing here. I hope someone can weigh in with a more thorough analysis. Please note how I've avoided including the IO Monad in some type signatures by extracting the data from it locally (with <-). It is quite possible to apply the histogram function to the data before going through the IO Monad as well, but it doesn't appear to change the execution speed much here. Caveat: My testing wasn't extensive. I just compiled with -O and timed the programs a couple of times. import System.Random import qualified Data.Map as Map import Data.Map (Map) import Data.List diceRolls :: Int -> IO [Int] diceRolls highVal = do generator <- getStdGen return (randomRs (1, highVal) generator) groupDice :: Int -> [Int] -> [[Int]] groupDice chunk rolls = map (take chunk) $ iterate (drop chunk) rolls simulate :: Int -> Int -> Int -> IO [Int] simulate count m n = do rolls <- diceRolls n let sums = map sum $ groupDice m rolls return (take count sums) histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m simulation = do lst <- simulate 100000 3 6 return (histogram $ lst) main = do s <- simulation putStrLn (show s)

Paul Moore
What am I doing wrong here? I'd have expected compiled Haskell to be faster than interpreted Python, so obviously my approach is wrong.
Two things from my experience: Python associative arrays are fast, and Haskell random numbers are slow. So by building a map from random numbers, you are likely doing a fairly pessimal comparison. Did you profile to see where the time is spent?
I'm expecting the answer to be that I've got unnecessary laziness
IME, laziness usually affects space, but not so much time performance. Although 'insertWith (+)' looks like it would build unevaluated thunks for the sums. -k -- If I haven't seen further, it is by standing in the footprints of giants

2009/8/1 Ketil Malde
Paul Moore
writes: What am I doing wrong here? I'd have expected compiled Haskell to be faster than interpreted Python, so obviously my approach is wrong.
Two things from my experience: Python associative arrays are fast, and Haskell random numbers are slow. So by building a map from random numbers, you are likely doing a fairly pessimal comparison.
Hmm, I think you're saying that for this problem, Haskell is likely to be worse than Python. Interesting. I guess I'd assumed that something CPU-intensive like this would benefit from a compiled language. Just goes to show that intuition about performance is usually wrong (something I really should know by now :-)) Is the issue with random numbers just the implementation, or is it something inherent in the non-pure nature of a random number generator that makes it hard for Haskell to implement efficiently? If the latter, that probably makes Haskell a somewhat poor choice for simulation-type programs.
Did you profile to see where the time is spent?
Not until you suggested it :-) I was pleasantly surprised by how easy it was to do. The results bear out what you were saying, the bottleneck is entirely in the "dice" function.
I'm expecting the answer to be that I've got unnecessary laziness
IME, laziness usually affects space, but not so much time performance. Although 'insertWith (+)' looks like it would build unevaluated thunks for the sums.
That'll teach me to bandy about technical terms as if I understand them :-) Thanks for the comments, Paul.

On Aug 1, 2009, at 2:06 PM, Paul Moore wrote:
Is the issue with random numbers just the implementation, or is it something inherent in the non-pure nature of a random number generator that makes it hard for Haskell to implement efficiently? If the latter, that probably makes Haskell a somewhat poor choice for simulation-type programs.
Well, I'm not sure of the details, but in your original implementation, you're performing IO to pull the seed out of a ref at every iteration. Pekka Karjalainen's doesn't do that, which probably helps with the speedup. Along with that, Haskell has a fairly slow random implementation. As I recall however, this is partially because it hasn't received a great deal of optimization, but mainly because the algorithm itself fulfills some rather strong properties -- in particular it must be fairly statistically robust, and it must provide a "split" function which produces generators that are independently robust [1]. This limits algorithm choice quite a bit. For other random numbers, with different properties (faster, but with tradeoffs in robustness, or ability to split, or both), you can check hackage for at least mersenne-random and gsl-random. There may be others that I don't recall. Cheers, Sterl. [1] http://www.haskell.org/onlinereport/random.html

Sterling Clover wrote:
For other random numbers, with different properties (faster, but with tradeoffs in robustness, or ability to split, or both), you can check hackage for at least mersenne-random and gsl-random.
gsl-random is definitely worth a try for performance. I recently did a rough benchmark (not rigorous) of Control.Monad.Random, which uses System.Random, and Control.Monad.MC, which uses GSL.Random. Control.Monad.Random took about 3.6 times longer to generate a few million random numbers compared to Control.Monad.MC, and used five times as much heap. The GSL-based version generated about 340,000 random numbers per second on a 1.8GHz notebook. Anton

Am Samstag 01 August 2009 20:50:10 schrieb Sterling Clover:
On Aug 1, 2009, at 2:06 PM, Paul Moore wrote:
Is the issue with random numbers just the implementation, or is it something inherent in the non-pure nature of a random number generator that makes it hard for Haskell to implement efficiently? If the latter, that probably makes Haskell a somewhat poor choice for simulation-type programs.
If you view a PRNG as a function from the seed to the sequence of generated numbers or as a function state -> (bitpattern, newstate), PRNGs are pure (at least, I know no counterexample), so it's not inherently inefficient in Haskell, though it's probably still faster in C. One thing that makes StdGen slow is splittability, as Sterling points out below. For a simulation programme where you don't need splittability, choose a different PRNG.
Well, I'm not sure of the details, but in your original implementation, you're performing IO to pull the seed out of a ref at every iteration. Pekka Karjalainen's doesn't do that, which probably helps with the speedup. Along with that, Haskell has a fairly slow random implementation. As I recall however, this is partially because it hasn't received a great deal of optimization, but mainly because the algorithm itself fulfills some rather strong properties -- in particular it must be fairly statistically robust, and it must provide a "split" function which produces generators that are independently robust [1]. This limits algorithm choice quite a bit.
For other random numbers, with different properties (faster, but with tradeoffs in robustness, or ability to split, or both), you can check hackage for at least mersenne-random and gsl-random.
I didn't get much speedup with System.Random.Mersenne.Pure64 (might be because I have a 32-bit system and Word64 goes through foreign calls, if that is still the case), but GSL.Random.Gen reduced the time by a factor of over 5. It forces you back into IO and is a little less convenient, but if speed is a concern, it's a price worth to pay. --------------------------------------- module Main (main) where import GSL.Random.Gen import qualified Data.Map as Map import Data.Map (Map) import Data.List import System.IO.Unsafe import System.Time import Data.Word dice :: RNG -> Int -> Int -> IO Int dice _ 0 n = return 0 dice rng m n = do total <- dice rng (m - 1) n roll <- fmap (+1) $ getUniformInt rng n return (total + roll) simulate _ 0 _ _ = return [] simulate rng count m n = unsafeInterleaveIO $ do val <- dice rng m n tl <- simulate rng (count-1) m n return (val:tl) histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl' f Map.empty where f m k = Map.insertWith' (+) k 1 m simulation rng = do lst <- simulate rng 1000000 3 6 return (histogram lst) main = do rng <- newRNG mt19937 sd <- getTimeSeed setSeed rng sd -- omit seeding for reproducible results s <- simulation rng putStrLn (show s) getTimeSeed :: IO Word64 getTimeSeed = do TOD a b <- getClockTime return . fromInteger $ 10^6*a + b `quot` (10^6) --------------------------------------------------
There may be others that I don't recall.
Cheers, Sterl.

My measurements show that - using strict version of insertWith doesn't improve performance. - in case of compilation with -O2 flag foldl' also is equal to foldl (-O2 gives approx 2 time impovements).- using RandomGen and State monad to generate a list gives at least 4 times improvements (on 1 000 000 items). More complicated improvements (using Array, PRNG and so on) were not tested by me.

Dmitry Olshansky wrote:
My measurements show that... (-O2 gives approx 2 time impovements). ...using RandomGen and State monad to generate a list gives at least 4 times improvements (on 1 000 000 items).
You earlier said:
this takes over twice as long as a naively implemented Python program
So now our "naive" Haskell - ordinary usage of Data.Map and System.Random, without resorting to things like unboxed arrays - is beating naive Python handily. Correct? Or is this with an alternate RNG? Although I think even that would be fair, since Python uses Mersenne. Regards, Yitz

2009/8/5 Yitzchak Gale
Dmitry Olshansky wrote:
My measurements show that... (-O2 gives approx 2 time impovements). ...using RandomGen and State monad to generate a list gives at least 4 times improvements (on 1 000 000 items).
You earlier said:
this takes over twice as long as a naively implemented Python program
The latter was me :-)
So now our "naive" Haskell - ordinary usage of Data.Map and System.Random, without resorting to things like unboxed arrays - is beating naive Python handily. Correct?
I haven't checked myself (and won't have time in the next couple of weeks, as I'm on holiday - but I'll pick this up when I get back)., but it sounds like it. I'd like to check Dmitry's suggestions, mainly to see how they fit with my feel for "naive" (ie, at my beginner level, do I understand why they are more efficient). But I'd expect (compiled) Haskell to beat (interpreted) Python. That's sort of the point, really... The big measures for me are (1) by how much, and (2) how readable is the code.
Or is this with an alternate RNG? Although I think even that would be fair, since Python uses Mersenne.
I got the impression Dmitry was using Haskell's standard RNG, not Mersenne Twister. If so, then we'd get further improvements with MT, but that's still a hit against Haskell, as I'd interpret it as meaning that Haskell supplies as default a PRNG which costs noticeable performance in order to provide guarantees that "ordinary" programs don't need. Paul.

I got the impression Dmitry was using Haskell's standard RNG, not Mersenne Twister. If so, then we'd get further improvements with MT, but that's still a hit against Haskell, as I'd interpret it as meaning that Haskell supplies as default a PRNG which costs noticeable performance in order to provide guarantees that "ordinary" programs don't need.
Paul.
Yes, I used standard RNG just made corrections from the second Daniel Fischer's post. My file is attached. Commented strings are from original version and next lines are from Daniel. (Note that "simulation" and "simulate" types depend on type of "dice"). Just compile with ghc --make -O2 histogram.hs and run histogram +RTS -s As a pre-intermediate ;-) I just check some proposals...

Paul Moore wrote:
2009/8/5 Yitzchak Gale
: Or is this with an alternate RNG? Although I think even that would be fair, since Python uses Mersenne.
I got the impression Dmitry was using Haskell's standard RNG, not Mersenne Twister. If so, then we'd get further improvements with MT, but that's still a hit against Haskell, as I'd interpret it as meaning that Haskell supplies as default a PRNG which costs noticeable performance in order to provide guarantees that "ordinary" programs don't need.
I'm not sure how fair that is. For most "ordinary" programs I've written that use randomness anywhere, they have many different places that want randomness. Having a single global seed can suffice for this, but it introduces severe constraints that are seldom noticed. Namely, these are in fact *P*RNGs. If every place needing randomness is given its own seed, then it becomes possible to store all those seeds, allowing for replay. Replay can be good for capturing the history of a particular run of a simulation/game ---which is difficult at best in non-Haskell approaches. And more critically the seeds can be stored in a "core dump" allowing replay as a debugging tool, allowing reproduction of 'random' bugs which are otherwise quite difficult to track down. Imperative languages could use the same algorithms as Haskell, but they do not; which means these benefits of PRNGs are seldom even thought of. By separating concerns, the Haskell approach not only leads to cleaner code, but that separation can also be used to add new capabilities. The invisible manacles of imperativism should not be looked upon lightly. -- Live well, ~wren

I'm expecting the answer to be that I've got unnecessary laziness - which is fine, but ultimately my interest is in ease of expression and performance combined, so I'm looking for beginner-level improvements rather than subtle advanced techniques like unboxing.
You're right, it is too lazy. Here are a couple of strictifications that should help:
histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m
Turn foldl into foldl' (from Data.List) and Map.insertWith into Map.insertWith'. The strict versions simply force the intermediate structures to be evaluated, rather than hanging around as large accumulations of closures. Regards, Malcolm

Am Samstag 01 August 2009 15:44:39 schrieb Paul Moore:
2009/7/31 Paul Moore
: 2009/7/31 Gregory Collins
: Hmm, I'm obviously still mucking up the performance somehow. My full program (still a toy, but a step on the way to what I'm aiming at) is as follows. It's rolling 3 6-sided dice 100000 times, and printing a summary of the results.
import System.Random import qualified Data.Map as Map import Data.Map (Map) import Data.List
dice :: Int -> Int -> IO Int dice 0 n = return 0 dice m n = do total <- dice (m - 1) n roll <- randomRIO (1, n) return (total + roll)
Don't do too much in IO, it's better to separate the pure parts from the IO. IMO, this would better have signature dice :: RandomGen g => Int -> Int -> g -> (Int,g) dice 0 _ g = (0,g) dice m n g = case dice (m-1) n g of (total,g1) -> case randomR (1,n) g1 of (roll,g2) -> (total+roll,g2) or, better still be in a State monad or the Random monad ( http://hackage.haskell.org/package/MonadRandom ) die :: RandomGen g => Int -> State g Int die n = State $ randomR (1,n) dice :: RandomGen g => Int -> Int -> State g Int dice m n = liftM sum $ replicateM m (die n)
-- the "do" is superfluous
simulate count m n = do mapM (dice m) (replicate count n)
Ouch, that hurts (not yet so incredibly much for 100000 rolls, but if you try 1000000, it'll really hurt). Since you're doing it in IO, the whole list must be built before any further processing can begin, so you're building up a largish list, only to destroy it immediately afterwards, much work for the garbage collector. If you can accumulate the scores as they come, the intermediate list can be fused away and the garbage collector is kept idle. If you absolutely have to do it in IO, use import System.IO.Unsafe simulate 0 _ _ = return [] simulate count m n = unsafeInterleaveIO $ do val <- dice m n rst <- simulate (count-1) m n return (val:rst) to avoid building a large list. If you use the (lazy) State monad, that's automatically done :). simulate count m n = replicateM count (dice m n) -- now in State histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m -- simulation :: RandomGen g => State g [(Int,Int)] simulation = do lst <- simulate 1000000 3 6 return (histogram lst) main = do sg <- getStdGen print $ evalState simulation sg much faster, still not very fast, since StdGen isn't a particularly fast PRNG. Another method is to create an infinite list of random numbers and use it as needed: ------------------------------------------------------- module Main (main) where import System.Random import Data.Array.Unboxed import Data.List import System.Environment (getArgs) dice :: RandomGen g => g -> Int -> [Int] dice g mx = randomRs (1,mx) g splits :: Int -> [a] -> [[a]] splits l = unfoldr f where f xs = case splitAt l xs of r@(h,t) | null t -> Nothing | otherwise -> Just r simulation :: RandomGen g => g -> Int -> Int -> Int -> UArray Int Int simulation g rep dn df = accumArray (+) 0 (dn,dn*df) lst where rls = dice g df scs = splits dn rls lst = take rep [(sum rll,1) | rll <- scs] main :: IO () main = do (rp:dn:df:_) <- getArgs sg <- getStdGen print $ assocs $ simulation sg (read rp) (read dn) (read df) ------------------------------------------------------------- Using an unboxed array instead of a Map gives a little extra speed, but not much.
histogram :: Ord a => [a] -> [(a,Int)] histogram = Map.assocs . foldl f Map.empty where f m k = Map.insertWith (+) k 1 m
For some reason it doesn't make much difference here, but it should be the strict versions, foldl' and insertWith' in general.
simulation = do lst <- simulate 100000 3 6 return (histogram lst)
main = do s <- simulation putStrLn (show s)
When compiled, this takes over twice as long as a naively implemented Python program.
What am I doing wrong here? I'd have expected compiled Haskell to be faster than interpreted Python, so obviously my approach is wrong. I'm expecting the answer to be that I've got unnecessary laziness
Quite on the contrary, it's unnecessary strictness here :D
- which is fine, but ultimately my interest is in ease of expression and performance combined, so I'm looking for beginner-level improvements rather than subtle advanced techniques like unboxing.
Nothing advanced with using unboxed arrays.
Thanks, Paul.
PS I know my code is probably fairly clumsy
Actually, the style is rather good, I think (mine's worse, usually). You shouldn't use IO so much, though, and your code betrays a certain level of unfamiliarity with strictness/performance characteristics of the libraries. But that's natural.
- I'd appreciate style suggestions, but my main interest here is whether a beginner, with a broad programming background, a basic understanding of Haskell, and access to Google, put together a clear, efficient, program (ie, the case where my usual scripting language is too slow and I want to knock something up quickly in a high-level, high-performance language).
Performance is a nontrivial thing, it takes some experience to know which data structures to use when. And, as said above, Haskell's StdGen isn't fast, the above programme spends about 90% of the time creating pseudo random numbers.

In an imperative language like Python, I'd use a dictionary as an accumulator - something like
for el in input: accums[i] = accums.get(i, 0) + 1
Haskell has efficient dictionary structures too, e.g. Data.Map List.foldl' (\m x-> Map.insertWith' (+) x 1 m) Map.empty Regards, Malcolm

Am Freitag 31 Juli 2009 22:39:53 schrieb Paul Moore:
How would I efficiently write a function in Haskell to count occurrences of unique elements in a (potentially very large) list? For example, given the list [1,2,3,4,5,3,4,2,4] I would like the output [[1,1], [2,2], [3,2], [4,3], [5,1]] (or some equivalent representation).
Clearly, this won't be possible for infinite input, but I would like it to be as efficient as possible for (lazy) input lists containing many millions of elements. So an implementation based on group . sort is not going to work.
In an imperative language like Python, I'd use a dictionary as an accumulator - something like
for el in input: accums[i] = accums.get(i, 0) + 1
That's about as efficient as you can hope for (assuming an efficient dictionary implementation). How would I code something equivalent in Haskell?
If the elements come from a relatively small range and are suitable for array indexing, import Data.Array.IArray import Data.Array.Unboxed accumArray :: (IArray a e, Ix i) => (e -> e' -> e) -- ^ An accumulating function -> e -- ^ A default element -> (i,i) -- ^ The bounds of the array -> [(i, e')] -- ^ List of associations -> a i e -- ^ Returns: the array accumArray (+) 0 (mini,maxi) $ zip list (repeat 1) is pretty fast (beats the hell out of Data.Map). If your elements can't be unboxed, the accumArray function from Data.Array does it too, albeit much slower (still faster than Data.Map, in my experience).
Thanks, Paul.
participants (12)
-
Anton van Straaten
-
Daniel Fischer
-
Deniz Dogan
-
Dmitry Olshansky
-
Gregory Collins
-
Ketil Malde
-
Malcolm Wallace
-
Paul Moore
-
Pekka Karjalainen
-
Sterling Clover
-
wren ng thornton
-
Yitzchak Gale