Re: [Haskell-cafe] Nice way to calculate character frequency in a string

It seems like the fastest way to build a list of frequency from a string is this : import Data.Array (assocs, accumArray, Array) frequency :: String -> [(Char,Int)] frequency = filter (\f -> snd f>0) . assocs . accumArray (+) 0 ('\0', '\255') . map (\x->(x,1)) -- (~1.3 sec on a string of 700ko) But if we want to make this polymorphic we have to do this : import Data.Ix (Ix) frequency :: (Ix a, Bounded a) => [a] -> [(a,Int)] frequency = filter (\f -> snd f>0) . assocs . accumArray (+) 0 (minBound, maxBound) . map (\x->(x,1)) -- (>5 sec on a string of 700ko) but it's really much slower, it seems like min/maxBound are check each time or I dont know, and it's restrictive on the type of the elements of the list. The best polymorphic version seems to be : import Data.Map as Map (lookup, assocs, empty, insertWith) frequency :: (Ord a) => [a] -> [(a,Int)] frequency xs = Map.assocs $ foldl f Map.empty xs where f m x = let m' = Map.insertWith (+) x 1 m Just v = Map.lookup x m' in seq v m' -- (~2.1 sec on a string of 700ko) Which is not as fast as the specialized version on string, but really faster and less restrictive than the second. Thanks everyone for all your answers ! Charles

On 10/25/05, Charles SDudu
It seems like the fastest way to build a list of frequency from a string is this : import Data.Array (assocs, accumArray, Array) frequency :: String -> [(Char,Int)] frequency = filter (\f -> snd f>0) . assocs . accumArray (+) 0 ('\0', '\255') . map (\x->(x,1)) -- (~1.3 sec on a string of 700ko)
But if we want to make this polymorphic we have to do this : import Data.Ix (Ix) frequency :: (Ix a, Bounded a) => [a] -> [(a,Int)] frequency = filter (\f -> snd f>0) . assocs . accumArray (+) 0 (minBound, maxBound) . map (\x->(x,1)) -- (>5 sec on a string of 700ko) but it's really much slower, it seems like min/maxBound are check each time or I dont know, and it's restrictive on the type of the elements of the list.
Most likely maxBound is waaaaaay larger than '\255'. On my system it's '\1114111'. So it's not really a fair comparison. Perhaps the frequency function should just take the bounds as a parameter? Also, you may want to look into making the array you're creating unboxed (should work by just importing Data.Array.Unboxed instead, or giving the type of the result of accumArray "UArray a Int" explicitly). On my system this gave about 40% better performance on some random text file I found. Also, you may use STArrays (I think they come in unboxed as well) for stateful code, which may be even faster (unless accumArray does some neat trick to make it O(m) where m is the number of index/value pairs). /S -- Sebastian Sylvan +46(0)736-818655 UIN: 44640862

Sebastian Sylvan wrote:
Also, you may use STArrays (I think they come in unboxed as well) for stateful code, which may be even faster (unless accumArray does some neat trick to make it O(m) where m is the number of index/value pairs).
The whole idea with having accumArray as part of the Array signature is exactly that it should be O(m). It's an operation you cannot define in "pure Haskell", i.e., Haskell without any extra types like STArray. -- Lennart
participants (3)
-
Charles SDudu
-
Lennart Augustsson
-
Sebastian Sylvan