Nice way to calculate character frequency in a string

Hello,

I need to calculate the frequency of each character in a String. I can do this really well in C, but I can't find a nice (and fast) answer in Haskell. I tried several functions, listed below, and even the fastest does a lot of unnecessary things:

calc :: String -> [ (Char, Int) ]

-- 3.0s normally (without profiling)
-- time 10-12% alloc 59% (info from profiling)
-- so it's the fastest when I profile but not when I compile normally
-- mutable array may be better but it's too complicated for me
calc = filter (\p -> snd p > 0) . assocs . foldl (\k c -> unsafeReplace k [(fromEnum c, (unsafeAt k (fromEnum c))+1)] ) k
  where k = array (toEnum 0, toEnum 255) [(toEnum i, 0) | i <- [0 .. 255]] :: UArray Char Int

-- 2.1s normally
-- time 15-19% alloc 40% (info from profiling)
-- so in practice it's the best, but the sort and group probably do unnecessary things
calc s = map (\l -> (head l, length l)) $ group $ sort s

-- 3.4s normally
-- time 58% alloc 0% (info from profiling)
-- this one doesn't do unnecessary things but has to read the file again for each character
-- calc s = map (\c -> (c, foldl (\a b -> if b==c then a+1 else a) 0 s)) $ nub s

-- 22s normally
-- time 85% alloc 92% (info from profiling)
-- this one reads the file only once but is really slow
calc = foldl (addfreq) []
  where addfreq f c = let xs1 = takeWhile (\f -> fst f /= c) f
                          xs2 = dropWhile (\f -> fst f /= c) f
                          xs = if null xs2 then [(c,1)] else ((fst . head) xs2, (snd . head) xs2 + 1) : tail xs2
                      in xs1 ++ xs

-- I have a lot of even slower versions but I won't include them
-- each compilation was done with GHC 6.4.1 with the -O flag and with -O -prof -auto-all for profiling

Thanks for your answer,
Charles

PS: Yes, English is not my mother language... :-(

Two solutions using immutable and mutable arrays and no unsafe operations:

----
module Main where
import Control.Monad.ST
import Data.Ix
import Data.Array
import Data.Array.MArray
import Data.Array.ST

-- using immutable arrays
hist1 :: String -> Array Char Int
hist1 str = accumArray (+) 0 ('\0','\255') [(c,1) | c<-str]

freq1 :: String -> [(Char,Int)]
freq1 = assocs . hist1

-- using mutable ST arrays
hist2 :: String -> STArray s Char Int -> ST s ()
hist2 str arr = sequence_ [do { i<-readArray arr c; writeArray arr c (1+i) } | c<-str]

freq2 :: String -> [(Char,Int)]
freq2 str = runST (do { arr<-newArray ('\0', '\255') 0 ; hist2 str arr ; getAssocs arr })

--
Cheers,
Pedro

Pedro Baltazar Vasconcelos wrote:
Two solutions using immutable and mutable arrays and no unsafe operations:
Both solutions certainly count as nice, but both exhibit an ugly memory leak. As usual, this is due to too much laziness: no intermediate result is ever evaluated until it is too late. GHCi dies on both (freq1 $ replicate 1000000 'x') and (freq2 $ replicate 1000000 'x') with a stack overflow.

Both versions can be fixed by using unboxed arrays, which is fine when counting Ints, but already impossible with Integers. The mutable version has an easy general fix:
-- using mutable ST arrays
hist2 :: String -> STArray s Char Int -> ST s ()
hist2 str arr = sequence_ [do { i<-readArray arr c; writeArray arr c $! 1+i } | c<-str]

Note the strict application ($!) in the call to writeArray. What I find unsettling is that the nice solution, the only one not to rely on GHC-specific extensions, cannot be fixed:
hist1 str = accumArray (+) 0 ('\0','\255') [(c,1) | c<-str]
No amount of strictness annotations can fix this; the correct place would be *inside* accumArray. The same problem arises with other containers that provide accum-like functions, notably Data.Map.insertWith and Data.Map.mapAccum.

That raises the question: Should combining functions on containers be provided in a strict variant? Should strict application be the default? If something turns out to be too strict, I can always wrap it in a data type; if it turns out to be too lazy, I'm hosed. Or am I overlooking something?

Udo.
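The unboxed-array fix mentioned above can be sketched as follows. This is only an illustration, assuming Latin-1 input as elsewhere in the thread; the names histU and freqU are hypothetical and do not come from any of the posts. Because a UArray stores raw Ints, there is no place for an unevaluated (+) to hide, so the counts are computed as the input is consumed.

```haskell
import Data.Array.Unboxed (UArray, accumArray, assocs)

-- Unboxed variant of hist1. Elements of a UArray Char Int are plain
-- machine Ints, so every (+) is carried out when the element is
-- written instead of being stored as a thunk.
-- Assumption: all characters lie in '\0'..'\255'; anything else is
-- an out-of-bounds index and raises an error.
histU :: String -> UArray Char Int
histU str = accumArray (+) 0 ('\0', '\255') [(c, 1) | c <- str]

-- Keep only the characters that actually occur in the input.
freqU :: String -> [(Char, Int)]
freqU = filter ((> 0) . snd) . assocs . histU
```

As noted in the message, this only helps for element types that can be unboxed, so it does not answer the general question about strict combining functions.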

Udo Stenzel wrote:
That raises the question: Should combining functions on containers be provided in a strict variant? Should strict application be the default?
With the exception of lists, I generally tend to want strict behavior for collections. Combined with the principle of least surprise, my preference would be strict defaults with lazy alternatives (i.e. Data.Map is strict, Data.Map.Lazy is - well, you get the picture :-).

(I also still think that unboxed arrays could be replaced by strict arrays -- if compilers are sufficiently able to unbox strict elements, that is. AFAICT, unboxing is an implementation issue, strictness is the semantic part of it, and that is what should be reflected in the interface)

-k
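For illustration, here is a minimal sketch of what counting with a value-strict map looks like. It assumes the Data.Map.Strict module from the containers package, which is not mentioned anywhere in this thread; freqMap and bump are hypothetical names chosen for the example.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Count characters with a value-strict map. insertWith from
-- Data.Map.Strict evaluates the combined count before storing it,
-- and foldl' forces the map at every step, so no chain of pending
-- additions builds up behind a key.
freqMap :: String -> [(Char, Int)]
freqMap = M.toAscList . foldl' bump M.empty
  where
    -- Add one occurrence of c to the running counts.
    bump m c = M.insertWith (+) c 1 m
```

Unlike the array versions, this places no bound on the character range, at the cost of a logarithmic lookup per character.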

On 10/25/05, Charles SDudu wrote:
Hello, I need to calculate the frequency of each character in a String. I can do this really well in C, but I can't find a nice (and fast) answer in Haskell. I tried several functions, listed below, and even the fastest does a lot of unnecessary things:
Try this:
import Data.Array.ST
import Data.Array.Base
import Control.Monad
import Control.Monad.ST
import Data.Word
import Data.Char

main = do c <- getContents
          print (frequency c)

frequency str = runST (frequency' str)

frequency' :: String -> ST s [(Char,Int)]
frequency' str = do
  arr <- newArray ('\0','\255') 0 :: ST s (STUArray s Char Int)
  flip mapM_ str $ \c -> do
    r <- unsafeRead arr (ord c)
    unsafeWrite arr (ord c) (r+1)
  liftM (filter (\(c,n) -> n>0)) (getAssocs arr)

--
Friendly,
Lemmih

On 2005-10-25 at 12:20+0200 Lemmih wrote:
On 10/25/05, Charles SDudu wrote:
Hello, I need to calculate the frequency of each character in a String. I can do this really well in C, but I can't find a nice (and fast) answer in Haskell. I tried several functions, listed below, and even the fastest does a lot of unnecessary things:

calc :: String -> [ (Char, Int) ]
calc = filter (\p -> snd p > 0) . assocs . foldl (\k c -> unsafeReplace k [(fromEnum c, (unsafeAt k (fromEnum c))+1)] ) k
  where k = array (toEnum 0, toEnum 255) [(toEnum i, 0) | i <- [0 .. 255]] :: UArray Char Int
[snip even more disagreeable code]

Ugh! These are all horrid. If something on the lines of
calc = accumArray (+) 0 (minBound, maxBound) . (map (\x->(x,1)))
isn't fast enough, complain to the implementors! What's the point of functional programming if one has to twist into a shape that allows inspection of one's own fundament to get stuff to run in decent time?

Jón
--
Jón Fairbairn Jon.Fairbairn at cl.cam.ac.uk

On Tuesday 25 Oct 2005 10:40 am, Charles SDudu wrote:
Hello, I need to calculate the frequency of each character in a String.
Something like this maybe (untested code ahead)..

import Data.COrdering
import Data.Tree.AVL

calc :: String -> [(Char,Int)]
calc cs = asListL (genAsTree cc [(c,1) | c <- cs])
 where cc (c0,n0) (c1,n1) = case compare c0 c1 of
                            LT -> Lt
                            GT -> Gt
                            EQ -> let n=n0+n1 in n `seq` Eq (c0,n)

But this uses packages you have to install yourself..
http://homepages.nildram.co.uk/~ahey/HLibs/Data.COrdering/
http://homepages.nildram.co.uk/~ahey/HLibs/Data.Tree.AVL/

You could probably do something similar with Data.Map or Data.IntMap

Regards
--
Adrian Hey
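A rough sketch of the Data.IntMap variant suggested above, assuming the Data.IntMap.Strict module from the containers package (an assumption, since the message names only Data.IntMap); calcIM is a hypothetical name. It follows the same shape as the AVL version: build the structure from a list of (key, 1) pairs and let a combining function add the counts for equal keys.

```haskell
import Data.Char (chr, ord)
import qualified Data.IntMap.Strict as IM

-- Build the counts from (code point, 1) pairs. fromListWith in
-- Data.IntMap.Strict evaluates each combined value as it is
-- inserted, playing the role of the `seq` in the AVL version.
calcIM :: String -> [(Char, Int)]
calcIM cs = [ (chr k, n) | (k, n) <- IM.toAscList counts ]
  where
    counts = IM.fromListWith (+) [ (ord c, 1) | c <- cs ]
```

Since ord is monotonic, the ascending key order of the IntMap gives the same character order as the sorted versions earlier in the thread.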
participants (7)
- Adrian Hey
- Charles SDudu
- Jon Fairbairn
- Ketil Malde
- Lemmih
- Pedro Baltazar Vasconcelos
- Udo Stenzel