
Another newbie question. The following is a naive attempt at a word count function that takes a string and returns a list of word/count pairs.

import Char

isLetter c = isAlphaNum c || c=='\''
normalize = map toLower . filter isLetter

isWord [] = False
isWord s = True

nwords = filter isWord . map normalize . words

update w [] = [(w, 1)]
update w ((x, n):xs) = if x==w then (x, n+1) : xs else (x, n) : (update w xs)

wordCount = foldr update [] . nwords

It works fine, but it's obviously very inefficient. I could make it less inefficient by implementing the word/count dictionary as a hash table or binary tree, but that's what standard libraries are for. So - what is there in the standard libraries that I could use in this scenario? I've looked at the Array module, but I don't think it's quite suitable (we don't know the array bounds ahead of time, and the indices are strings). Data.HashTable looks better, but it involves IO as it mutates the dictionary entries in place. Is there an alternative that could be used with a fold as in the code above?

Dominic
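(For illustration only: the same fold can be written against Data.Map, the purely functional finite-map library shipped with later GHC releases; it isn't used anywhere in the code above. Map.insertWith looks up and updates a key in a single O(log n) step, so the whole count becomes O(n log n) rather than quadratic in the number of words.)

import Data.Char
import qualified Data.Map as Map

isLetter c = isAlphaNum c || c=='\''
normalize = map toLower . filter isLetter
nwords = filter (not . null) . map normalize . words   -- same effect as the isWord filter above

wordCount :: String -> [(String, Int)]
wordCount = Map.toList . foldr (\w m -> Map.insertWith (+) w 1 m) Map.empty . nwords

(For example, wordCount "the quick brown fox the" gives [("brown",1),("fox",1),("quick",1),("the",2)], since Map.toList returns entries in key order.)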

Hello,
I found the following implementation using finite maps to work rather
well. (See http://www.haskell.org/ghc/docs/latest/html/libraries/base/Data.FiniteMap.ht...)
-------------------
import Char
import Data.FiniteMap
isLetter c = isAlphaNum c || c=='\''
normalize = map toLower . filter isLetter
nwords = filter (not . null) . map normalize . words
wordCount w = addListToFM_C (+) emptyFM (zip (nwords w) [1,1 ..])
main = do
    s <- getContents
    mapM_ print $ fmToList $ wordCount s
-----------------
On a file with 264724 words and 17391 distinct words, this had the
following times:
real 0m38.208s
user 0m4.302s
sys 0m0.214s
It was compiled with -O using ghc 6.2.2. I had to increase the stack
size slightly with a command-line option.
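(The message doesn't say which option was used; one common way to raise the stack limit of a GHC-compiled program at run time is the -K RTS flag, e.g.

./wordcount +RTS -K32m -RTS < input.txt

where the program and file names here are only placeholders.)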
Most of the time was actually spent rendering the output. I didn't
bother to wait for the original code to finish, as it was reporting
only a few distinct words per second.
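(For illustration: if the output itself is the bottleneck, redirecting stdout to a file avoids the terminal-rendering cost entirely, and the program's own writes can also be batched. A sketch of the latter, assuming System.IO and reusing wordCount and fmToList from the listing above, follows.)

-------------------
import System.IO

main = do
    -- write in large chunks instead of line by line
    hSetBuffering stdout (BlockBuffering Nothing)
    s <- getContents
    -- build the whole report as one string and write it once
    putStr . unlines . map show . fmToList . wordCount $ s
-------------------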
Hope this helps
- Cale

Hello again,
I actually ran the original code (on the same file, still using
gnome-terminal). I noticed that it sped up as it went along, so it
didn't take quite as much time as I originally thought it might.
real 13m5.190s
user 6m1.595s
sys 0m3.061s
By the way, the finitemap code, when run from the Linux console as
opposed to a gnome-terminal, gives 5.189s of real time. (The disparity
between real and user times really was gnome-terminal's fault.)
- Cale