
First of all: I'm not sure if this question is allowed here. If not, I apologize I'm trying to solve the following problem: For each word in a text find the number of occurences for each unique word in the text. i've come up with the following steps to solve this: * remove all punctuation except for whitespace and make the text lowercase * find all unique words in the text * for each unique word, count the number of occurences. This has resulted in the following code: removePunctuation :: [Char] -> [Char] removePunctuation str = filter (\c -> elem c (['a'..'z'] ++ ['A'..'Z'] ++ ['\t', ' ', '\n'])) str process :: [Char] -> [String] process str = words (map toLower (removePunctuation str)) unique :: (Eq a) => [a] -> [a] unique [] = [] unique (x:xs) = [x] ++ unique (filter (\s -> x /= s) xs) occurenceCount :: (Eq a) => a -> [a] -> Int occurenceCount _ [] = 0 occurenceCount x (y:ys) | x == y = 1 + occurenceCount x ys | otherwise = occurenceCount x ys occurenceCount' :: [String] -> [String] -> [(String, Int)] occurenceCount' [] _ = [("", 0)] occurenceCount' (u:us) xs = [(u, occurenceCount u xs)] ++ occurenceCount' us xs Please remember i've only been playing with Haskell for three afternoons now and i'm happy that the above code is working correctly. However i've got three questions: 1) occurenceCount' [] _ = [("", 0)] is plain ugly and also adds a useless tuple to the end result. Is there a better way to solve this? 2) I'm forcing elements into a singleton list on two occasions, both in my unique function and in my occurenceCount' function. Once again this seems ugly and I'm wondering if there is a better solution. 3) The whole process as i'm doing it now feels pretty imperatively (been working for years as a Java / PHP programmer). I've got this feeling that the occurenceCount' function could be implemented using a mapping function. What ways are there to make this more "functional"? -- View this message in context: http://old.nabble.com/-Newbie--What-to-improve-in-my-code-tp29156025p2915602... Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com.

On Tuesday 13 July 2010 23:49:45, Frank1981 wrote:
First of all: I'm not sure if this question is allowed here. If not, I apologize
I'm trying to solve the following problem: For each word in a text find the number of occurences for each unique word in the text.
i've come up with the following steps to solve this: * remove all punctuation except for whitespace and make the text lowercase * find all unique words in the text * for each unique word, count the number of occurences.
This has resulted in the following code: removePunctuation :: [Char] -> [Char] removePunctuation str = filter (\c -> elem c (['a'..'z'] ++ ['A'..'Z'] ++ ['\t', ' ', '\n'])) str
Depending on your criteria, maybe import Data.Char removePunctuation = filter (\c -> isAlpha c || isSpace c) is better
process :: [Char] -> [String] process str = words (map toLower (removePunctuation str))
Or perhaps process = map (fiter isLower) . words . map toLower
unique :: (Eq a) => [a] -> [a] unique [] = [] unique (x:xs) = [x] ++ unique (filter (\s -> x /= s) xs)
import Data.List unique = nub but it's not particularly efficient. If you don't need to keep the order of first occurrence and have an Ord instance, you could take unique' = map head . group . sort or import qualified Data.Set as Set unique'' = Set.toList . Set.fromList
occurenceCount :: (Eq a) => a -> [a] -> Int occurenceCount _ [] = 0 occurenceCount x (y:ys)
| x == y = 1 + occurenceCount x ys | otherwise = occurenceCount x ys
occurrenceCount a xs = length (filter (== a) xs) or occurrenceCount a = length . filter (== a)
occurenceCount' :: [String] -> [String] -> [(String, Int)] occurenceCount' [] _ = [("", 0)]
why not occurrenceCount' [] _ = [] ?
occurenceCount' (u:us) xs = [(u, occurenceCount u xs)] ++ occurenceCount' us xs
But it can be done shorter: import qualified Data.Map as Map import Data.List occurrenceCount'' :: Ord a => [a] -> [(a,Int)] occurrenceCount'' xs = Map.toList $ foldl' (\mp x -> Map.insertWith' (+) x 1 mp) Map.empty xs No need to get the unique elements up front.
Please remember i've only been playing with Haskell for three afternoons now and i'm happy that the above code is working correctly.
However i've got three questions: 1) occurenceCount' [] _ = [("", 0)] is plain ugly and also adds a useless tuple to the end result. Is there a better way to solve this? 2) I'm forcing elements into a singleton list on two occasions, both in my unique function and in my occurenceCount' function. Once again this seems ugly and I'm wondering if there is a better solution.
Use (:), e.g. unique (x:xs) = x : unique (filter (/= x) xs)
3) The whole process as i'm doing it now feels pretty imperatively (been working for years as a Java / PHP programmer). I've got this feeling that the occurenceCount' function could be implemented using a mapping function. What ways are there to make this more "functional"?

Daniel Fischer
First of all: I'm not sure if this question is allowed here. If not, I apologize
You might want to check out the haskell-beginners list, but IMO most questions are okay to post here. Just a couple of style issues Daniel didn't mention:
process :: [Char] -> [String] process str = words (map toLower (removePunctuation str))
It's a matter of taste, but I think this reads clearer if written: process = words . map toLower . removePunctuation
unique :: (Eq a) => [a] -> [a] unique [] = [] unique (x:xs) = [x] ++ unique (filter (\s -> x /= s) xs)
Also 'filter (\s -> x /= s)' can be written as filter (x /=)
import qualified Data.Map as Map import Data.List
occurrenceCount'' :: Ord a => [a] -> [(a,Int)] occurrenceCount'' xs = Map.toList $ foldl' (\mp x -> Map.insertWith' (+) x 1 mp) Map.empty xs
Note the primes here! This is perhaps my most common use of Map, and because of laziness, it is very easy to blow the stack. Although you really want to store an Int for each key, the default is to store an unevaluated computation, in this case a tower of (1+(1+(1+..))). The foldl' and insertWith' functions are stricter, and presumably Daniel gets this right (I'm never comfortable without testing this myself :).
3) The whole process as i'm doing it now feels pretty imperatively (been working for years as a Java / PHP programmer). I've got this feeling that the occurenceCount' function could be implemented using a mapping function. What ways are there to make this more "functional"?
I don't think I agree with this sentiment - you're building a pipeline of functions, not setting variables or otherwise mixing state or other imperativeness. Why do you think it's imperative? -k -- If I haven't seen further, it is by standing in the footprints of giants

On Tue, Jul 13, 2010 at 11:49 PM, Frank1981
First of all: I'm not sure if this question is allowed here. If not, I apologize
I'm trying to solve the following problem: For each word in a text find the number of occurences for each unique word in the text.
i've come up with the following steps to solve this: * remove all punctuation except for whitespace and make the text lowercase
A minor point: instead of removing the punctuation, you maybe should convert it to whitespace. Otherwise in texts like "there was a quick,brown fox" (notice the missing space after the comma) you'll have the word "quickbrown" instead of 2 words "quick" and "brown". David.

On Mon, Jul 19, 2010 at 9:24 AM, David Virebayre
A minor point: instead of removing the punctuation, you maybe should convert it to whitespace.
Otherwise in texts like "there was a quick,brown fox" (notice the missing space after the comma) you'll have the word "quickbrown" instead of 2 words "quick" and "brown".
If you remove punctuation you - run the risk of joining two valid words into one invalid word: "quick,brown" -> "quickbrown" - run the risk of converting one word into a different word: "can't" -> "cant" "won't" -> "wont" If you split at punctuation you create more semi-words: "can't" -> "can", "t" "shouldn't" -> "shouldn" "t" It might be better regarding in-word apostrophes as letters in this case? -- Dougal Stanton dougal@dougalstanton.net // http://www.dougalstanton.net
participants (5)
-
Daniel Fischer
-
David Virebayre
-
Dougal Stanton
-
Frank1981
-
Ketil Malde