Suspected stupid Haskell Question
Is there a library function to take a list of Strings and return a list of ints showing how many times each String occurs in the list. So for example: ["egg", "egg", "cheese"] would return [2,1] I couldn't find anything on a search, or anything in the librarys. Thanks BH. -- View this message in context: http://www.nabble.com/Suspected-stupid-Haskell-Question-tf4639170.html#a1325... Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com.
I'm a newbie here, so I'm not sure about my reply, but I think this is not the answer to his question. freq ["egg", "egg", "cheese"] indeed returns [2,1] but freq ["egg", "cheese", "egg"] returns [1,1,1] BH just mentioned he needed the frequenty of elements in the list, independent of their order. So in that case, the result should be a list of ordered pairs like: [("egg", 2), ("cheese", 1)]. Or a pair of two lists, like (["egg", "cheese"), (2,1)]. Otherwise you would not know which frequency belongs to which element? I can't write this concisely nor efficient yet, but the following does the job: import Data.List freq xs = zip e f where s = sort xs e = nub s f = map length (group s) However, I suspect the experts here will be able to make that much shorter and more efficient (maybe using Data.Map?) Peter Stefan Holdermans wrote:
BH,
Is there a library function to take a list of Strings and return a list of ints showing how many times each String occurs in the list.
So for example:
["egg", "egg", "cheese"] would return [2,1]
freq xs = map length (group xs)
HTH,
Stefan _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
you are indeed right Peter, that's what I was after, the frequency regardless of elements. It also doesn't matter if it outputs them as tuples, or as a separate list on their own because each value would belong to the first occurance of that element if you seem what I mean, so you could still tell what came from what. Peter Verswyvelen wrote:
I'm a newbie here, so I'm not sure about my reply, but I think this is not the answer to his question.
freq ["egg", "egg", "cheese"] indeed returns [2,1]
but
freq ["egg", "cheese", "egg"] returns [1,1,1]
BH just mentioned he needed the frequenty of elements in the list, independent of their order.
So in that case, the result should be a list of ordered pairs like: [("egg", 2), ("cheese", 1)]. Or a pair of two lists, like (["egg", "cheese"), (2,1)]. Otherwise you would not know which frequency belongs to which element?
I can't write this concisely nor efficient yet, but the following does the job:
import Data.List
freq xs = zip e f where s = sort xs e = nub s f = map length (group s)
However, I suspect the experts here will be able to make that much shorter and more efficient (maybe using Data.Map?)
Peter
Stefan Holdermans wrote:
BH,
Is there a library function to take a list of Strings and return a list of ints showing how many times each String occurs in the list.
So for example:
["egg", "egg", "cheese"] would return [2,1]
freq xs = map length (group xs)
HTH,
Stefan _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- View this message in context: http://www.nabble.com/Suspected-stupid-Haskell-Question-tf4639170.html#a1325... Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com.
On 10/17/07, Peter Verswyvelen
So in that case, the result should be a list of ordered pairs like: [("egg", 2), ("cheese", 1)]. Or a pair of two lists, like (["egg", "cheese"), (2,1)]. Otherwise you would not know which frequency belongs to which element?
However, I suspect the experts here will be able to make that much shorter and more efficient (maybe using Data.Map?)
import Control.Arrow import Data.List freqs = map (head &&& length) . group . sort I have used this function quite a few times already. Stuart
Nice!!! As I'm learning Arrows now, this is really useful :-) Stuart Cook wrote:
import Control.Arrow import Data.List
freqs = map (head &&& length) . group . sort
I have used this function quite a few times already.
Stuart _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Oh, why didn't you say you were learning Arrows? Then why not freqs = sort >>> group >>> map (head &&& length) So much more readable, don't you think? ;) Either way, if you run into the dreaded monomorphism restriction: Ambiguous type variable `a' in the constraint: `Ord a' arising from use of `sort' at A.hs:6:40-43 Possible cause: the monomorphism restriction applied to the following: freqs :: [a] -> [(a, Int)] (bound at A.hs:6:0) Probable fix: give these definition(s) an explicit type signature or use -fno-monomorphism-restriction you'll have to either add an explicit type annotation: freqs :: (Ord a) => [a] -> [(a, Int)] or else throw an arg onto it: freqs x = map (head &&& length) . group . sort $ x The latter hurts too much to write, so I always add the type. Peter Verswyvelen wrote:
Nice!!! As I'm learning Arrows now, this is really useful :-)
Stuart Cook wrote:
import Control.Arrow import Data.List
freqs = map (head &&& length) . group . sort
I have used this function quite a few times already.
Stuart _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Peter Verswyvelen wrote:
However, I suspect the experts here will be able to make that much shorter and more efficient (maybe using Data.Map?)
That makes it difficult to respond. I am definitely not claiming to be an expert. For one thing, my name is not Simon. But I'll say something anyway, fwiw. The problem there is that nub is O(n^2). You're stuck with that if your type is not an instance of Ord (but then you can't use sort, either). When you can assume Ord, the standard solution is, as you suggest, something like: import qualified Data.Map as M import Data.List histogram = M.toList . foldl' (\m x -> M.insertWith' (+) x 1 m) M.empty . M.fromList This should work efficiently, with the right amount of laziness, even for very large lists. Regards, Yitz
I wrote:
When you can assume Ord, the standard solution is, as you suggest, something like...
Oops, sorry, doesn't typecheck. Here it is corrected:
import qualified Data.Map as M import Data.List
histogram = M.toList . foldl' (\m x -> M.insertWith' (+) x 1 m) M.empty
This should work efficiently, with the right amount of laziness, even for very large lists.
Stuart's Arrows thing is much nicer when your list is small enough to be held in memory all at once. -Yitz
On 17/10/2007, Big_Ham
Is there a library function to take a list of Strings and return a list of ints showing how many times each String occurs in the list.
So for example:
["egg", "egg", "cheese"] would return [2,1]
I couldn't find anything on a search, or anything in the librarys.
Thanks BH.
No, but it is also trivial to create, with the 'group' function in Data.List. I'll stop there though, cos this could be a homework question. Cheers, D.
On 17/10/2007, Dougal Stanton
No, but it is also trivial to create, with the 'group' function in Data.List. I'll stop there though, cos this could be a homework question.
It's just occurred to me that answering questions like these is a bit like the prisoner's dilemma. - If I give the full answer and no-one else does, then maybe I'm doing someone's homework for them? - If I just give clues and someone else gives the answer, it makes me look mean. :-( - If we all give the answer, everybody's happy and the blame (if it was a set question) is spread around a bit. - If we all answer with vague hints then it makes the list as a whole less useful and seem a bit arrogant. There's no way to win! :-) D.
Dougal Stanton wrote:
It's just occurred to me that answering questions like these is a bit like the prisoner's dilemma... There's no way to win! :-)
Yes there is. Just mention the following wiki page as part of your answer: http://haskell.org/haskellwiki/Homework_help -Yitz
Big_Ham
Is there a library function to take a list of Strings and return a list of ints showing how many times each String occurs in the list.
So for example:
["egg", "egg", "cheese"] would return [2,1]
I couldn't find anything on a search, or anything in the librarys.
Thanks BH.
Hi BH, This might be overkill, but it works well for me. And it avoid stack overflows I was originally getting for very large lists. Dean Herrington and I came up with this: ordTable :: (Ord a) => [a] -> [(a,Int)] ordTable xs = Map.assocs $! foldl' f Map.empty xs where f m x = let m' = Map.insertWith (+) x 1 m Just v = Map.lookup x m' in v `seq` m' intTable :: [Int] -> [(Int,Int)] intTable xs = IntMap.assocs $! foldl' f IntMap.empty xs where f m x = let m' = IntMap.insertWith (+) x 1 m Just v = IntMap.lookup x m' in v `seq` m' enumTable :: (Enum a) => [a] -> [(a,Int)] enumTable = map fstToEnum . intTable . map fromEnum where fstToEnum (x,y) = (toEnum x, y) If you like, it's easily wrapped in a Table class. Chad
Since I'm interested in the stack overflow issue, and getting acquainted
with quickcheck, I thought I would take this opportunity to compare your
ordTable with some code Yitzchak Gale posted earlier, against Ham's
original problem.
As far as I can tell, they're the same. They work on lists up to 100000
element lists of strings, but on 10^6 size lists I lose patience waiting
for them to finish.
Is there a more scientific way of figuring out if one version is better
than the other by using, say profiling tools?
Or by reasoning about the code?
t.
****************************************
import Data.List
import qualified Data.Map as M
import Control.Arrow
import Test.QuickCheck
import Test.GenTestData
import System.Random
{-
Is there a library function to take a list of Strings and return a list of
ints showing how many times each String occurs in the list.
So for example:
["egg", "egg", "cheese"] would return [2,1]
-}
testYitzGale n = do
l <- rgenBndStrRow (10,10) (10^n,10^n) -- 100000 strings, strings are
10 chars long, works. craps out on 10^6.
m <- return $ freqFold l
putStrLn $ "map items: " ++ ( show $ M.size m )
testCScherer n = do
l <- rgenBndStrRow (10,10) (10^n,10^n) -- same limitations as yitz gale
code.
m <- return $ ordTable l
putStrLn $ "items: " ++ ( show $ length m )
-- slow for big lists
--freqArr = Prelude.map ( last &&& length ) . group . sort
-- yitz gale code. same as chad scherer code? it's simpler to understand,
but is it as fast?
freqFold :: [[Char]] -> M.Map [Char] Int
freqFold = foldl' g M.empty
where g accum x = M.insertWith' (+) x 1 accum
-- c scherer code. insists on ord. far as I can tell, same speed as yitz.
ordTable :: (Ord a) => [a] -> [(a,Int)]
ordTable xs = M.assocs $! foldl' f M.empty xs
where f m x = let m' = M.insertWith (+) x 1 m
Just v = M.lookup x m'
in v `seq` m'
l = ["egg","egg","cheese"]
-- other quickcheck stuff
--prop_unchanged_by_reverse = \l -> ( freqArr (l :: [[Char]]) ) == (
freqArr $ reverse l )
--prop_freqArr_eq_freqFold = \l -> ( freqArr (l :: [[Char]]) == (freqFold
l))
--test1 = quickCheck prop_unchanged_by_reverse
--test2 = quickCheck prop_freqArr_eq_freqFold
--------------- generate test data:
genBndStrRow (minCols,maxCols) (minStrLen, maxStrLen) = rgen ( genBndLoL
(minStrLen, maxStrLen) (minCols,maxCols) )
gen gen = do
sg <- newStdGen
return $ generate 10000 sg gen
-- generator for a list with length between min and max
genBndList :: Arbitrary a => (Int, Int) -> Gen [a]
genBndList (min,max) = do
len <- choose (min,max)
vector len
-- lists of lists
--genBndLoL :: (Int, Int) -> (Int, Int) -> Gen [[a]]
genBndLoL (min1,max1) (min2,max2) = do
len1 <- choose (min1,max1)
len2 <- choose (min2,max2)
vec2 len1 len2
--vec2 :: Arbitrary a => Int -> Int -> Gen [[a]]
vec2 n m = sequence [ vector m | i <- [1..n] ]
Chad Scherrer
Is there a library function to take a list of Strings and return a list
of
ints showing how many times each String occurs in the list.
So for example:
["egg", "egg", "cheese"] would return [2,1]
I couldn't find anything on a search, or anything in the librarys.
Thanks BH.
Hi BH, This might be overkill, but it works well for me. And it avoid stack overflows I was originally getting for very large lists. Dean Herrington and I came up with this: ordTable :: (Ord a) => [a] -> [(a,Int)] ordTable xs = Map.assocs $! foldl' f Map.empty xs where f m x = let m' = Map.insertWith (+) x 1 m Just v = Map.lookup x m' in v `seq` m' intTable :: [Int] -> [(Int,Int)] intTable xs = IntMap.assocs $! foldl' f IntMap.empty xs where f m x = let m' = IntMap.insertWith (+) x 1 m Just v = IntMap.lookup x m' in v `seq` m' enumTable :: (Enum a) => [a] -> [(a,Int)] enumTable = map fstToEnum . intTable . map fromEnum where fstToEnum (x,y) = (toEnum x, y) If you like, it's easily wrapped in a Table class. Chad _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe --- This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden.
Hmm, is insertWith' new? If I remember right, I think the stack overflows
were happening because Map.insertWith isn't strict enough. Otherwise I think
the code is the same. But I would expect intTable to be faster, since it
uses IntMap, and there's no IntMap.insertWith' as of 6.6.1 (though it may be
easy enough to add one).
Chad
On 10/17/07, Thomas Hartman
Since I'm interested in the stack overflow issue, and getting acquainted with quickcheck, I thought I would take this opportunity to compare your ordTable with some code Yitzchak Gale posted earlier, against Ham's original problem.
As far as I can tell, they're the same. They work on lists up to 100000 element lists of strings, but on 10^6 size lists I lose patience waiting for them to finish.
Is there a more scientific way of figuring out if one version is better than the other by using, say profiling tools?
Or by reasoning about the code?
t.
****************************************
import Data.List import qualified Data.Map as M import Control.Arrow import Test.QuickCheck import Test.GenTestData import System.Random
{- Is there a library function to take a list of Strings and return a list of ints showing how many times each String occurs in the list.
So for example:
["egg", "egg", "cheese"] would return [2,1] -}
testYitzGale n = do l <- rgenBndStrRow (10,10) (10^n,10^n) -- 100000 strings, strings are 10 chars long, works. craps out on 10^6. m <- return $ freqFold l putStrLn $ "map items: " ++ ( show $ M.size m )
testCScherer n = do l <- rgenBndStrRow (10,10) (10^n,10^n) -- same limitations as yitz gale code. m <- return $ ordTable l putStrLn $ "items: " ++ ( show $ length m )
-- slow for big lists --freqArr = Prelude.map ( last &&& length ) . group . sort
-- yitz gale code. same as chad scherer code? it's simpler to understand, but is it as fast? freqFold :: [[Char]] -> M.Map [Char] Int freqFold = foldl' g M.empty where g accum x = M.insertWith' (+) x 1 accum -- c scherer code. insists on ord. far as I can tell, same speed as yitz. ordTable :: (Ord a) => [a] -> [(a,Int)] ordTable xs = M.assocs $! foldl' f M.empty xs where f m x = let m' = M.insertWith (+) x 1 m Just v = M.lookup x m' in v `seq` m'
l = ["egg","egg","cheese"]
-- other quickcheck stuff --prop_unchanged_by_reverse = \l -> ( freqArr (l :: [[Char]]) ) == ( freqArr $ reverse l ) --prop_freqArr_eq_freqFold = \l -> ( freqArr (l :: [[Char]]) == (freqFold l)) --test1 = quickCheck prop_unchanged_by_reverse --test2 = quickCheck prop_freqArr_eq_freqFold
--------------- generate test data: genBndStrRow (minCols,maxCols) (minStrLen, maxStrLen) = rgen ( genBndLoL (minStrLen, maxStrLen) (minCols,maxCols) )
gen gen = do sg <- newStdGen return $ generate 10000 sg gen
-- generator for a list with length between min and max genBndList :: Arbitrary a => (Int, Int) -> Gen [a] genBndList (min,max) = do len <- choose (min,max) vector len
-- lists of lists --genBndLoL :: (Int, Int) -> (Int, Int) -> Gen [[a]] genBndLoL (min1,max1) (min2,max2) = do len1 <- choose (min1,max1) len2 <- choose (min2,max2) vec2 len1 len2
--vec2 :: Arbitrary a => Int -> Int -> Gen [[a]] vec2 n m = sequence [ vector m | i <- [1..n] ]
Hi Chad, Chad Scherrer wrote:
I think the stack overflows were happening because Map.insertWith isn't strict enough. Otherwise I think the code is the same.
They are visibly almost identical - except that you do an extra lookup to get your strictness, while insertWith' has internal access and can do it in the same pass. So using insertWith' should be faster.
But I would expect intTable to be faster, since it uses IntMap,
I'm not sure if that's strictly true. I have never done any testing, but I get the feeling that the performance of IntMap may depend on the distribution of the keys. After some not so good experiences, I abandonned IntMap and just use Map. Anyone know? Also - I don't have insertWith' for IntMap.
and there's no IntMap.insertWith' as of 6.6.1
The mtl package is now independent of GHC. Regards, Yitz
But I would expect intTable to be faster,
But if I understand correctly, intTable can only deal with integer keys,
whereas BH's original question would have wanted string keys, and I can't
see a way to convert string to int and back.
t.
"Chad Scherrer"
Thomas Hartman wrote:
Since I'm interested in the stack overflow issue, and getting acquainted with quickcheck, I thought I would take this opportunity to compare your ordTable with some code Yitzchak Gale posted earlier, against Ham's original problem.
As far as I can tell, they're the same. They work on lists up to 100000 element lists of strings, but on 10^6 size lists I lose patience waiting for them to finish.
Is there a more scientific way of figuring out if one version is better than the other by using, say profiling tools?
Or by reasoning about the code?
No, measuring actual performance is the only way.
t.
****************************************
import Data.List import qualified Data.Map as M import Control.Arrow import Test.QuickCheck import Test.GenTestData
I couldn't find this one, so I was unable to test anything. [snip] Some ideas: - Your code doesn't contain a main function. Did you compile it? - Strings are lists; storing a string of n characters needs 12*n bytes on 32 bit architectures, and 24*n bytes with 64 bits. 1 million strings with 10 characters each will consume 120MB (or 240MB), without accounting for any overhead for the copying garbage collector. I expect that you can save some memory (and thus garbage collection time) by using Data.ByteString.ByteString (which uses about 32 + n (32 bits) or 64 + n bytes (64 bits), if I remember correctly). - On using Data.Map vs. sort and group: The main advantage of the first approach is that duplicates are eliminated as they are found. So in fact, if you have a list of only 100 different strings, your code will run in constant memory, assuming the list is built lazily. Your test case looks like there are only few duplicates, so I'd expect the Data.Map code to perform a bit worse than the Data.List one. But as I wrote above, profiling is the only way to find out for sure. Bertram
Thomas Hartman wrote:
Since I'm interested in the stack overflow issue, and getting acquainted with quickcheck, I thought I would take this opportunity to compare your ordTable with some code Yitzchak Gale posted earlier, against Ham's original problem.
As far as I can tell, they're the same. They work on lists up to 100000 element lists of strings, but on 10^6 size lists I lose patience waiting for them to finish.
Is there a more scientific way of figuring out if one version is better than the other by using, say profiling tools?
Or by reasoning about the code?
It can be reasoned. Some people know how to do it. No one has written up the method and theory properly. It is currently rather ad hoc. I want to write one in the future. Some of the knowledge is in: http://www.haskell.org/haskellwiki/Stack_overflow http://en.wikibooks.org/wiki/Haskell (Advanced Track, Haskell Performance) Richard Bird's "Introduction to Functional Programming using Haskell", second edition (chapter 7 "Efficiency", but also other chapters contain efficiency discussions) anything that adequately defines lazy evaluation (or whatever evaluation your favourite executor seems to use)
On 10/17/07, Thomas Hartman
Is there a more scientific way of figuring out if one version is better than the other by using, say profiling tools?
Profiling Haskell programs is black magic, but of the sort you learn by having a problem to solve. I don't think it requires special genius, just enough motivation. Profiling the interpreter my team created during the ICFP contest did that to me. The GHC heap profiler looked weird to me at first because you are required to convert its output to a PostScript file. However, it's well worth doing. There are several types of profiles, but you would probably be most interested in the "biographical" profile. The GHC documentation is pretty good and the article "Heap Profiling for Space Efficiency"[1] may also help. The article was written for the nhc compiler but the tools look the same. Performance profiling is easier - it's just dumped as text output when your program runs. GHC's documentation is really good here. One thing I've learned is to look for two things: 1) Functions that do allocations and execute many times 2) Functions that run lots of times #2 is pretty much universal for profiling, but #1 is unique to Haskell (and probably any pure functional language). Sadly none of these technique work for stack overflows. Or, more likely, I haven't learned how to spot them. Luckily the culprit is usually a fold that isn't strict enough. Albert's post about the Bird book is a good pointer. I just read that chapter myself last night, and he gives a very clear explanation of how lazy evaluation works (he calls it 'outermost reduction') and how strictness interacts with laziness. Hope that helps! Justin [1] http://citeseer.ist.psu.edu/runciman96heap.html
participants (12)
-
Albert Y. C. Lai -
Bertram Felgenhauer -
Big_Ham -
Chad Scherrer -
Dan Weston -
Dougal Stanton -
Justin Bailey -
Peter Verswyvelen -
Stefan Holdermans -
Stuart Cook -
Thomas Hartman -
Yitzchak Gale