
Hi, I'd like to build a histogram, or even just a frequency count of some categorical data from a large database. In a language such as perl I'd do something like this: my %freq; # hash while (my $item = get_next_from_db()){ $freq{$item}++; } and then sum the hash values and divide the value of each key by the sum to get the histogram. Is there an easy way to do the same thing in Haskell? It looked like an easy task but I seem to be having a lot of trouble getting this to work properly, as it doesn't seem to be behaving very lazily. I'm guessing I should be doing something with the State monad, but I'm not very good at using that yet. Ian

It's likely the while loop can be replaced with a fold which has a Map (or IntMap if the key is an Int) as the accumulator. For a large Map you will need to pay attention to inserting elements strictly... The more complicated bit is without any extra knowledge about the implementation of the "database" `get_next_from_db()` is currently magic.

Hi Ian, In case you were looking for an example to get your teeth into you might be interested in these: https://gist.github.com/2876666 These two scripts both serve the same purpose of building a map of word counts from a text file. They both use Data.Text for Unicode IO, but each tests a different structure. Though unordered-containers package with its Data.HashMap is often suggested as an efficient mapping structure, in my case (of these two scripts) the Data.HashTable from standard library wins taking circa half the time to run on the same dataset (though it's not purely functional as its actions operate in the IO monad). Finally, what puzzles me the most, is that a roughly equivalent script in Python which just reads the same datafile into a standard dict performs in about 1/3 of the time of the faster one of the above two and Python's hardly a fast language... Bewildering, indeed. Hope I didn't put you off :)

On 5 June 2012 19:41, Radosław Szymczyszyn
Finally, what puzzles me the most, is that a roughly equivalent script in Python which just reads the same datafile into a standard dict performs in about 1/3 of the time of the faster one of the above two and Python's hardly a fast language... Bewildering, indeed.
Dicts are efficient in Python though (as they are efficiently implemented in C). Python often seems to beat Haskell in micro-benchmarks that just fill a dictionary then run a simple query / summation on it.

Hi Stephen,
I see now I didn't explain myself especially well. One way to build a
frequency count in haskell comes from LYAHFGG, chapter 7:
ghci> map (\l@(x:xs) -> (x,length l)) . group . sort $
[1,1,1,1,2,2,2,2,3,3,2,2,2,5,6,7]
[(1,4),(2,7),(3,2),(5,1),(6,1),(7,1)]
my guess would be that the problem with this, unless haskell is a lot
more magical than I thought, is that it requires all the entries to be
held in memory to do the sort (and then the group I suppose). The
database has hundreds of millions of entries.
My previous example in perl builds results cumulatively as it
retrieves each entry from the db - get_next_from_db() is basically an
iterator that returns undef once it runs out of entries (using DBI,
which is an SQL wrapper very similar to HDBC. I only wrote it that way
to focus on the real problem.) I could also do this in haskell with
something like accumArray but it would require a pass through to know
what all the possible categorical values are first. Also, if I
understand correctly this would also be quite slow.
What I'm really looking for is suggestions about the best or idiomatic
"haskell" way to cumulatively build these results as it goes along, in
a reasonable amount of time. For instance, my knowledge of updatable
data structures in haskell isn't very good. Is there something obvious
or well-known that I'm overlooking? In general, I'm trying to move my
haskell knowledge from toy problem level to being useful for real
world problems.
Ian
On Tue, Jun 5, 2012 at 7:54 AM, Stephen Tetley
It's likely the while loop can be replaced with a fold which has a Map (or IntMap if the key is an Int) as the accumulator. For a large Map you will need to pay attention to inserting elements strictly...
The more complicated bit is without any extra knowledge about the implementation of the "database" `get_next_from_db()` is currently magic.

Hi Ian You really don't want to use a sort as it needs the whole list in memory (or length as that counts its way through a list...). If your data is integers and sparse - IntMap (provided you don't overflow Int size) or the HashMaps suggested by Radoslaw should work. If the data isn't sparse there are mutable arrays or (probably) Data.Vector - I haven't got round to using the latter yet so can't really comment on its suitability with respect to mutability. If your data isn't integers you need to work out how to make a good key from it so you can map key to count. You can "fold" a Map or mutable array through a traversal of the data - lists provide the usual fold (foldr) but if you have data stored in a database there should be libraries for folding with an "iteraratee" - again I haven't got round to using any of the iteratee packages yet so couldn't comment on which one is the most appropriate. Best wishes Stephen

Ian Knopke
Hi,
I'd like to build a histogram, or even just a frequency count of some categorical data from a large database. In a language such as perl I'd do something like this:
my %freq; # hash while (my $item = get_next_from_db()){ $freq{$item}++; }
and then sum the hash values and divide the value of each key by the sum to get the histogram.
Is there an easy way to do the same thing in Haskell? It looked like an easy task but I seem to be having a lot of trouble getting this to work properly, as it doesn't seem to be behaving very lazily. I'm guessing I should be doing something with the State monad, but I'm not very good at using that yet.
I have a package[1] which I'll eventually put up on Hackage for plotting histograms with Chart[2]. The underlying histogram implementation (for homogenous bin widths only) is intended for use with dense data and internally keeps its accumulator in a mutable vector in the ST monad. Feel free to borrow the code (chart-histogram/Numeric/Histogram.hs). Cheers, - Ben [1] https://github.com/bgamari/chart-histogram [2] http://hackage.haskell.org/package/Chart
participants (4)
-
Ben Gamari
-
Ian Knopke
-
Radosław Szymczyszyn
-
Stephen Tetley