data analysis question

Hi,
just the other day I talked to a friend of mine who works for an online radio service. He told me he is currently looking into how best to work with assorted usage data: currently 250 million entries in a 12 GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on. He came across the K and Q programming languages [1][2], which apparently work nicely for this, as unfamiliar as they might seem.
This certainly is not my area of expertise at all. I was just wondering how some of you would suggest approaching this with Haskell. How would you most efficiently parse such data and evaluate custom queries?
Thanks for your time, Tobi
[1] http://en.wikipedia.org/wiki/K_(programming_language)
[2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)

It's hard to answer without knowing what kinds of queries he's doing, but
in the past, I've used csv-conduit to parse the raw data, convert the data
to some Haskell ADT, and then used standard conduit processing to perform
analyses in a streaming manner.
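A minimal sketch of that shape in modern conduit (runConduitRes, sourceFile and the combinators re-exported by the Conduit module), with a deliberately naive comma split standing in for a real CSV parser; LogEntry, parseEntry and totalDuration are invented names for illustration, not anything from the thread:

    import Conduit
    import qualified Data.ByteString.Char8 as B8

    -- Hypothetical record for one usage entry; the real fields depend on the CSV.
    data LogEntry = LogEntry
      { leChannel  :: B8.ByteString
      , leDuration :: Int
      }

    -- Naive field splitter: splits on commas only and ignores quoting.
    -- A real parser (cassava, csv-conduit, pipes-csv) would replace this.
    parseEntry :: B8.ByteString -> Maybe LogEntry
    parseEntry line =
      case B8.split ',' line of
        (_timestamp : channel : duration : _rest) ->
          LogEntry channel . fst <$> B8.readInt duration
        _ -> Nothing

    -- Stream the file line by line, convert each line to the ADT, and fold,
    -- so only a constant amount of the 12 GB is in memory at any time.
    totalDuration :: FilePath -> IO Int
    totalDuration path = runConduitRes $
         sourceFile path
      .| linesUnboundedAsciiC
      .| concatMapC parseEntry
      .| foldlC (\acc e -> acc + leDuration e) 0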
On Wed Nov 12 2014 at 11:45:36 AM Tobias Pflug
Hi,
just the other day I talked to a friend of mine who works for an online radio service. He told me he is currently looking into how best to work with assorted usage data: currently 250 million entries in a 12 GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on.
He came across the K and Q programming languages [1][2], which apparently work nicely for this, as unfamiliar as they might seem.
This certainly is not my area of expertise at all. I was just wondering how some of you would suggest approaching this with Haskell. How would you most efficiently parse such data and evaluate custom queries?
Thanks for your time, Tobi
[1] http://en.wikipedia.org/wiki/K_(programming_language) [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)

Hi Tobias,
A friend [is] currently looking into how best to work with assorted usage data: currently 250 million entries in a 12 GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on.
as much as I love Haskell, the tool of choice for data analysis is GNU R, not so much because of the language, but simply because of the vast array of high-quality libraries that cover topics like statistics, machine learning, visualization, etc. You'll find it at http://www.r-project.org/. If you want to analyze 12 GB of data in Haskell, you'd have to jump through all kinds of hoops just to load that CSV file into memory. It's possible, no doubt, but pulling it off efficiently requires a lot of expertise in Haskell that statistics guys don't necessarily have (and arguably they shouldn't have to). The package Rlang-QQ integrates R into Haskell, which might be a nice way to deal with this task, but I have no personal experience with that library, so I'm not sure whether it adds much value. Just my 2 cents, Peter

On 12/11/14 05:21, Peter Simons wrote:
If you want to analyze 12 GB of data in Haskell, you'd have to jump through all kinds of hoops just to load that CSV file into memory. It's possible, no doubt, but pulling it off efficiently requires a lot of expertise in Haskell that statistics guys don't necessarily have (and arguably they shouldn't have to).
Well, with Haskell you don't have to load the whole data set into memory, as Michael shows. With R, on the other hand, you do. Besides, if you're not an R expert, and if the analysis you want to do is not readily available, it may be quite a pain to implement in R. As a simple example, I still don't know an acceptable way to write something like zipWith f (tail vec) vec in R. Roman
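For reference, the Haskell expression Roman is referring to, written out over Data.Vector (diffs is just a placeholder name, and it assumes a non-empty vector):

    import qualified Data.Vector as V

    -- Combine each element with its successor, e.g. successive differences.
    diffs :: Num a => V.Vector a -> V.Vector a
    diffs vec = V.zipWith (-) (V.tail vec) vec

    -- diffs (V.fromList [1,2,3,5,8]) == V.fromList [1,1,2,3]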

On 12.11.2014 12:56, Roman Cheplyaka wrote:
On 12/11/14 05:21, Peter Simons wrote:
If you want to analyze 12 GB of data in Haskell, you'd have to jump through all kinds of hoops just to load that CSV file into memory. It's possible, no doubt, but pulling it off efficiently requires a lot of expertise in Haskell that statistics guys don't necessarily have (and arguably they shouldn't have to). Well, with Haskell you don't have to load the whole data set into memory, as Michael shows. With R, on the other hand, you do.
That is exactly the thing that came to my mind when thinking about R. I haven't actually used R myself, but based on what I know and what some googling revealed, all analysis would have to happen in memory. PS: I could be wrong of course ;)

Hi Roman,
With Haskell you don't have to load the whole data set into memory, as Michael shows. With R, on the other hand, you do.
Can you please point me to a reference to back that claim up? I'll offer [1] and [2] as a pretty good indications that you may not be entirely right about this.
Besides, if you're not an R expert, and if the analysis you want to do is not readily available, it may be quite a pain to implement in R.
Actually, implementing sophisticated queries in R is quite easy because the language was specifically designed for that kind of thing. If you have experience in neither R nor Haskell, then learning R is *far* easier than learning Haskell, because it doesn't aim to be a powerful general-purpose programming language. It aims to be a powerful language for data analysis. Now, one *could* write a DSL in Haskell, of course, that matches R's features and accomplishes data analysis tasks in a similarly convenient syntax, etc. But unfortunately no such library exists, and writing one is not a trivial task.
I still don't know an acceptable way to write something like zipWith f (tail vec) vec in R.
Why would that be any trouble? What kind of solutions did you find and in what way were they unacceptable? Best regards, Peter [1] http://cran.r-project.org/web/packages/ff/index.html [2] http://cran.r-project.org/web/packages/bigmemory/index.html

On 12/11/14 09:21, Peter Simons wrote:
Hi Roman,
With Haskell you don't have to load the whole data set into memory, as Michael shows. With R, on the other hand, you do.
Can you please point me to a reference to back that claim up?
I'll offer [1] and [2] as a pretty good indications that you may not be entirely right about this.
Ah, great then. My impression was formed after listening to this FLOSS weekly episode: http://twit.tv/show/floss-weekly/306 (starting from 33:55).
Besides, if you're not an R expert, and if the analysis you want to do is not readily available, it may be quite a pain to implement in R.
Actually, implementing sophisticated queries in R is quite easy because the language was specifically designed for that kind of thing. If you have experience in neither R nor Haskell, then learning R is *far* easier than learning Haskell, because it doesn't aim to be a powerful general-purpose programming language. It aims to be a powerful language for data analysis.
That doesn't match my experience. Maybe it's just me and my unwillingness to write C-like code that traverses arrays by index (I know most scientists don't have a problem with that), but I found it hard to express data transformations and queries functionally in R.
I still don't know an acceptable way to write something like zipWith f (tail vec) vec in R.
Why would that be any trouble? What kind of solutions did you find and in what way were they unacceptable?
This was a while ago, and I don't remember what solution I picked up eventually. Of course I could just write a for-loop to populate an array, but I hadn't found anything that matches the simplicity and clarity of the line above. How would you write it in R? Roman

My experience with R is that, while worlds more powerful than the dominant commercial alternatives (Stata, SAS), it was unintuitive relative to other general-purpose languages like Python. I wonder/speculate whether it was distorted by the pull of its statistical applications away from what would be more natural.
On Wed, Nov 12, 2014 at 6:22 PM, Roman Cheplyaka
On 12/11/14 09:21, Peter Simons wrote:
Hi Roman,
With Haskell you don't have to load the whole data set into memory, as Michael shows. With R, on the other hand, you do.
Can you please point me to a reference to back that claim up?
I'll offer [1] and [2] as a pretty good indications that you may not be entirely right about this.
Ah, great then.
My impression was formed after listening to this FLOSS weekly episode: http://twit.tv/show/floss-weekly/306 (starting from 33:55).
Besides, if you're not an R expert, and if the analysis you want to do is not readily available, it may be quite a pain to implement in R.
Actually, implementing sophisticated queries in R is quite easy because the language was specifically designed for that kind of thing. If you have experience in neither R nor Haskell, then learning R is *far* easier than learning Haskell, because it doesn't aim to be a powerful general-purpose programming language. It aims to be a powerful language for data analysis.
That doesn't match my experience. Maybe it's just me and my unwillingness to write C-like code that traverses arrays by index (I know most scientists don't have a problem with that), but I found it hard to express data transformations and queries functionally in R.
I still don't know an acceptable way to write something like zipWith f (tail vec) vec in R.
Why would that be any trouble? What kind of solutions did you find and in what way were they unacceptable?
This was a while ago, and I don't remember what solution I picked up eventually. Of course I could just write a for-loop to populate an array, but I hadn't found anything that matches the simplicity and clarity of the line above. How would you write it in R?
Roman

On Wed, Nov 12, 2014 at 9:42 PM, Jeffrey Brown
My experience with R is that, while worlds more powerful than the dominant commercial alternatives (Stata, SAS), it was unintuitive relative to other general-purpose languages like Python. I wonder/speculate whether it was distorted by the pull of its statistical applications away from what would be more natural.
It is an open source implementation of S (http://en.wikipedia.org/wiki/S_(programming_language)), which was developed specifically for statistical applications. I would wonder how much of *that* was shaped by Fortran statistical packages....
-- brandon s allbery kf8nh

On 13/11/2014, at 3:52 pm, Brandon Allbery
It is an open source implementation of S ( http://en.wikipedia.org/wiki/S_(programming_language) ) which was developed specifically for statistical applications. I would wonder how much of *that* was shaped by Fortran statistical packages….
The prehistoric version of S *was* a Fortran statistical package. While the inventors of S were familiar with GLIM, GENSTAT, SPSS, SAS, BMDP, MINITAB, &c., they _were_ at Bell Labs, and so the language looks a lot like C. Indeed, several aspects of S were shaped by UNIX, in particular the way S (but not R) treats the current directory as an “outer block”. Many (even new) R packages are wrappers around Fortran code. However, that has had almost no influence on the language itself. In particular:

- arrays are immutable:

    (v <- 1:5)
    w <- v
    w[3] <- 33
    w
    [1] 1 2 33 4 5
    v
    [1] 1 2 3 4 5

- functions are first class values and higher order functions are commonplace
- function arguments are evaluated lazily
- good style does *NOT* “traverse arrays by indexes” but operates on whole arrays in APL/Fortran 90 style. For example, you do not do

    for (i in 1:m) for (j in 1:n) r[i,j] <- f(v[i], w[j])

but

    r <- outer(v, w, f)

If you _do_ “express data transformations and queries functionally in R” -- which I repeat is native good style -- it will perform well; if you “traverse arrays by indexes” you will wish you hadn’t. This is not something that Fortran 66 or Fortran 77 would have taught anyone. Let me put it this way: R is about as close to a functional language as you can get without actually being one. (The implementors of R consciously adopted implementation techniques from Scheme.)

On 13/11/2014, at 3:21 am, Peter Simons
Hi Roman,
With Haskell you don't have to load the whole data set into memory, as Michael shows. With R, on the other hand, you do.
Can you please point me to a reference to back that claim up?
I'll offer [1] and [2] as a pretty good indications that you may not be entirely right about this.
It is *possible* to handle large data sets with R, but it is *usual* to deal with things in memory.
Besides, if you're not an R expert, and if the analysis you want to do is not readily available, it may be quite a pain to implement in R.
A heck of a lot of code in R has been developed by people who think of themselves as statisticians/financial analysts/whatever rather than programmers or “R experts”. There is much to dislike about R (C-like syntax, the ‘interactive if’ trap, the clash of naming styles), but it has to be said that R is a very good fit for the data analysis problems S was designed for, and I personally would find it *far* easier to develop such a solution in R than in Haskell. (For other problems, of course, it would be the other way around.) Not only does R already have a stupefying number of packages offering all sorts of analyses, so that it’s quite hard to find something that you *have* to implement, there is an extremely active mailing list with searchable archives, full of wizards keen to help. If you *did* have to implement something, you wouldn’t be on your own.

The specific case of ‘zipWith f (tail vec) vec’ is easy:
(1) vec[-1] is vec without its first element; vec[-length(vec)] is vec without its last element.
(2) cbind(vec[-1], vec[-length(vec)]) is an array with 2 columns.
(3) apply(cbind(vec[-1], vec[-length(vec)]), 1, f) applies f to the rows of that matrix.
If f returns one number, the answer is a vector; if f returns a row, the answer is a matrix. Example:
    vec <- c(1,2,3,4,5)
    mat <- cbind(vec[-1], vec[-length(vec)])
    apply(mat, 1, sum)
    [1] 3 5 7 9

In this case, you could just do vec[-1] + vec[-length(vec)] and get the same answer.
Oddly enough, one of the tricks for success in R is, like Haskell, to learn your way around the higher-order functions in the library.

Hi Tobias,
What he could do is encode the column values as appropriately sized Words to reduce the size -- to make it fit in RAM. E.g. listening times as seconds, browsers as categorical variables (in statistics terms), etc. If some of the columns are arbitrary-length strings, then it seems possible to get the 12 GB down by more than half.
If he doesn't know Haskell, then I'd suggest using another language. (Years ago I tried to do a bigger uni project in Haskell -- being a noob -- and failed miserably.)
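A small sketch of the encoding idea above, using only containers; the Encoder type and its field names are invented for illustration. Interning each categorical value (browser, country, ...) as a small fixed-width code is what lets rows shrink from strings to a handful of Words:

    import qualified Data.Map.Strict as M
    import Data.Text (Text)
    import Data.Word (Word16)

    -- Maps each distinct categorical value to a small numeric code.
    data Encoder = Encoder
      { encTable :: M.Map Text Word16   -- value -> code
      , encNext  :: Word16              -- next unused code
      }

    emptyEncoder :: Encoder
    emptyEncoder = Encoder M.empty 0

    -- Look up a value's code, allocating a fresh one on first sight.
    encodeValue :: Text -> Encoder -> (Word16, Encoder)
    encodeValue v enc@(Encoder table next) =
      case M.lookup v table of
        Just code -> (code, enc)
        Nothing   -> (next, Encoder (M.insert v next table) (next + 1))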
On Nov 12, 2014 10:45 AM, "Tobias Pflug"
Hi,
just the other day I talked to a friend of mine who works for an online radio service. He told me he is currently looking into how best to work with assorted usage data: currently 250 million entries in a 12 GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on.
He came across the K and Q programming languages [1][2], which apparently work nicely for this, as unfamiliar as they might seem.
This certainly is not my area of expertise at all. I was just wondering how some of you would suggest approaching this with Haskell. How would you most efficiently parse such data and evaluate custom queries?
Thanks for your time, Tobi
[1] http://en.wikipedia.org/wiki/K_(programming_language) [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)

I'm working on a Haskell article for https://howistart.org/ which is
actually about the rudiments of processing CSV data in Haskell.
To that end, take a look at my rather messy workspace here:
https://github.com/bitemyapp/csvtest
And my in-progress article here:
https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md
(please don't post this anywhere, incomplete!)
And here I'll link my notes on profiling memory use with different
streaming abstractions:
https://twitter.com/bitemyapp/status/531617919181258752
csv-conduit isn't in the test results because I couldn't figure out how to
use it. pipes-csv is proper streaming, but uses cassava's parsing machinery
and data types. Possibly this is a problem if you have really wide rows but
I've never seen anything that would be problematic in that realm even when
I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're
streaming rows, but not columns. With csv-conduit you might be able to
incrementally process the columns too based on my guess from glancing at
the rather scary code.
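For comparison, roughly what row-at-a-time streaming over cassava's own Data.Csv.Streaming module looks like (its Records type is Foldable, which comes up again below); the Row type, the column indices and totalDuration are placeholders, not anything from the article:

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Csv as Csv
    import qualified Data.Csv.Streaming as S
    import Data.Foldable (foldl')

    -- Placeholder row type; FromRecord describes how to parse one CSV record.
    data Row = Row
      { rowChannel  :: !Csv.Field
      , rowDuration :: !Int
      }

    instance Csv.FromRecord Row where
      parseRecord v = Row <$> v Csv..! 1 <*> v Csv..! 2

    -- Records is consumed incrementally as the lazy ByteString is forced,
    -- so the whole file never has to be resident at once.
    totalDuration :: FilePath -> IO Int
    totalDuration path = do
      bytes <- BL.readFile path
      let rows = S.decode Csv.NoHeader bytes :: S.Records Row
      pure (foldl' (\acc r -> acc + rowDuration r) 0 rows)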
Let me know if you have any further questions.
Cheers all.
--- Chris Allen
On Wed, Nov 12, 2014 at 4:17 PM, Markus Läll
Hi Tobias,
What he could do is encode the column values as appropriately sized Words to reduce the size -- to make it fit in RAM. E.g. listening times as seconds, browsers as categorical variables (in statistics terms), etc. If some of the columns are arbitrary-length strings, then it seems possible to get the 12 GB down by more than half.
If he doesn't know Haskell, then I'd suggest using another language. (Years ago I tried to do a bigger uni project in Haskell-- being a noob --and failed miserably.) On Nov 12, 2014 10:45 AM, "Tobias Pflug"
wrote: Hi,
just the other day I talked to a friend of mine who works for an online radio service. He told me he is currently looking into how best to work with assorted usage data: currently 250 million entries in a 12 GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on.
He came across the K and Q programming languages [1][2], which apparently work nicely for this, as unfamiliar as they might seem.
This certainly is not my area of expertise at all. I was just wondering how some of you would suggest approaching this with Haskell. How would you most efficiently parse such data and evaluate custom queries?
Thanks for your time, Tobi
[1] http://en.wikipedia.org/wiki/K_(programming_language) [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)

On Wed, Nov 12 2014, Christopher Allen
[Snip] csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
Any problems in particular? I've had pretty good luck with csv-conduit. However, I have noticed that it's rather picky about type signatures, and integrating custom data types isn't straightforward at first. csv-conduit also seems to have drawn inspiration from cassava: http://hackage.haskell.org/package/csv-conduit-0.6.3/docs/Data-CSV-Conduit-C...
[Snip] To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest
I've made a PR for the conduit version: https://github.com/bitemyapp/csvtest/pull/1 It could certainly be made more performant, but it seems to hold up well in comparison. I would be interested in reading the How I Start article and hearing more about your conclusions. Is this focused primarily on the memory profile or also speed? Regards, -Christopher

Memory profiling only, to test how stream-y the streaming was. I didn't think perf would be that different between them. The way I had to transform my fold for Pipes was a titch awkward, otherwise happy with it.

If people are that interested in the perf side of things I can set up a criterion harness and publish those numbers as well.

Mostly I was impressed with:

1. How easy it was to start using the streaming module in Cassava because it's just a Foldable instance.
2. How Pipes used <600kb of memory.

Your pull request for csv-conduit looks really clean and nice. I've merged it, thanks for sending it my way!

--- Chris Allen

On Thu, Nov 13, 2014 at 12:26 AM, Christopher Reichert <creichert07@gmail.com> wrote:
On Wed, Nov 12 2014, Christopher Allen
wrote: [Snip] csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
Any problems in particular? I've had pretty good luck with csv-conduit. However, I have noticed that it's rather picky about type signatures, and integrating custom data types isn't straightforward at first.
csv-conduit also seems to have drawn inspiration from cassava:
http://hackage.haskell.org/package/csv-conduit-0.6.3/docs/Data-CSV-Conduit-C...
[Snip] To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest
I've made a PR for the conduit version: https://github.com/bitemyapp/csvtest/pull/1
It could certainly be made more performant, but it seems to hold up well in comparison. I would be interested in reading the How I Start article and hearing more about your conclusions. Is this focused primarily on the memory profile or also speed?
Regards, -Christopher

Somewhat off topic, but: I said csv-conduit because I have some experience
with it. When we were doing some analytic work at FP Complete, a few of us
analyzed both csv-conduit and cassava, and didn't really have a good feel
for which was the better library. We went with csv-conduit[1], but I'd be
really interested in hearing a comparison of the two libraries from someone
who knows about them.
[1] Don't ask what tipped us in that direction, I honestly don't remember
what it was.
On Thu Nov 13 2014 at 9:24:47 AM Christopher Allen
Memory profiling only to test how stream-y the streaming was. I didn't think perf would be that different between them. The way I had to transform my fold for Pipes was a titch awkward, otherwise happy with it.
If people are that interested in the perf side of things I can set up a criterion harness and publish those numbers as well.
Mostly I was impressed with:
1. How easy it was to start using the streaming module in Cassava because it's just a Foldable instance.
2. How Pipes used <600kb of memory.
Your pull request for csv-conduit looks really clean and nice. I've merged it, thanks for sending it my way!
--- Chris Allen
On Thu, Nov 13, 2014 at 12:26 AM, Christopher Reichert <creichert07@gmail.com> wrote:
On Wed, Nov 12 2014, Christopher Allen
wrote: [Snip] csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
Any problems in particular? I've had pretty good luck with csv-conduit. However, I have noticed that it's rather picky about type signatures, and integrating custom data types isn't straightforward at first.
csv-conduit also seems to have drawn inspiration from cassava:
http://hackage.haskell.org/package/csv-conduit-0.6.3/docs/Data-CSV-Conduit-C...
[Snip] To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest
I've made a PR for the conduit version: https://github.com/bitemyapp/csvtest/pull/1
It could certainly be made more performant, but it seems to hold up well in comparison. I would be interested in reading the How I Start article and hearing more about your conclusions. Is this focused primarily on the memory profile or also speed?
Regards, -Christopher

On 13.11.2014 02:22, Christopher Allen wrote:
I'm working on a Haskell article for https://howistart.org/ which is actually about the rudiments of processing CSV data in Haskell.
To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest
And my in-progress article here: https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md (please don't post this anywhere, incomplete!)
And here I'll link my notes on profiling memory use with different streaming abstractions: https://twitter.com/bitemyapp/status/531617919181258752
csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
Let me know if you have any further questions.
Cheers all.
--- Chris Allen
Thank you, this looks rather useful. I will have a closer look at it for sure. Surprised that csv-conduit was so troublesome; I was in fact expecting/hoping for the opposite. I will just give it a try.

Thanks also to everyone else who replied. Let me add some tidbits to refine the problem space a bit. As I said before, the size of the data is around 12 GB of CSV files, one file per month, with each line representing a user tuning in to a stream:

[date-time-stamp], [radio-stream-name], [duration], [mobile|desktop], [country], [areaCode]

which could be represented as:

data RadioStat = RadioStat
  { rStart    :: Integer -- POSIX time stamp
  , rStation  :: Integer -- index to station map
  , rDuration :: Integer -- duration in seconds
  , rAgent    :: Integer -- index to agent map ("mobile", "desktop", ..)
  , rCountry  :: Integer -- index to country map ("DE", "CH", ..)
  , rArea     :: Integer -- German geo location info
  }

I guess parsing a CSV file into a [RadioStat] list, with respective entries in a HashMap for the station names, should work just fine (thanks again for your linked material, Chris).

While this is straightforward, the type of queries I got as examples might indicate that I should not try to reinvent a query language but look for something else (?). Examples would be:

- summarize per day: total listening duration, average listening duration, number of listening actions
- summarize per day per agent: total listening duration, average listening duration, number of listening actions

I don't think MySQL would perform all that well operating on a table with 125 million entries ;] What approach would you guys take?

Thanks for your input and sorry for the broad scope of these questions.

best wishes,
Tobi
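A minimal sketch of the first example query (per-day totals, averages and counts), assuming the RadioStat record above and an already-parsed list of rows; DayStats, dayOf and summarizePerDay are hypothetical names. The per-day-per-agent variant is the same fold keyed on (dayOf r, rAgent r) instead:

    import qualified Data.Map.Strict as M
    import Data.List (foldl')

    -- Accumulator per day: total listening time in seconds and number of rows.
    data DayStats = DayStats
      { dsTotal :: !Integer
      , dsCount :: !Integer
      }

    -- Bucket by UTC day, treating rStart as seconds since the epoch.
    dayOf :: RadioStat -> Integer
    dayOf r = rStart r `div` 86400

    summarizePerDay :: [RadioStat] -> M.Map Integer DayStats
    summarizePerDay = foldl' step M.empty
      where
        step acc r = M.insertWith merge (dayOf r) (DayStats (rDuration r) 1) acc
        merge (DayStats t c) (DayStats t' c') = DayStats (t + t') (c + c')

    avgDuration :: DayStats -> Double
    avgDuration (DayStats t c)
      | c == 0    = 0
      | otherwise = fromIntegral t / fromIntegral c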

I wouldn't hold it against csv-conduit too much; conduit and Pipes both take some getting used to, and I hadn't used either in anger before I started kicking around the CSV parsing stuff. I was a bit spoiled by how easy Cassava was to use as well. Thanks to Christopher Reichert's PR, there is an example for csv-conduit as well, so you've now got four ways to try processing CSV, *three* of which are streaming :) I'd say just try each in turn and see what you're happy with, if you're not married to a particular streaming abstraction.
I don't think MySQL would perform all that well operating on a table with 125 million entries ;] What approach would you guys take ?
Big enough machine with enough memory and it's fine. I used to keep a job
queue with a billion rows on MySQL at a gig long ago. Could do it with
PostgreSQL pretty easily too. On your personal work machine? I dunno.
Not trying to steer you away from using Haskell here by any means, but if
you can process your data in a SQL database efficiently, that's often
pretty optimal in terms of speed and ease of use until you start doing more
sophisticated analysis. I don't have a lot of experience in data analysis
but I knew people who did some preliminary slicing/dicing in SQL before moving on to building a custom model for understanding the data.
Cheers,
Chris Allen
On Thu, Nov 13, 2014 at 3:37 AM, Tobias Pflug
On 13.11.2014 02:22, Christopher Allen wrote:
I'm working on a Haskell article for https://howistart.org/ which is actually about the rudiments of processing CSV data in Haskell.
To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest
And my in-progress article here: https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md (please don't post this anywhere, incomplete!)
And here I'll link my notes on profiling memory use with different streaming abstractions: https://twitter.com/bitemyapp/status/531617919181258752
csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
Let me know if you have any further questions.
Cheers all.
--- Chris Allen
Thank you, this looks rather useful. I will have a closer look at it for sure. Surprised that csv-conduit was so troublesome. I was in fact expecting/hoping for the opposite. I will just give it a try.
Thanks also to everyone else who replied. Let me add some tidbits to refine the problem space a bit. As I said before the size of the data is around 12GB of csv files. One file per month with each line representing a user tuning in to a stream:
[date-time-stamp], [radio-stream-name], [duration], [mobile|desktop], [country], [areaCode]
which could be represented as:
data RadioStat = RadioStat
  { rStart    :: Integer -- POSIX time stamp
  , rStation  :: Integer -- index to station map
  , rDuration :: Integer -- duration in seconds
  , rAgent    :: Integer -- index to agent map ("mobile", "desktop", ..)
  , rCountry  :: Integer -- index to country map ("DE", "CH", ..)
  , rArea     :: Integer -- German geo location info
  }
I guess parsing a CSV file into a [RadioStat] list, with respective entries in a HashMap for the station names, should work just fine (thanks again for your linked material, Chris).
While this is straightforward, the type of queries I got as examples might indicate that I should not try to reinvent a query language but look for something else (?). Examples would be:
- summarize per day: total listening duration, average listening duration, number of listening actions
- summarize per day per agent: total listening duration, average listening duration, number of listening actions
I don't think MySQL would perform all that well operating on a table with 125 million entries ;] What approach would you guys take ?
Thanks for your input and sorry for the broad scope of these questions. best wishes, Tobi

Big enough machine with enough memory and it's fine. I used to keep a job queue with a billion rows on MySQL at a gig long ago. Could do it with PostgreSQL pretty easily too. On your personal work machine? I dunno.
Not trying to steer you away from using Haskell here by any means, but if you can process your data in a SQL database efficiently, that's often pretty optimal in terms of speed and ease of use until you start doing more sophisticated analysis. I don't have a lot of experience in data analysis but I knew people who did some preliminary slicing/dicing in SQL before moving on to building a custom model for understanding the data.
I guess I was just curious what a sensible approach using Haskell would look like, and I'll play around with what I know now. If this were for work I'd just put it in a database with enough horsepower, but it's just my curiosity in my spare time, alas. Thank you for your input.

Is there a mailing list for statistics/analytics/simulation/numerical
analysis/etc. using Haskell? If not, I propose we start one. (Not to
take away from general discussion, but to provide a forum to hash out
these issues among the primary user base).
-M
On Thu, Nov 13, 2014 at 12:46 PM, Tobias Pflug
Big enough machine with enough memory and it's fine. I used to keep a job queue with a billion rows on MySQL at a gig long ago. Could do it with PostgreSQL pretty easily too. On your personal work machine? I dunno.
Not trying to steer you away from using Haskell here by any means, but if you can process your data in a SQL database efficiently, that's often pretty optimal in terms of speed and ease of use until you start doing more sophisticated analysis. I don't have a lot of experience in data analysis but I knew people who did some preliminary slicing/dicing in SQL before moving on to building a custom model for understanding the data.
I guess I was just curious what a sensible approach using Haskell would look like, and I'll play around with what I know now. If this were for work I'd just put it in a database with enough horsepower, but it's just my curiosity in my spare time, alas..
thank you for your input.

There is #numerical-Haskell on Freenode and an NLP mailing list, I believe.
On Nov 13, 2014, at 9:40 PM, Mark Fredrickson
wrote: Is there a mailing list for statistics/analytics/simulation/numerical analysis/etc. using Haskell? If not, I propose we start one. (Not to take away from general discussion, but to provide a forum to hash out these issues among the primary user base).
-M
On Thu, Nov 13, 2014 at 12:46 PM, Tobias Pflug
wrote: Big enough machine with enough memory and it's fine. I used to keep a job queue with a billion rows on MySQL at a gig long ago. Could do it with PostgreSQL pretty easily too. On your personal work machine? I dunno.
Not trying to steer you away from using Haskell here by any means, but if you can process your data in a SQL database efficiently, that's often pretty optimal in terms of speed and ease of use until you start doing more sophisticated analysis. I don't have a lot of experience in data analysis but I knew people who did some preliminary slicing/dicing in SQL before moving on to building a custom model for understanding the data.
I guess I was just curious what a sensible approach using Haskell would look like, and I'll play around with what I know now. If this were for work I'd just put it in a database with enough horsepower, but it's just my curiosity in my spare time, alas..
thank you for your input.

Mark Fredrickson
Is there a mailing list for statistics/analytics/simulation/numerical analysis/etc. using Haskell? If not, I propose we start one. (Not to take away from general discussion, but to provide a forum to hash out these issues among the primary user base).
Sadly not, but I think there are sufficient numbers of people interested in this subject that it is probably worth setting one up. I really don't like the Google Groups experience, but maybe that is the best place to start?

On 13.11.2014 10:37, Tobias Pflug wrote:
data RadioStat = RadioStat
  { rStart    :: Integer -- POSIX time stamp
  , rStation  :: Integer -- index to station map
  , rDuration :: Integer -- duration in seconds
  , rAgent    :: Integer -- index to agent map ("mobile", "desktop", ..)
  , rCountry  :: Integer -- index to country map ("DE", "CH", ..)
  , rArea     :: Integer -- German geo location info
  }
Could you show a sample record or two? It will be an interesting case to calculate how many bits of information there are vs. how many bits Haskell will need. -- Wojtek
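A rough back-of-the-envelope along those lines, assuming the six fields can be narrowed to machine words (the widths below are guesses, and Columns is just an illustrative columnar layout):

    import qualified Data.Vector.Unboxed as VU
    import Data.Word (Word8, Word16, Word32)

    -- 250e6 rows * 6 fields * 8 bytes  ~ 12 GB if every field is a full Int64.
    -- 250e6 rows * 16 bytes            ~  4 GB with the narrower widths below.
    -- Boxed records add per-object overhead on top of that, which is why an
    -- unboxed, column-per-vector layout matters at this scale.
    type Columns =
      ( VU.Vector Word32  -- rStart, seconds relative to some epoch offset
      , VU.Vector Word32  -- rDuration
      , VU.Vector Word16  -- rStation index
      , VU.Vector Word8   -- rAgent index
      , VU.Vector Word8   -- rCountry index
      , VU.Vector Word32  -- rArea
      )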

Tobias Pflug
Hi,
just the other day I talked to a friend of mine who works for an online radio service. He told me he is currently looking into how best to work with assorted usage data: currently 250 million entries in a 12 GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on.
He came across the K and Q programming languages [1][2], which apparently work nicely for this, as unfamiliar as they might seem.
This certainly is not my area of expertise at all. I was just wondering how some of you would suggest approaching this with Haskell. How would you most efficiently parse such data and evaluate custom queries?
Thanks for your time, Tobi
[1] http://en.wikipedia.org/wiki/K_(programming_language) [2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)
Hi Tobias,

I use Haskell and R (and Matlab) at work. You can certainly do data analysis in Haskell; here is a fairly long example: http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-an... haskell/. IIRC the dataset was about 2 GB, so not dissimilar to the one you are thinking of analysing. I didn't seem to need pipes or conduits but just used cassava. The data were plotted on a map of London (yes, you can draw maps in Haskell) with diagrams and shapefile (http://hackage.haskell.org/package/shapefile).

But R (and pandas in Python) make this sort of analysis easier. As a small example, my data contained numbers like -.1.2 and dates and times. R will happily parse these, but in Haskell you have to roll your own (not that this is difficult, and "someone" ought to write a library like pandas so that the wheel is not continually re-invented). Also, R (and Python) have extensive data analysis libraries, so if e.g. you want to apply Nelder-Mead then a very well documented R package exists; I searched in vain for this in Haskell. Similarly, if you want to construct a GARCH model, then there is not only a package but an active community upon whom you can call for help.

I have the benefit of being able to use this at work: http://ifl2014.github.io/submissions/ifl2014_submission_16.pdf and I am hoping that it will be open-sourced "real soon now", but it will probably not be available in time for your analysis.

I should also add that my workflow for data analysis in Haskell is similar to that in R. I do a small amount of analysis either in a file or at the command line and usually chart the results, again using the command line: http://hackage.haskell.org/package/Chart

I haven't had time to try IHaskell, but I think the next time I have some data analysis to do I will try it out. http://gibiansky.github.io/IHaskell/demo.html http://andrew.gibiansky.com/blog/haskell/finger-trees/

Finally, doing data analysis is quite different from writing quality production code. I would imagine turning Haskell data analysis into production code would be a lot easier than doing so in R.

Dominic.
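To make the "roll your own" point concrete, a small example of parsing a timestamp field with the time package's parseTimeM; the format string is an assumption about what the data might look like:

    import Data.Time (UTCTime, defaultTimeLocale, parseTimeM)

    -- Parses e.g. "2014-11-12 10:45:36"; returns Nothing on malformed input
    -- instead of aborting the whole file.
    parseStamp :: String -> Maybe UTCTime
    parseStamp = parseTimeM True defaultTimeLocale "%Y-%m-%d %H:%M:%S"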
participants (14)
- Brandon Allbery
- Chris Allen
- Christopher Allen
- Christopher Reichert
- Dominic Steinitz
- Jeffrey Brown
- Mark Fredrickson
- Markus Läll
- Michael Snoyman
- Peter Simons
- Richard A. O'Keefe
- Roman Cheplyaka
- Tobias Pflug
- Wojtek Narczyński