I wouldn't hold it against csv-conduit too much; conduit and Pipes both take some getting used to, and I hadn't used either in anger before I started kicking around the CSV parsing stuff. I was a bit spoiled by how easy Cassava was to use as well.

Thanks to Christopher Reichert's PR, there is an example for csv-conduit as well, so you've now got four ways to try processing CSV, *three* of which are streaming :)

I'd say just try each in turn and see what you're happy with, if you're not married to a particular streaming operation.

> I don't think MySQL would perform all that well operating on a table with 125 million entries ;] What approach would you guys take?

Big enough machine with enough memory and it's fine. I used to keep a job queue with a billion rows on MySQL at a gig long ago. Could do it with PostgreSQL pretty easily too. On your personal work machine? I dunno.

Not trying to steer you away from using Haskell here by any means, but if you can process your data in a SQL database efficiently, that's often pretty optimal in terms of speed and ease of use until you start doing more sophisticated analysis. I don't have a lot of experience in data analysis, but I've known people to do some preliminary slicing/dicing in SQL before moving on to building a custom model for understanding the data.

Cheers,
Chris Allen

On Thu, Nov 13, 2014 at 3:37 AM, Tobias Pflug <tobias.pflug@gmx.net> wrote:
On 13.11.2014 02:22, Christopher Allen wrote:
I'm working on a Haskell article for https://howistart.org/ which is actually about the rudiments of processing CSV data in Haskell.

To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest

And my in-progress article here: https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md (please don't post this anywhere, incomplete!)

And here I'll link my notes on profiling memory use with different streaming abstractions: https://twitter.com/bitemyapp/status/531617919181258752

csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
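
To give a feel for the row-wise streaming, here is roughly what counting the well-formed rows with pipes-csv looks like. This is a sketch from memory, not code from the article: the file name and the `parsed` helper are made up, and the exact pipes-csv signatures should be checked against the haddocks.

    module Main where

    import qualified Data.Csv         as Csv
    import           Pipes
    import qualified Pipes.ByteString as PB
    import qualified Pipes.Csv        as PCsv
    import qualified Pipes.Prelude    as P
    import           System.IO        (IOMode (ReadMode), withFile)

    -- Stream the file in chunks, decode it row by row, and count the rows
    -- that parsed cleanly; nothing here holds the whole file in memory.
    main :: IO ()
    main =
      withFile "stats.csv" ReadMode $ \h -> do
        n <- P.length (PCsv.decode Csv.NoHeader (PB.fromHandle h)
                         >-> P.filter parsed)
        print n
      where
        parsed :: Either String [Csv.Field] -> Bool
        parsed = either (const False) (const True)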

Let me know if you have any further questions.

Cheers all.

--- Chris Allen


Thank you, this looks rather useful. I will have a closer look at it for sure. Surprised that csv-conduit was so troublesome. I was in fact expecting/hoping for the opposite. I will just give it a try.

Thanks also to everyone else who replied. Let me add some tidbits to refine the problem space a bit. As I said before, the size of the data is around 12GB of CSV files, one file per month, with each line representing a user tuning in to a stream:

[date-time-stamp], [radio-stream-name], [duration], [mobile|desktop], [country], [areaCode]

which could be represented as:

data RadioStat = RadioStat
                 { rStart     :: Integer      -- POSIX time stamp
                 , rStation   :: Integer      -- index to station map
                 , rDuration  :: Integer      -- duration in seconds
                 , rAgent     :: Integer      -- index to agent map ("mobile", "desktop", ..)
                 , rCountry   :: Integer      -- index to country map ("DE", "CH", ..)
                 , rArea      :: Integer      -- German geo location info
                 }

I guess parsing a CSV file into a [RadioStat] list, with respective entries in a HashMap for the station names,
should work just fine (thanks again for your linked material, Chris).
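
Something like the following is what I have in mind, just using cassava directly (not streaming) for the parsing step. RawStat, loadRaw and internStations are made-up names for illustration, and I'm keeping the date and area fields as Text because I haven't settled on their exact formats yet:

    module RadioParse where

    import           Control.Monad        (mzero)
    import qualified Data.ByteString.Lazy as BL
    import           Data.Csv
    import qualified Data.HashMap.Strict  as HM
    import           Data.Text            (Text)
    import qualified Data.Vector          as V

    -- One CSV line as it appears in the file, before the string fields
    -- are replaced by indices.
    data RawStat = RawStat
      { rawStart    :: Text     -- date-time stamp as text; convert to POSIX time later
      , rawStation  :: Text     -- radio-stream-name
      , rawDuration :: Integer  -- duration in seconds
      , rawAgent    :: Text     -- "mobile" | "desktop" | ..
      , rawCountry  :: Text     -- "DE", "CH", ..
      , rawArea     :: Text     -- area code as text
      }

    instance FromRecord RawStat where
      parseRecord v
        | V.length v == 6 =
            RawStat <$> v .! 0 <*> v .! 1 <*> v .! 2
                    <*> v .! 3 <*> v .! 4 <*> v .! 5
        | otherwise = mzero

    -- Reads a whole (monthly) file into memory; fine for one file,
    -- not for all 12GB at once.
    loadRaw :: FilePath -> IO (Either String (V.Vector RawStat))
    loadRaw path = decode NoHeader <$> BL.readFile path

    -- Assign each station name a dense Integer index, building the
    -- name -> index HashMap as a side product.
    internStations :: [RawStat] -> ([(Integer, RawStat)], HM.HashMap Text Integer)
    internStations = go [] HM.empty
      where
        go acc m []       = (reverse acc, m)
        go acc m (r : rs) =
          case HM.lookup (rawStation r) m of
            Just i  -> go ((i, r) : acc) m rs
            Nothing -> let i = fromIntegral (HM.size m)
                       in go ((i, r) : acc) (HM.insert (rawStation r) i m) rs

The agent and country fields could be interned the same way, or just pattern matched, since there are only a handful of values.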

While this is straightforward, the type of queries I got as examples might indicate that I should not try to
reinvent a query language but look for something else (?). Examples would be (rough sketch after the list):

- summarize per day: total listening duration, average listening duration, number of listening actions
- summarize per day per agent: total listening duration, average listening duration, number of listening actions
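
For the first one I was picturing a plain fold over the parsed rows, something like the sketch below. Summary, dayOf and summarizePerDay are just made-up names, and this assumes the RadioStat record above:

    import           Data.List       (foldl')
    import qualified Data.Map.Strict as M

    type Day = Integer              -- e.g. rStart `div` 86400

    data Summary = Summary
      { sTotal :: !Integer          -- total listening duration in seconds
      , sCount :: !Integer          -- number of listening actions
      }

    dayOf :: RadioStat -> Day
    dayOf r = rStart r `div` 86400

    summarizePerDay :: [RadioStat] -> M.Map Day Summary
    summarizePerDay = foldl' step M.empty
      where
        step m r = M.insertWith combine (dayOf r) (Summary (rDuration r) 1) m
        combine (Summary d c) (Summary d' c') = Summary (d + d') (c + c')

    avgDuration :: Summary -> Double
    avgDuration s = fromIntegral (sTotal s) / fromIntegral (sCount s)

The per-day-per-agent variant would presumably just key the map on (dayOf r, rAgent r) instead of dayOf r.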

I don't think MySQL would perform all that well operating on a table with 125 million entries ;] What approach would you guys take?

Thanks for your input and sorry for the broad scope of these questions.
Best wishes,
Tobi