On 13.11.2014 02:22, Christopher Allen wrote:
I'm working on a Haskell article for https://howistart.org/ which is actually about the rudiments of processing CSV data in Haskell.
Thank you, this looks rather useful. I will have a closer look at it for sure. Surprised that csv-conduit was so troublesome. I was in fact expecting/hoping for the opposite. I will just give it a try.
To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest
And my in-progress article here: https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md (please don't post this anywhere, incomplete!)
And here I'll link my notes on profiling memory use with different streaming abstractions: https://twitter.com/bitemyapp/status/531617919181258752
csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows but I've never seen anything that would be problematic in that realm even when I did a lot of HDFS/Hadoop ecosystem stuff. AFAICT with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too based on my guess from glancing at the rather scary code.
Let me know if you have any further questions.
Cheers all.
--- Chris Allen
Thanks also to everyone else who replied. Let me add some tidbits to refine the problem space a bit. As I said before, the data is around 12GB of CSV files: one file per month, with
each line representing a user tuning in to a stream:
[date-time-stamp], [radio-stream-name], [duration], [mobile|desktop], [country], [areaCode]
which could be represented as:
data RadioStat = RadioStat
  { rStart    :: Integer -- POSIX time stamp
  , rStation  :: Integer -- index to station map
  , rDuration :: Integer -- duration in seconds
  , rAgent    :: Integer -- index to agent map ("mobile", "desktop", ..)
  , rCountry  :: Integer -- index to country map ("DE", "CH", ..)
  , rArea     :: Integer -- German geo location info
  }
I guess parsing the CSV into a list of RadioStat values, with the respective entries for the station names kept in a HashMap,
should work just fine (thanks again for your linked material, Chris).
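A minimal sketch of the "index to station map" idea, assuming each distinct name gets the next free index in order of first appearance. Data.Map.Strict from containers stands in for a HashMap here, and the names Interner, intern and internAll are hypothetical:

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as M

-- Table from a name (station, agent, country, ..) to its index.
-- Assumption: indices are handed out in order of first appearance.
type Interner = M.Map String Integer

-- Look a name up; if it is new, give it the next free index.
intern :: Interner -> String -> (Interner, Integer)
intern table name =
  case M.lookup name table of
    Just ix -> (table, ix)
    Nothing ->
      let ix = fromIntegral (M.size table)
      in (M.insert name ix table, ix)

-- Intern a whole column of names, threading the table through.
internAll :: [String] -> (Interner, [Integer])
internAll = mapAccumL intern M.empty
```

The same table type could serve for the agent and country columns, so each RadioStat only carries small integers instead of repeated strings.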
While this is straightforward, the type of queries I got as examples might indicate that I should not try to
reinvent a query language but look for something else (?). Examples would be
- summarize per day: total listening duration, average listening duration, number of listening actions
- summarize per day per agent: total listening duration, average listening duration, number of listening actions
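Both example queries are group-by aggregations, so they could be sketched as one strict fold into a map. This is only a rough sketch under some assumptions: "day" means the UTC day number derived from the POSIX time stamp, Data.Map.Strict from containers holds the groups, and Summary, sAverage, dayOf, summarizeBy, perDay and perDayPerAgent are hypothetical names:

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

data RadioStat = RadioStat
  { rStart    :: Integer -- POSIX time stamp
  , rStation  :: Integer -- index to station map
  , rDuration :: Integer -- duration in seconds
  , rAgent    :: Integer -- index to agent map
  , rCountry  :: Integer -- index to country map
  , rArea     :: Integer -- German geo location info
  }

-- Running totals for one group; strict fields avoid building up thunks.
data Summary = Summary
  { sTotal :: !Integer -- total listening duration in seconds
  , sCount :: !Integer -- number of listening actions
  } deriving (Eq, Show)

-- Average duration, derived from the totals on demand.
sAverage :: Summary -> Double
sAverage (Summary total count)
  | count == 0 = 0
  | otherwise  = fromIntegral total / fromIntegral count

-- Assumption: a "day" is the UTC day number since the epoch.
dayOf :: RadioStat -> Integer
dayOf r = rStart r `div` 86400

-- Combine the totals of two summaries.
merge :: Summary -> Summary -> Summary
merge (Summary t1 c1) (Summary t2 c2) = Summary (t1 + t2) (c1 + c2)

-- Group by an arbitrary key and accumulate a Summary per group.
summarizeBy :: Ord k => (RadioStat -> k) -> [RadioStat] -> M.Map k Summary
summarizeBy key = foldl' step M.empty
  where
    step acc r = M.insertWith merge (key r) (Summary (rDuration r) 1) acc

perDay :: [RadioStat] -> M.Map Integer Summary
perDay = summarizeBy dayOf

perDayPerAgent :: [RadioStat] -> M.Map (Integer, Integer) Summary
perDayPerAgent = summarizeBy (\r -> (dayOf r, rAgent r))
```

Because the fold only keeps one Summary per group, the same step function would work when the rows arrive from a streaming parser rather than from a list held in memory.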
I don't think MySQL would perform all that well operating on a table with 125 million entries ;] What approach
would you guys take?
Thanks for your input and sorry for the broad scope of these questions.
best wishes,
Tobi