I'm working on a Haskell article for https://howistart.org/ which covers the rudiments of processing CSV data in Haskell.

To that end, take a look at my rather messy workspace here: https://github.com/bitemyapp/csvtest

And my in-progress article here: https://github.com/bitemyapp/howistart/blob/master/haskell/1/index.md (please don't post this anywhere, incomplete!)

And here I'll link my notes on profiling memory use with different streaming abstractions: https://twitter.com/bitemyapp/status/531617919181258752

csv-conduit isn't in the test results because I couldn't figure out how to use it. pipes-csv is proper streaming, but uses cassava's parsing machinery and data types. Possibly this is a problem if you have really wide rows, but I've never seen anything problematic in that realm, even when I did a lot of HDFS/Hadoop ecosystem work. AFAICT, with pipes-csv you're streaming rows, but not columns. With csv-conduit you might be able to incrementally process the columns too, based on my guess from glancing at the rather scary code.
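To illustrate the row-streaming shape without pulling in cassava or pipes-csv, here's a base+bytestring-only sketch: fold strictly over lazily-read lines so the accumulator stays small no matter how big the file is. The comma split is deliberately naive (no quoting support) and the sample data is invented; real code would use cassava's parser for correctness.

```haskell
import qualified Data.ByteString.Lazy.Char8 as BL
import           Data.List (foldl')

-- Naive field split for illustration only; does NOT handle quoted
-- fields or embedded commas the way cassava does.
fields :: BL.ByteString -> [BL.ByteString]
fields = BL.split ','

-- Strict left fold over rows: the lazy ByteString is consumed chunk by
-- chunk, so memory use stays constant even for a 12GB file.
countRows :: BL.ByteString -> Int
countRows = foldl' (\n _ -> n + 1) 0 . BL.lines

main :: IO ()
main = do
  let sample = BL.pack "chan1,300,Mozilla\nchan2,120,Chrome\n"
  print (countRows sample)
  mapM_ (print . fields) (BL.lines sample)
```

The same fold-over-rows structure is what pipes-csv gives you, just with cassava's real parser producing the rows instead of BL.lines.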

Let me know if you have any further questions.

Cheers all.

--- Chris Allen





On Wed, Nov 12, 2014 at 4:17 PM, Markus Läll <markus.l2ll@gmail.com> wrote:

Hi Tobias,

What he could do is encode the column values as appropriately sized Words to reduce the size -- to make it fit in RAM. E.g. listening times as seconds, browsers as categorical variables (in statistics terms), etc. If some of the columns are arbitrary-length strings, then it seems possible to get the 12GB down by more than half.
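A sketch of that encoding idea, with field names and widths invented for illustration: each row becomes a few fixed-width Words instead of an arbitrary-length string.

```haskell
import Data.Word (Word8, Word16, Word32)

-- Strict, fixed-width fields: roughly 7 bytes of payload per row.
-- The field choices here are hypothetical, not from the actual data.
data Entry = Entry
  { listenSeconds :: !Word32  -- listening time in seconds
  , channelId     :: !Word16  -- channel as a categorical code
  , browserId     :: !Word8   -- user agent as a categorical code
  } deriving (Show, Eq)

-- Hypothetical categorical encoding; a real version would build the
-- code table from the data itself.
encodeBrowser :: String -> Word8
encodeBrowser "Mozilla" = 0
encodeBrowser "Chrome"  = 1
encodeBrowser _         = 255

main :: IO ()
main = print (Entry 300 42 (encodeBrowser "Chrome"))
```

Packed like this (e.g. in an unboxed vector), 250 million rows would be on the order of a couple of gigabytes of payload rather than 12GB, though actual savings depend on how stringy the original columns are.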

If he doesn't know Haskell, then I'd suggest using another language. (Years ago I tried to do a bigger uni project in Haskell -- being a noob -- and failed miserably.)

On Nov 12, 2014 10:45 AM, "Tobias Pflug" <tobias.pflug@gmx.net> wrote:
Hi,

just the other day I talked to a friend of mine who works for an online radio service. He told me he is currently looking into how best to work with assorted usage data: currently 250 million entries in a 12GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on.

He accidentally ran into the K and Q programming languages [1][2], which apparently work nicely for this, as unfamiliar as they might seem.

This is certainly not my area of expertise at all. I was just wondering how some of you would suggest approaching this with Haskell. How would you most efficiently parse such data and evaluate custom queries?

Thanks for your time,
Tobi

[1] http://en.wikipedia.org/wiki/K_(programming_language)
[2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe