
Tobias Pflug
Hi,
just the other day I talked to a friend of mine who works for an online radio service. He told me he was currently looking into how best to work with assorted usage data: currently 250 million entries in a 12 GB CSV, comprising information such as which channel was tuned in, for how long, with which user agent, and so on.
He happened upon the K and Q programming languages [1][2], which apparently work nicely for this, as unfamiliar as they might seem.
This certainly is not my area of expertise at all. I was just wondering how some of you would suggest approaching this with Haskell. How would you most efficiently parse such data and evaluate custom queries?
Thanks for your time, Tobi
[1] http://en.wikipedia.org/wiki/K_(programming_language)
[2] http://en.wikipedia.org/wiki/Q_(programming_language_from_Kx_Systems)
Hi Tobias,

I use Haskell and R (and Matlab) at work. You can certainly do data analysis in Haskell; here is a fairly long example: http://idontgetoutmuch.wordpress.com/2013/10/23/parking-in-westminster-an-an... haskell/. IIRC the dataset was about 2G, so not dissimilar to the one you are thinking of analysing. I didn't seem to need pipes or conduits but just used cassava.

The data were plotted on a map of London (yes, you can draw maps in Haskell) with diagrams and shapefile (http://hackage.haskell.org/package/shapefile).

But R (and pandas in Python) make this sort of analysis easier. As a small example, my data contained numbers like -.1.2 as well as dates and times. R will happily parse these, but in Haskell you have to roll your own (not that this is difficult, and "someone" ought to write a library like pandas so that the wheel is not continually re-invented).

Also, R (and Python) have extensive data analysis libraries, so if e.g. you want to apply Nelder-Mead then a very well documented R package exists; I searched in vain for this in Haskell. Similarly, if you want to construct a GARCH model, then there is not only a package but an active community upon whom you can call for help.

I have the benefit of being able to use this at work: http://ifl2014.github.io/submissions/ifl2014_submission_16.pdf. I am hoping that it will be open-sourced "real soon now", but it will probably not be available in time for your analysis.

I should also add that my workflow for data analysis in Haskell is similar to that in R: I do a small amount of analysis, either in a file or at the command line, and usually chart the results, again from the command line: http://hackage.haskell.org/package/Chart

I haven't had time to try IHaskell, but I think the next time I have some data analysis to do I will try it out:
http://gibiansky.github.io/IHaskell/demo.html
http://andrew.gibiansky.com/blog/haskell/finger-trees/

Finally, doing data analysis is quite different from writing quality production code.
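For the parsing itself, a minimal cassava sketch might look like the following. The column names (channel, duration, userAgent) and the record type are made up for illustration; your actual schema will differ. It assumes a cassava version recent enough to derive FromNamedRecord via GHC.Generics:

```haskell
{-# LANGUAGE DeriveGeneric, OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Csv (FromNamedRecord, decodeByName)
import qualified Data.Vector as V
import GHC.Generics (Generic)

-- One row of the (hypothetical) usage log.
data Entry = Entry
  { channel   :: !String
  , duration  :: !Int      -- seconds tuned in
  , userAgent :: !String
  } deriving (Show, Generic)

-- Derived generically: field names must match the CSV header names.
instance FromNamedRecord Entry

-- A toy "custom query": total listening time across the whole file.
totalSeconds :: BL.ByteString -> Either String Int
totalSeconds bs = do
  (_, rows) <- decodeByName bs
  return (V.sum (V.map duration rows))

main :: IO ()
main = do
  let csv = BL.pack "channel,duration,userAgent\njazz,120,Mozilla\nrock,300,curl\n"
  print (totalSeconds csv)   -- Right 420
```

For 12 GB you would stream rather than decode the whole file at once, e.g. with cassava's incremental interface or via pipes-csv/conduit, but the record-type-plus-fold shape stays the same.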
I would imagine turning Haskell data analysis into production code would be a lot easier than doing this in R. Dominic.