Hi Nikita,

I've been thinking about data import issues recently as well (in this case, NetCDF files: https://github.com/ian-ross/hnetcdf is kind of nasty and unfinished, but it will get there in the end).  Easy access to data files of all different kinds is one of R's really big advantages.  I've done tasks in the past where I've needed to read CSV files, NetCDF files, GeoTIFF data and ESRI shapefiles, all for the same job.  R handles them all seamlessly, giving you a very open data analysis platform.  I'd like Haskell eventually to be a similarly open platform with as easy a workflow as R for data analysis, but it's going to take some work to get there.

As you kind of say, one thing that's tricky is deciding which matrix libraries to support.  I don't have any good ideas about that.  With the NetCDF stuff, I've been experimenting with making all of the get and put functions polymorphic in a "store" type, allowing you to read and write Storable.Vectors, Repa arrays and hmatrix matrices.  I'm not convinced I'm doing it right, but the alternative is to support only a single array type, which seems restrictive until the community settles on a single canonical array type (if that's even possible).
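To make that concrete, here's a minimal sketch of the "store" class idea: the get/put functions are written against a small interface, and each array library gets its own instance.  The class and method names below are illustrative only, not the actual hnetcdf API, and for simplicity it round-trips through lists rather than foreign pointers:

```haskell
import Foreign.Storable (Storable)
import qualified Data.Vector.Storable as SV

-- Hypothetical "store" interface: anything that can hold the values
-- read from a file.  Get/put functions would be polymorphic in 's'.
class NcStore s where
  fromStore :: Storable a => s a -> [a]
  toStore   :: Storable a => [a] -> s a

-- One instance per supported array type; Repa and hmatrix instances
-- would follow the same pattern.
instance NcStore SV.Vector where
  fromStore = SV.toList
  toStore   = SV.fromList

main :: IO ()
main = print (fromStore (toStore [1, 2, 3] :: SV.Vector Double))
```

The point of the extra class is that callers choose the array type with a single type annotation, and adding support for a new library is just one more instance.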

I guess from the perspective of what you could do with your hmatrix-labeled code, aiming for something as flexible as R's read.table (of which read.csv and friends are just specialisations) that supports at least a few of the common Haskell array types would be nice.  However, there's a danger of duplicating some of the good work that's already been done on fast CSV parsers (cassava, csv-conduit and pipes-csv).  Cassava, in particular, has a nice lightweight API that's very suitable for interactive work.  If you extended the cassava parser to support a wider range of file formats (like read.table) and added some helpers for converting to array types where it makes sense, that might be enough.
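For what it's worth, the cassava end of that might look something like this: decode a headerless file into a Vector of tuples and convert from there.  (The sample data here is made up, but decode and NoHeader are cassava's real API.)

```haskell
{-# LANGUAGE OverloadedStrings #-}

-- Parse untyped CSV into a Vector of typed rows with cassava.
import Data.Csv (HasHeader (..), decode)
import qualified Data.Vector as V

main :: IO ()
main =
  -- 'decode' takes a lazy ByteString; OverloadedStrings supplies one here.
  case decode NoHeader "Foo,1,-2\nBaz,41,4.2324234e7\n" of
    Left err -> putStrLn err
    Right rs -> print (rs :: V.Vector (String, Double, Double))
```

A read.table-style layer could sit on top of something like this, handling whitespace-separated input and row/column labels before handing the numeric part off to whatever array type the caller wants.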

Cheers,

Ian.



On 19 February 2014 20:06, Nikita Karetnikov <nikita@karetnikov.org> wrote:
I like how easy it is to import data in R and Octave.  (See [1] for a
typical workflow.)  Since I couldn't find any matching library on
Hackage, I cooked up my own [2] in a couple of days.

Here's an example.  Let's start by creating a poorly formatted dataset:

$ cat > test.txt
                         One         Two Longish   Four
     Foo  1       -2            3.0     4.0
Looooooong  5.0   6.0   -72.0                 8.0
        Baz 41.0 4.2324234e7  43.0 1.111111144E-12

Then we parse it with 'readFile', mangle the data a bit, and display in
GHCi:

λ> import qualified Data.Packed.LMatrix.IO as L
λ> import qualified Data.Packed.LMatrix as L
λ> do m <- L.readFile "test.txt"; return . L.trans . L.reverseRows $ L.map (+1) m
(4><3)
                       Baz Looooooong  Foo
    One               42.0        6.0  2.0
    Two        4.2324235e7        7.0 -1.0
Longish               44.0      -71.0  4.0
   Four 1.0000000000011111        9.0  5.0

Now I'm wondering how to make it better.  I'm planning to add
documentation, augment the parser to accept CSV, and maybe support other
matrix libraries.  What's missing?  Would you like to see it on Hackage?
And if not, why?

[1] http://astrostatistics.psu.edu/su09/lecturenotes/pca.html
[2] https://gitorious.org/hmatrix-labeled

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe




--
Ian Ross   Tel: +43(0)6804451378   ian@skybluetrades.net   www.skybluetrades.net