
Justin Paston-Cooper
Dear All,
Recently I have been doing a lot of CSV processing. I initially tried to use the Data.Csv (cassava) library provided on Hackage, but I found this to still be too slow for my needs. In the meantime I have reverted to hacking something together in C, but I have been left wondering whether a tidy solution might be possible to implement in Haskell.
Have you tried profiling your cassava implementation? In my experience I've found it's quite quick. If you have an example of a slow path I'm sure Johan (cc'd) would like to know about it.
I would like to build a library that satisfies the following:
1) Run a function <<a_1 -> ... -> a_n -> m (Maybe (b_1, ..., b_n))>>, with <<m>> some monad and the <<a>>s and <<b>>s being input and output.
2) Be able to specify a maximum record string length and output record string length, so that the string buffers used for reading and outputting lines can be reused, preventing the need for allocating new strings for each record.
3) Allocate only once the memory where the parsed input values and output values are put.
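As a point of comparison for point 1, here is a minimal sketch of the plain, allocation-heavy idiomatic shape: a per-record function that may or may not produce an output record, driven over lazily read input. Everything named here (`Record`, `splitRecord`, `joinRecord`, `processCsv`, `sumRow`) is hypothetical and made up for illustration, and the comma splitting is deliberately naive (no quoting or escaping), so it is not a substitute for cassava. It does not attempt the buffer reuse in points 2 and 3.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL
import Text.Read (readMaybe)

-- A record is a list of strict fields (hypothetical alias for this sketch).
type Record = [B.ByteString]

-- Naive split on commas; quoted fields are NOT handled (assumption).
splitRecord :: BL.ByteString -> Record
splitRecord = B.split ',' . BL.toStrict

joinRecord :: Record -> B.ByteString
joinRecord = B.intercalate ","

-- Run f on every input line; hand the result to emit when f returns Just.
-- The input is consumed line by line, so with lazy input the whole file
-- does not need to be resident at once.
processCsv :: Monad m
           => (Record -> m (Maybe Record))  -- per-record function
           -> (Record -> m ())              -- output action
           -> BL.ByteString
           -> m ()
processCsv f emit = mapM_ step . BL.lines
  where
    step line = f (splitRecord line) >>= maybe (pure ()) emit

-- Example per-record function: sum the integer fields of a record,
-- dropping any record that contains a non-integer field.
sumRow :: Record -> Maybe Record
sumRow fields = do
  ns <- mapM (readMaybe . B.unpack) fields :: Maybe [Int]
  Just [B.pack (show (sum ns))]

main :: IO ()
main = BL.getContents >>= processCsv (pure . sumRow) (B.putStrLn . joinRecord)
```

Compiled with optimisation and fed a file on standard input, this writes one sum per parseable line. It is only meant as a baseline to profile against before reaching for manually managed buffers.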
Ultimately this could be rather tricky to enforce. Haskell code generally does a lot of allocation and the RTS is well optimized to handle this. I've often found that trying to shoehorn a non-idiomatic "optimal" imperative approach into Haskell produces worse performance than the more readable, idiomatic approach. I understand this leaves many of your questions unanswered, but I'd give the idiomatic approach a bit more time before trying to coerce C into Haskell. Profile, see where the hotspots are, and optimize appropriately. If the profile has you flummoxed, the lists and #haskell are always willing to help given the time.
Cheers,
- Ben