another Newbie performance question

Hi everybody,

I was doing an assignment in Java for my university, a program that reads, modifies and writes CSV files, when I suddenly had the idea of implementing parts of it in Haskell for fun. When I finished the Haskell program, I was disappointed by the performance: to parse a 200k-line CSV file, insert a line at position 199999 (yes, I know I could insert a line without parsing the file; that's just an example) and write the file again, the Java program takes 1.1 seconds while the Haskell program takes 12.5 seconds.

I have read Don's blog post but am unsure how to implement his tips in my program, as I am still kind of a Haskell beginner.

The source code (40 lines incl. comments and empty lines), the 200k CSV file I used for testing, and a smaller CSV file demonstrating the special easy-to-parse CSV syntax are available on my FTP server: ftp://baah.servegame.org/public/haskell

The call syntax is

    <program> <csv file> <line to insert>

e.g.

    main lang.csv "test","this","line"

I haven't posted the source code here directly because I thought it might be too long. If someone here finds the time to look at my code and give me some hints, that would really be nice.

Regards
Philip

Philip Müller wrote:
I have read Don's blog post but am unsure how to implement his tips into my program, as I am still kind of a Haskell beginner.
Dan, you seem to have opened a big can of worms. If Haskell is successful, it's your fault.

Without doing any compiling, staring at core, or profiling myself, I advise that you:

1) traverse the list less often.
2) use ByteStrings.
3) use an intermediate data structure that has better insert behaviour than a standard list; have a look at Data.Sequence.
4) really, really use ByteStrings.
5) listen to me when I tell you to use ByteStrings.
6) if you already must write in pointless style, please don't also order the functions in a backward way.

--
(c) this sig last receiving data processing entity. Inspect headers for past copyright information. All rights reserved. Unauthorised copying, hiring, renting, public performance and/or broadcasting of this signature prohibited.
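Tips 2 and 3 can be sketched together like this. This is a hypothetical snippet, not from Philip's actual source; it assumes the bytestring package and a recent containers (Data.Sequence.insertAt exists since containers 0.5.8):

```haskell
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Sequence as Seq
import Data.Foldable (toList)

-- Insert a line at a given position. Seq.insertAt is
-- O(log(min(i, n-i))), versus O(n) for splitting a list.
insertLineAt :: Int -> B.ByteString -> Seq.Seq B.ByteString -> Seq.Seq B.ByteString
insertLineAt = Seq.insertAt

main :: IO ()
main = do
  let csv  = Seq.fromList (B.lines (B.pack "\"a\"\n\"b\"\n\"c\""))
      csv' = insertLineAt 2 (B.pack "\"new\"") csv
  B.putStr (B.unlines (toList csv'))
```

Lazy ByteStrings also keep the file in a handful of large chunks instead of one heap cell per character, which is where most of the String overhead goes.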

On Sat, May 17, 2008 at 5:22 PM, Philip Müller wrote:
If someone here finds the time to look at my code and give me some hints, that would really be nice.
A little experimentation reveals that your main bottleneck is readCSVLine:

    readCSVLine = read . (\x -> "[" ++ x ++ "]")

(I just replaced it with (:[]) and it sped up enormously.) Thus, I rewrote it myself instead of with read:

    readCSVLine = unfoldr builder
        where
        builder [] = Nothing
        builder xs = Just $ readField xs
        readField [] = ([], [])
        readField (',':xs) = ([], xs)
        readField ('"':xs) =
            let (l, '"':r) = span (/= '"') xs
                (field, rest) = readField r
            in  (l ++ field, rest)

That decreased the runtime from 17 seconds to 4.2. It probably admits an even better implementation, but it's likely that this is not the bottleneck anymore.

The other thing is that the whole table is stored in memory because of your call to "length csv" in doInteraction. This may have been the intent. But if not, you can cut another second off the runtime by "streaming" the file, using a function that lazily inserts the line in the second-to-last position:

    insertLine line csv =
        let (l, r) = splitLast csv
        in  l ++ [readCSVLine line] ++ r
        where
        splitLast [x] = ([], [x])
        splitLast (x:xs) = let (l, r) = splitLast xs in (x:l, r)

(Note that I got rid of the "pos" parameter.) Presumably in a real application you are scanning until you see something and then inserting near that, which is a lazy, stream-like operation.

There are probably a few other tricks you could do, but I think I identified the main factors.

Luke
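Luke's streaming insert can be tried standalone. This sketch simplifies his insertLine to take an already-parsed row, so it does not depend on readCSVLine; the point is that the head of the result is available before splitLast has reached the end of the input:

```haskell
import Data.List (intercalate)

-- Insert a row in the second-to-last position, lazily: rows before
-- the insertion point can be emitted as soon as they are read.
insertLine :: [String] -> [[String]] -> [[String]]
insertLine line csv =
  let (l, r) = splitLast csv
  in  l ++ [line] ++ r
  where
    splitLast [x]    = ([], [x])
    splitLast (x:xs) = let (l, r) = splitLast xs in (x : l, r)

main :: IO ()
main = mapM_ (putStrLn . intercalate ",")
             (insertLine ["new"] [["a"], ["b"], ["c"]])
```

Because splitLast produces its first component with a lazy (x : l) cons, the prefix streams through while only the final row is held back until end of input.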

Achim Schneider wrote:
1) traverse the list less often.
Luke Palmer wrote:
Philip Müller wrote:
If someone here finds the time to look at my code and give me some hints, that would really be nice.
There are probably a few other tricks you could do, but I think I identified the main factors.
List concatenation (++) takes time proportional to the length of the first list. In other words, every (xs ++) traverses xs again. So, things like (x ++ "\n") are expensive. Better write writeCSV as

    writeCSV = unlines . map (concat . intersperse "," . map show)

Here, unlines is a library function which implements your

    (\x -> x ++ "\n") . concat . intersperse "\n"

but slightly more efficiently.

In general, difference lists are better for output (concatenation) than Strings and ByteStrings. So, if you want to squeeze even more speed out of your program, use those.

Regards,
apfelmus
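A sketch of the difference-list idea apfelmus mentions, using the Prelude's ShowS type (String -> String). Output is built as composed functions, so each concatenation is O(1) no matter how the pieces nest; this mirrors Philip's writeCSV but is not his code:

```haskell
import Data.List (intersperse)

-- Render a table: each field via shows (which quotes Strings, like
-- Philip's "map show"), fields joined by commas, rows by newlines.
writeCSV :: [[String]] -> String
writeCSV rows = foldr (\row k -> rowS row . showChar '\n' . k) id rows ""
  where
    rowS :: [String] -> ShowS
    rowS = foldr (.) id . intersperse (showChar ',') . map shows

main :: IO ()
main = putStr (writeCSV [["a", "b"], ["c", "d"]])
```

The left-nested (++) chains that make naive output quadratic become right-associated function applications here, which is exactly why ShowS is how the standard Show class itself is defined.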

You want to lazily read a CSV file? I had a crack at writing a module for that:

http://peteg.org/blog/AYAD/Project/2008-04-11-LazyCSVParser.autumn

It is not quite RFC compliant (I believe that is clearly commented upon). As I say there, anything based on (older versions of) Parsec is likely to be too strict in the general case.

cheers
peter
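For a simplified syntax like Philip's, a lazy line-oriented parse needs no parser library at all. A minimal sketch (this is not Peter's module and is not RFC 4180 compliant; it assumes no embedded newlines or commas inside fields):

```haskell
-- Split a string on a separator character; returns at least one field.
splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
  (field, [])       -> [field]
  (field, _ : rest) -> field : splitOn c rest

-- lines and map are lazy, so rows are produced as the input streams in.
parseCSV :: String -> [[String]]
parseCSV = map (splitOn ',') . lines

main :: IO ()
main = print (parseCSV "a,b\nc,d")
```

A Parsec-based parser, by contrast, typically demands the whole input before returning any result, which is the strictness problem Peter refers to.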

On Sat, 17 May 2008, Luke Palmer wrote:
insertLine line csv =
    let (l, r) = splitLast csv
    in  l ++ [readCSVLine line] ++ r
    where
    splitLast [x] = ([], [x])
    splitLast (x:xs) = let (l, r) = splitLast xs in (x:l, r)
(Note that I got rid of the "pos" parameter)
I'd like to have a function like

    Data.List.viewR :: [a] -> Maybe ([a], a)

analogously to Data.Sequence, which would be a safe replacement for 'init' and 'last'.
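A direct sketch of the function Henning describes (not an actual Data.List export; the name and signature follow Data.Sequence's viewr):

```haskell
-- Split a list into its init and last, totally: Nothing on [].
viewR :: [a] -> Maybe ([a], a)
viewR []       = Nothing
viewR (x : xs) = Just (go x xs)
  where
    go y []       = ([], y)
    go y (z : zs) = let (ys, l) = go z zs in (y : ys, l)

main :: IO ()
main = do
  print (viewR [1, 2, 3 :: Int])
  print (viewR ([] :: [Int]))
```

Like Luke's splitLast, the init part is built with lazy conses, so it can serve the same streaming role while also handling the empty list safely.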
participants (6)

- Achim Schneider
- apfelmus
- Henning Thielemann
- Luke Palmer
- Peter Gammie
- Philip Müller