
Hello,
Good, writeCSV writes out every row immediately after it got it. I eliminated (++ [nl]) in the hope of reducing the constant factor slightly. Using difference lists for that is nicer but here you go.
I'm not sure how you'd use difference lists here. Also, for some reason GHC runs slightly slower (compiled with -O) using sequencing instead of ++; the version with >> consistently (a very unrigorous consistence) takes between 5 and 8 seconds longer than the version with ++ (when filtering out some columns from every row and writing the new file). Is there some hidden cost in switching from the mapM_ to a direct sequence?
Concerning select, one myFilter can be fused away and there is the "transpose trick" for filtering out the columns: columns get filtered once and for all and (map (`elem` tags)) only needs to be computed once. I don't know why the MonadReader is necessary, so I removed it
You're right, no need for myFilter on the rows. I used the Reader because at first I thought it would make a nicer interface (I have since changed my mind about that) and it didn't seem to hurt performance. I think using transpose requires more work (and is slower) than filtering each row, particularly if the mask (map (`elem` tags)) is only computed once (shouldn't the compiler do that automatically since the expression is constant?). Transpose requires passing over the entire structure at least once (possibly more depending on how clever the compiler is); so that is a minimum of two complete passes over the structure just to transpose and untranspose. Filtering each row is just one pass over the structure (doing very little work assuming the mask is computed once). My (still very unrigorous) performance tests seem to bear this out where the transposed version consistently takes a few seconds longer.
I hope so... Though the 13 seconds are disproportionally high (only 22000 rows to be written) compared to 90 seconds (750000 rows to be written).
Yes, that does seem suspicious. Perhaps the cost of the testing the filter on each row accounts for this? I've attached my current fastest code if you, or anybody, is interested in taking a shot at tightening it further (or just interested in lightweight CSV-like processing). thanks, Jeff ---------- readCSV file = do v <- B.readFile file return $ map (mySplit ',') $ B.lines v writeCSV file tbl = do h <- openFile file WriteMode let writeRow = mapM_ (B.hPut h) . (++ [nl]) . intersperse comma mapM_ writeRow tbl hClose h where comma = B.singleton ',' nl = B.singleton '\n' select targs test (cols : rows) = map narrow (cols : filter (test cols) rows) where narrow = map snd . filter fst . zip mask mask = map (`elem` targs) cols col x cols row = row !! i where Just i = lookup x $ zip cols [0..] {-- A slightly smarter ByteString split to deal with quotes and remove whitespace. This could be optimized somewhat (e.g. folding killspace into mySplit) but the current performance seems to be as good as ByteString.split-- i.e. using ByteString.split versus mySplit doesn't affect running time. --} mySplit c bs = go False 0 bs where go isQuote i xs | i >= B.length xs = [killSpace xs] | isQuote && x0 == '\\' = go True (i+2) xs | x0 == '"' = go (not isQuote) (i+1) xs | not isQuote && x0 == c = killSpace x' : go False 0 (B.tail xs') | otherwise = go isQuote (i+1) xs where x0 = B.index xs i (x',xs') = B.splitAt i xs killSpace = B.dropWhile isSpace . dropEndWhile isSpace dropEndWhile test x | B.null x = x | otherwise = if test $ B.last x then dropEndWhile test $ B.init x else x