
If the strings are relatively short, there can be a bottleneck in the current Ord instance for Bytestrings. When lots of them are in a map, the ffi calls to memcmp dominate. I've a fix for this (do it all in Haskell for small strings), and can push it if someone complains some more. jefferson.r.heard:
I thought this was fairly straightforward, but where the marked line finishes in 0.31 seconds on my machine, the actual transpose takes more than 5 minutes. I know it must be possible to read data in haskell faster than this. I'm trying to read a 100MB comma delimited file. I've tried both CSV modules, and these take even longer to read the file. Is there some general best-practice for reading and parsing large amounts of data that I'm not aware of?
I have tried, by the way, a couple of things, including putting a bang (!) before row in transposeRow and using foldr instead of foldl', but that didn't change anything other than force me to increase the stack size to 100M on the command line.
I'm running in the profiler now, and I'll update this, but I thought I would check and see if my head was on remotely straight to begin with.
-- Jeff
--- module ColumnMajorCSV where
import qualified Data.ByteString.Char8 as S import qualified Data.Map as M import qualified Data.List as L
transposeRow cols row = zipWith (:) row cols
transposeCSV :: [[S.ByteString]] -> M.Map String [S.ByteString] transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet) where spreadsheet = L.foldl' transposeRow emptySheet rows emptySheet = take (length header) $ repeat []
dataFromFile :: String -> IO (M.Map String [S.ByteString]) dataFromFile filename = do f <- S.readFile filename print . length . map (S.split ',' $!) . S.lines $ f -- finishes in 0.31 seconds return . transposeCSV . map (S.split ',' $!) . S.lines $ f -- this takes 5 minutes and 6 seconds _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe