
I thought this was fairly straightforward, but where the marked line finishes in 0.31 seconds on my machine, the actual transpose takes more than 5 minutes. I know it must be possible to read data in Haskell faster than this. I'm trying to read a 100MB comma-delimited file. I've tried both CSV modules, and these take even longer to read the file. Is there some general best practice for reading and parsing large amounts of data that I'm not aware of?

I have tried a couple of things, by the way, including putting a bang (!) before row in transposeRow and using foldr instead of foldl', but that didn't change anything other than forcing me to increase the stack size to 100M on the command line.

I'm running in the profiler now, and I'll update this, but I thought I would check and see if my head was on remotely straight to begin with.

-- Jeff

---

module ColumnMajorCSV where

import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M
import qualified Data.List as L

transposeRow cols row = zipWith (:) row cols

transposeCSV :: [[S.ByteString]] -> M.Map String [S.ByteString]
transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
    where spreadsheet = L.foldl' transposeRow emptySheet rows
          emptySheet  = take (length header) $ repeat []

dataFromFile :: String -> IO (M.Map String [S.ByteString])
dataFromFile filename = do
    f <- S.readFile filename
    print . length . map (S.split ',' $!) . S.lines $ f        -- finishes in 0.31 seconds
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f -- this takes 5 minutes and 6 seconds

If the strings are relatively short, there can be a bottleneck in the current Ord instance for ByteStrings: when lots of them are in a map, the FFI calls to memcmp dominate. I've a fix for this (do it all in Haskell for small strings), and can push it if someone complains some more.

jefferson.r.heard:
[snip]
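For readers wondering what "do it all in Haskell for small strings" would look like, here is a minimal sketch of the idea: a byte-by-byte comparison loop instead of a call through the FFI. This is illustrative only, not Don's actual patch, and compareShort is a hypothetical name.

import qualified Data.ByteString.Char8 as S

-- Sketch of an all-Haskell Ord comparison for short ByteStrings,
-- avoiding the per-comparison FFI overhead of memcmp.
compareShort :: S.ByteString -> S.ByteString -> Ordering
compareShort a b = go 0
  where
    la = S.length a
    lb = S.length b
    n  = min la lb
    go i
      | i >= n    = compare la lb  -- equal prefix: shorter string sorts first
      | otherwise = case compare (S.index a i) (S.index b i) of
                      EQ    -> go (i + 1)
                      other -> other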

I've switched to this, which gets rid of the ByteString instances fairly quickly; none of them make it into the final map. I'm still not getting any faster response out of it, and it also complains that my stack size is too small for anything of about 128K records or more.
import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M
import qualified Data.List as L
transposeRow cols row = zipWith (:) (map (read . S.unpack) row) cols

transposeCSV :: [[S.ByteString]] -> M.Map String [Float]
transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
    where spreadsheet = L.foldl' transposeRow emptySheet rows
          emptySheet  = take (length header) $ repeat []

dataFromFile :: String -> IO (M.Map String [Float])
dataFromFile filename = do
    f <- S.readFile filename
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f
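Not a suggestion from the thread, but worth sketching: the manual strict fold can be replaced by Data.List.transpose, which sidesteps the stack problem because the columns are only forced as they are consumed. transposeCSV' is a hypothetical variant name, assuming the same imports as above.

-- Sketch: transpose via Data.List.transpose instead of a strict left
-- fold. Note that, unlike the fold (which prepends rows), this keeps
-- each column in file order rather than reversed.
transposeCSV' :: [[S.ByteString]] -> M.Map String [Float]
transposeCSV' []            = M.empty
transposeCSV' (header:rows) =
    M.fromList (zip (map S.unpack header) spreadsheet)
  where
    -- parse every cell, then let the library do the transposition
    spreadsheet = L.transpose (map (map (read . S.unpack)) rows)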
On Tue, Feb 5, 2008 at 1:15 PM, Don Stewart wrote:
If the strings are relatively short, there can be a bottleneck in the current Ord instance for ByteStrings: when lots of them are in a map, the FFI calls to memcmp dominate.
I've a fix for this (do it all in Haskell for small strings), and can push it if someone complains some more.
jefferson.r.heard:
[snip]

Jefferson Heard wrote:
I thought this was fairly straightforward, but where the marked line finishes in 0.31 seconds on my machine, the actual transpose takes more than 5 minutes. I know it must be possible to read data in Haskell faster than this.
I took a look into this, writing a similar, but simpler, program. Half of the runtime, and 2/3 of the allocation, is spent in ByteString's split function. The remaining portions are spent in transposing the list.

COST CENTRE   %time   %alloc   ticks       bytes
split          66.7     65.1      56   120130000
xpose          31.0     32.8      26    60618031
read            1.2      2.0       1     3640229
lines           1.2      0.1       1      260002

I've attached two programs to demonstrate the problem. One creates a sample spreadsheet; the other transposes it.

I spent a little time trying to find a faster replacement for ByteString.split, but no luck before I had to return to other things.
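As an illustration of the kind of replacement Bryan describes (my sketch, not his attached code), a comma splitter can be written directly with S.elemIndex; whether it actually beats S.split would need measuring.

import qualified Data.ByteString.Char8 as S

-- Hypothetical hand-rolled comma splitter for benchmarking against
-- S.split ','. take and drop are O(1) slices, so this allocates one
-- ByteString header per field, much like S.split itself.
-- Edge case: on an empty input this yields [""], where S.split yields [].
splitComma :: S.ByteString -> [S.ByteString]
splitComma s = case S.elemIndex ',' s of
    Nothing -> [s]
    Just i  -> S.take i s : splitComma (S.drop (i + 1) s)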

Jefferson Heard wrote:
I thought this was fairly straightforward, but where the marked line finishes in 0.31 seconds on my machine, the actual transpose takes more than 5 minutes. I know it must be possible to read data in [snip]
dataFromFile :: String -> IO (M.Map String [S.ByteString])
dataFromFile filename = do
    f <- S.readFile filename
    print . length . map (S.split ',' $!) . S.lines $ f        -- finishes in 0.31 seconds
The S.split applications will never be evaluated - the list that you produce is full of thunks of the form

    (S.split ',' $! <some bytestring>)

The $! will only take effect if those thunks are forced, and length doesn't do that. Try

    print . sum . map (length . S.split ',') . S.lines $ f

instead, to force S.split to produce a result. (In fact, S.split is strict in its second argument, so the $! shouldn't have any effect on the running time at all. I didn't measure that, though.)
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f -- this takes 5 minutes and 6 seconds
HTH, Bertram
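Pulling Bertram's advice back into the original function gives something like the following sketch (my assembly, not code from the thread); the shared rows binding also avoids splitting the file twice.

-- dataFromFile with Bertram's counting fix applied; the split is done
-- once and shared between the count and the transpose. Assumes the
-- transposeCSV from earlier in the thread is in scope.
import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M

dataFromFile :: String -> IO (M.Map String [S.ByteString])
dataFromFile filename = do
    f <- S.readFile filename
    let rows = map (S.split ',') (S.lines f)
    print (sum (map length rows))  -- forces every S.split, not just the list spine
    return (transposeCSV rows)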
participants (4):

- Bertram Felgenhauer
- Bryan O'Sullivan
- Don Stewart
- Jefferson Heard