
I thought this was fairly straightforward, but where the marked line finishes in 0.31 seconds on my machine, the actual transpose takes more than 5 minutes. I know it must be possible to read data in Haskell faster than this. I'm trying to read a 100MB comma-delimited file. I've tried both CSV modules, and these take even longer to read the file. Is there some general best practice for reading and parsing large amounts of data that I'm not aware of?

I have tried a couple of things, by the way, including putting a bang (!) before row in transposeRow and using foldr instead of foldl', but that didn't change anything other than forcing me to increase the stack size to 100M on the command line.

I'm running in the profiler now, and I'll update this, but I thought I would check and see if my head was on remotely straight to begin with.

-- Jeff

---

module ColumnMajorCSV where

import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M
import qualified Data.List as L

transposeRow cols row = zipWith (:) row cols

transposeCSV :: [[S.ByteString]] -> M.Map String [S.ByteString]
transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
    where spreadsheet = L.foldl' transposeRow emptySheet rows
          emptySheet  = take (length header) $ repeat []

dataFromFile :: String -> IO (M.Map String [S.ByteString])
dataFromFile filename = do
    f <- S.readFile filename
    print . length . map (S.split ',' $!) . S.lines $ f        -- finishes in 0.31 seconds
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f -- this takes 5 minutes and 6 seconds

If the strings are relatively short, there can be a bottleneck in the current Ord instance for ByteStrings: when lots of them are in a map, the FFI calls to memcmp dominate. I've a fix for this (do it all in Haskell for small strings), and can push it if someone complains some more.

jefferson.r.heard:
[snip]
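For readers wondering what "do it all in Haskell for small strings" would look like, here is a minimal sketch of the idea: a byte-by-byte comparison loop instead of a call through the FFI. This is illustrative only, not Don's actual patch, and compareShort is a hypothetical name.

import qualified Data.ByteString.Char8 as S

-- Sketch of an all-Haskell Ord comparison for short ByteStrings,
-- avoiding the per-comparison FFI overhead of memcmp.
compareShort :: S.ByteString -> S.ByteString -> Ordering
compareShort a b = go 0
  where
    la = S.length a
    lb = S.length b
    n  = min la lb
    go i
      | i >= n    = compare la lb  -- equal prefix: shorter string sorts first
      | otherwise = case compare (S.index a i) (S.index b i) of
                      EQ    -> go (i + 1)
                      other -> other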

I've switched to this, which gets rid of the ByteString instances fairly quickly; none of them make it into the final map. I'm still not getting any faster response out of it, and it also complains that my stack size is too small for anything of about 128K records or more.
import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M
import qualified Data.List as L
transposeRow cols row = zipWith (:) (map (read . S.unpack) row) cols

transposeCSV :: [[S.ByteString]] -> M.Map String [Float]
transposeCSV (header:rows) = M.fromList (zip (map S.unpack header) spreadsheet)
    where spreadsheet = L.foldl' transposeRow emptySheet rows
          emptySheet  = take (length header) $ repeat []

dataFromFile :: String -> IO (M.Map String [Float])
dataFromFile filename = do
    f <- S.readFile filename
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f
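Not a suggestion from the thread, but worth sketching: the manual strict fold can be replaced by Data.List.transpose, which sidesteps the stack problem because the columns are only forced as they are consumed. transposeCSV' is a hypothetical variant name, assuming the same imports as above.

-- Sketch: transpose via Data.List.transpose instead of a strict left
-- fold. Note that, unlike the fold (which prepends rows), this keeps
-- each column in file order rather than reversed.
transposeCSV' :: [[S.ByteString]] -> M.Map String [Float]
transposeCSV' []            = M.empty
transposeCSV' (header:rows) =
    M.fromList (zip (map S.unpack header) spreadsheet)
  where
    -- parse every cell, then let the library do the transposition
    spreadsheet = L.transpose (map (map (read . S.unpack)) rows)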
On Tue, Feb 5, 2008 at 1:15 PM, Don Stewart wrote:
If the strings are relatively short, there can be a bottleneck in the current Ord instance for ByteStrings: when lots of them are in a map, the FFI calls to memcmp dominate.
I've a fix for this (do it all in Haskell for small strings), and can push it if someone complains some more.
jefferson.r.heard:
[snip]

Jefferson Heard wrote:
I thought this was fairly straightforward, but where the marked line finishes in 0.31 seconds on my machine, the actual transpose takes more than 5 minutes. I know it must be possible to read data in Haskell faster than this.
I took a look into this, writing a similar, but simpler, program. Half of the runtime, and 2/3 of the allocation, is spent in ByteString's split function. The remaining portions are spent in transposing the list.

COST CENTRE   %time   %alloc   ticks       bytes
split          66.7     65.1      56   120130000
xpose          31.0     32.8      26    60618031
read            1.2      2.0       1     3640229
lines           1.2      0.1       1      260002

I've attached two programs to demonstrate the problem. One creates a sample spreadsheet; the other transposes it.

I spent a little time trying to find a faster replacement for ByteString.split, but no luck before I had to return to other things.
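As an illustration of the kind of replacement Bryan describes (my sketch, not his attached code), a comma splitter can be written directly with S.elemIndex; whether it actually beats S.split would need measuring.

import qualified Data.ByteString.Char8 as S

-- Hypothetical hand-rolled comma splitter for benchmarking against
-- S.split ','. take and drop are O(1) slices, so this allocates one
-- ByteString header per field, much like S.split itself.
-- Edge case: on an empty input this yields [""], where S.split yields [].
splitComma :: S.ByteString -> [S.ByteString]
splitComma s = case S.elemIndex ',' s of
    Nothing -> [s]
    Just i  -> S.take i s : splitComma (S.drop (i + 1) s)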

Jefferson Heard wrote:
I thought this was fairly straightforward, but where the marked line finishes in 0.31 seconds on my machine, the actual transpose takes more than 5 minutes. I know it must be possible to read data in [snip]
dataFromFile :: String -> IO (M.Map String [S.ByteString])
dataFromFile filename = do
    f <- S.readFile filename
    print . length . map (S.split ',' $!) . S.lines $ f        -- finishes in 0.31 seconds
The S.split applications will never be evaluated - the list that you produce is full of thunks of the form

    (S.split ',' $! <some bytestring>)

The $! will only take effect if those thunks are forced, and length doesn't do that. Try

    print . sum . map (length . S.split ',') . S.lines $ f

instead, to force S.split to produce a result. (In fact, S.split is strict in its second argument, so the $! shouldn't have any effect on the running time at all. I didn't measure that, though.)
    return . transposeCSV . map (S.split ',' $!) . S.lines $ f -- this takes 5 minutes and 6 seconds
HTH, Bertram
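Pulling Bertram's advice back into the original function gives something like the following sketch (my assembly, not code from the thread); the shared rows binding also avoids splitting the file twice.

-- dataFromFile with Bertram's counting fix applied; the split is done
-- once and shared between the count and the transpose. Assumes the
-- transposeCSV from earlier in the thread is in scope.
import qualified Data.ByteString.Char8 as S
import qualified Data.Map as M

dataFromFile :: String -> IO (M.Map String [S.ByteString])
dataFromFile filename = do
    f <- S.readFile filename
    let rows = map (S.split ',') (S.lines f)
    print (sum (map length rows))  -- forces every S.split, not just the list spine
    return (transposeCSV rows)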
participants (4):

- Bertram Felgenhauer
- Bryan O'Sullivan
- Don Stewart
- Jefferson Heard