
I have been trying to do some CSV-style processing. My code works fine for small input (up to 10MB), but performs poorly for moderate to large input (it can't seem to finish 100MB of input with 700MB heap space). I have gone through several optimization passes with profiler help, and now I am hoping someone else can point out some other approaches to improving the code's performance (both space and time).
The code breaks a large file into smaller files all of whose entries have the same date.
First of all, for this problem and 100 MB of input, you have to think carefully about what you are doing. I'll point out three quirks in your code and afterwards discuss what a better solution looks like.
module Main where
import Debug.Trace
import Control.Monad
import Data.List
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as M
import System.Environment (getArgs)
dataDir = "dataH/"
myRead file = do
    v <- B.readFile file
    let (cols' : rows) = map (B.split ',') $ B.lines v
    let cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty (zip cols' [0 ..])
    return (cols, rows)
getColId cols col = M.lookup col cols
getCol cols col row = do
    i <- getColId cols col
    return $! row !! i
dates file nRows = do
    (cols, rows) <- myRead file
    let addDate mp row
            | mp `seq` row `seq` False = undefined
            | otherwise = do
When using addDate in foldM like below, you certainly don't want to search cols for the string "Date" again and again, every time addDate is called. The index of the "Date" field is a number that is determined once, when parsing the header, and that number alone has to be plugged in here. Thus the next line should read let date = row !! datefieldindex (see the sketch after this fragment) instead of
                date <- getCol cols (B.pack "\"Date\"") row
                let old = M.findWithDefault [] date mp
                return $ M.insert date (row:old) mp
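Spelled out, the change could look like this. It is only a sketch of the grouping step (dateFieldIndex is my name, and writing the files stays as before); note that addDate no longer needs to be monadic at all:

dates file nRows = do
    (cols, rows) <- myRead file
    -- resolve the "Date" column once, before the fold
    let dateFieldIndex =
            maybe (error "no \"Date\" column") id
                  (M.lookup (B.pack "\"Date\"") cols)
        addDate mp row =
            M.insertWith (++) (row !! dateFieldIndex) [row] mp
    return $ foldl' addDate M.empty (take nRows rows)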
The main thing in the code that makes me feel very very ill is the fact that the code is quite "impure" (many many dos). The next line promptly bites back:
    res <- foldM addDate M.empty $ take nRows rows
Did you notice that this call to addDate forces its callee getCol to live in the IO monad? From the use of M.lookup in getColId, I think you intended getCol :: _ -> Maybe _, didn't you? M.lookup recently got the more general type M.lookup :: Monad m => _ -> m a, so it happily lives in IO, too. I strongly suggest that you restructure your code and restrict IO to one place only:

main = do
    ..
    input <- B.readFile file
    let outs = busywork input
    sequence_ [writeFile name contents | (name,contents) <- outs]

where busywork does the work and is purely functional.
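For concreteness, a minimal sketch of that shape; the argument handling and the grouping inside busywork are deliberately naive (the date is taken to be the first comma-separated field) and only illustrate the structure:

module Main where

import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as M
import System.Environment (getArgs)

-- all IO happens here; busywork is purely functional
main :: IO ()
main = do
    [file] <- getArgs
    input  <- B.readFile file
    sequence_ [ B.writeFile name contents
              | (name, contents) <- busywork input ]

-- from the whole input, compute (file name, file contents) pairs,
-- grouping lines by their first comma-separated field
busywork :: B.ByteString -> [(FilePath, B.ByteString)]
busywork input =
    [ ("dataH/" ++ B.unpack date, B.unlines ls)
    | (date, ls) <- M.toList byDate ]
  where
    byDate = M.fromListWith (flip (++))
                 [ (B.takeWhile (/= ',') line, [line])
                 | line <- B.lines input ]

This still keeps all output in memory, of course; it only demonstrates the separation of concerns. More on the space problem below.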
    mapM_ writeDate $ M.toList res
  where
    fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x)
        . B.takeWhile (/= ' ')
    writeDate (date, rows) =
        B.writeFile (dataDir ++ fmt date)
The following line does unnecessary work: myRead splits a row to get access to the date, but now you join it back together without having changed any field. It would be wiser to split out the date but keep an intact copy of the whole line, so that you can pass it here without a join (a sketch follows below). This will reduce the memory footprint.
            (B.unlines $ map (B.join (B.pack ",")) rows)
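Sketched out (tagWithDate and dateFieldIndex are names I made up):

-- keep the original, unsplit line around; split only far enough
-- to extract the date field that decides the output file
tagWithDate :: Int -> B.ByteString -> (B.ByteString, B.ByteString)
tagWithDate dateFieldIndex line = (B.split ',' line !! dateFieldIndex, line)

writeDate can then output the stored lines with a plain B.unlines; no join is needed.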
main = do
    args <- getArgs
    case args of
        ["dates", file, nRows] -> dates file (read nRows)
To summarize, the code is not very clean and several things slipped in, just as one would expect from an imperative style. The key is to separate concerns, which here means: IO does just dumb input and output, fetching the index of the "Date" field from the header is handled separately, and grouping the lines by date is separated from the to-be-output contents of the lines.

Now, let's think about how to solve the task in reasonable time and space. Your current solution reads the input and "calculates" all output files before writing them to disk in a final step. This means that the contents of the output files have to be kept in memory, so you need at least a constant * 100 MB of memory. I don't know how ByteString interacts with garbage collection, but it may well be that by keeping the first line (your cols) in memory, the entire input file contents are kept alive as well, which would mean an additional constant * 100 MB. It is likely that both can be shared if one resolves the code quirks mentioned above.

A better solution would be to begin output before the whole input is read, thus making things lazier. This can be done the following way: from the input, construct a lazy list of (date, line) pairs. Then let foldM thread a map from dates to the corresponding output file pointers through that list and, at the same time, use the file pointers to output each line via appendFile. This way, every line consumed is immediately dispatched to its output file, and the whole run should only require memory for the different dates, besides buffering. (A rough sketch follows in the postscript.)

In a setting without IO, the task corresponds to the "Optimization Problem" discussed at length on this list in September. The problem here is that writeFile currently cannot be interleaved lazily; this has to be simulated with appendFile. We can read files lazily, but we cannot output them lazily. Can this be remedied? Can there be a version of writeFile which is, in a sense, dual to getContents?

Regards,
apfelmus
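P.S. A rough, untested sketch of that dispatch loop, with a map of open handles playing the role of the file pointers. Taking the date from the first comma-separated field and the raw dataH/ file names are simplifications of mine; with very many distinct dates one would have to close handles and reopen them in AppendMode instead, i.e. the appendFile simulation:

module Main where

import Control.Monad (foldM)
import qualified Data.ByteString.Lazy.Char8 as B
import qualified Data.Map as M
import System.Environment (getArgs)
import System.IO

main :: IO ()
main = do
    [file] <- getArgs
    input  <- B.readFile file
    -- lazy list of (date, original line) pairs; the date is naively
    -- taken to be the first comma-separated field
    let pairs = [ (B.takeWhile (/= ',') line, line) | line <- B.lines input ]
    -- thread a map from dates to open output handles through the list;
    -- each line is written out as soon as it is consumed
    handles <- foldM dispatch M.empty pairs
    mapM_ hClose (M.elems handles)

dispatch :: M.Map B.ByteString Handle -> (B.ByteString, B.ByteString)
         -> IO (M.Map B.ByteString Handle)
dispatch handles (date, line) = do
    (h, handles') <- case M.lookup date handles of
        Just h  -> return (h, handles)
        Nothing -> do
            h <- openFile ("dataH/" ++ B.unpack date) WriteMode
            return (h, M.insert date h handles)
    B.hPut h line
    B.hPut h (B.pack "\n")
    return handles'

Memory use is now proportional to the number of distinct dates plus buffering, not to the size of the input.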