
A better solution would be to begin output before the whole input is read, thus making things lazier. This can be done the following way: from the input, construct a lazy list of (date, line) pairs. Then, let foldM thread a map from dates to the corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file, and things should only require memory for the different dates, besides buffering.
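A minimal sketch of this idea (untested; dispatch, dateOf and fileFor are hypothetical names, B is Data.ByteString.Char8, M is Data.Map); it keeps the handles open in the map and writes through them:

import Control.Monad (foldM)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import System.IO

-- Thread a map from dates to open handles through the lines with foldM,
-- writing each line to its date's file as soon as it is consumed.
dispatch :: (B.ByteString -> B.ByteString)  -- extract the date from a line
         -> (B.ByteString -> FilePath)      -- map a date to its output file
         -> [B.ByteString]                  -- the lines
         -> IO ()
dispatch dateOf fileFor lns = do
    handles <- foldM step M.empty lns
    mapM_ hClose (M.elems handles)
  where
    step hs line = do
        let date = dateOf line
        h <- case M.lookup date hs of
               Just h  -> return h
               Nothing -> openFile (fileFor date) WriteMode
        B.hPutStrLn h line
        return (M.insert date h hs)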
I tried this approach previously and it seemed unacceptably slow. I assumed the slowness was simply because file operations are slow, but I'll include my code here (cleaned up to take some of your previous comments into account) in case there is something subtle I'm missing that is slowing it down (B, M, and myRead are as above):
dates' file nRows = do
    (cols, rows) <- myRead file
    dateIx <- M.lookup cols $ B.pack "\"Date\""
    let writeDate row = B.appendFile (dataDir ++ fmt date) row
          where
            date = (B.split ',' row) !! dateIx
            fmt  = B.unpack . B.map (\x -> if x == '-' then '_' else x)
                            . B.takeWhile (/= ' ')
    oldFiles <- getDirectoryContents dataDir
    mapM_ (\f -> catch (removeFile $ dataDir ++ f) $ const $ return ()) oldFiles
    mapM_ writeDate $ take nRows rows
This code takes over 20 minutes to process 100MB on my machine.
No wonder, as this opens and closes the file on every row. The operating system will be kept quite busy that way! In some sense, you are outsourcing the row-collecting M.Map to the OS... Of course, you want to open the files once and dispatch the rows to the different open handles. Here is a version (untested) which either does the read-all-then-write approach (group'n'write) or keeps the output files open simultaneously (group'n'write2). Note also that there is no need for an M.Map to find the "Date" keyword in the CSV header (it even hurts performance), though the effect is completely negligible.

import Control.Monad (foldM)
import Data.List (foldl')
import qualified Data.List
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)
import System.IO

main = do
    args <- getArgs
    case args of
        ["dates", file, nRows] -> dates file (read nRows)

dates file nRows =
    B.readFile file >>= group'n'write . sugarWithDates . take nRows . B.lines

sugarWithDates (header:rows) = map (\r -> ((B.split ',' r) !! dateIx, r)) rows
  where
    Just dateIx = Data.List.lookup (B.pack "\"Date\"") $ zip (B.split ',' header) [0..]

formatDate = B.unpack
           . B.map (\x -> if x == '-' then '_' else x)
           . B.takeWhile (/= ' ')

date2filename = (dataDir ++) . formatDate

-- read everything, group the rows per date in a Map, then write each file
group'n'write = mapM_ writeDate . M.toList . foldl' addDate M.empty
  where
    addDate mp (date, row) = M.insertWith (\_ old -> row : old) date [row] mp
    writeDate (date, rows) = B.writeFile (date2filename date) $ B.unlines (reverse rows)

-- open each output file on first sight and dispatch rows to the open handles
group'n'write2 xs = foldM addDate M.empty xs >>= mapM_ hClose . M.elems
  where
    addDate mp (date, row) = do
        (fp, mp') <- case M.lookup date mp of
            Just fp -> return (fp, mp)
            Nothing -> do
                fp <- openFile (date2filename date) WriteMode
                return (fp, M.insert date fp mp)
        B.hPutStrLn fp row   -- B.lines stripped the newline, so put it back
        return mp'

The thing that bugs me is that one cannot separate group'n'write2 = write2 . group, where group is a pure function. I think some kind of lazy writeFile could allow this.
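For comparison, the strict variant does separate cleanly into a pure grouping step and a writing step; a small sketch (untested; the names group and write are mine, with B, M and date2filename as above):

group :: [(B.ByteString, B.ByteString)] -> M.Map B.ByteString [B.ByteString]
group = M.map reverse . M.fromListWith (++) . map (\(date, row) -> (date, [row]))
        -- fromListWith prepends later rows, so reverse restores the input order

write :: M.Map B.ByteString [B.ByteString] -> IO ()
write = mapM_ (\(date, rows) -> B.writeFile (date2filename date) (B.unlines rows))
      . M.toList

-- group'n'write behaves like  write . group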
thanks for your help,

No problem. :)
Regards, apfelmus