
A better solution would be to begin output before the whole input is read, thus making things lazier. This can be done the following way: from the input, construct a lazy list of (date, line) pairs. Then, let foldM thread a map from dates to the corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file, and things should only require memory for the different dates, besides buffering.
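A minimal sketch of this idea (untested; dispatch, dateOf and fileFor are hypothetical names, B is Data.ByteString.Char8, M is Data.Map); it keeps the handles open in the map and writes through them:

import Control.Monad (foldM)
import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M
import System.IO

-- Thread a map from dates to open handles through the lines with foldM,
-- writing each line to its date's file as soon as it is consumed.
dispatch :: (B.ByteString -> B.ByteString)  -- extract the date from a line
         -> (B.ByteString -> FilePath)      -- map a date to its output file
         -> [B.ByteString]                  -- the lines
         -> IO ()
dispatch dateOf fileFor lns = do
    handles <- foldM step M.empty lns
    mapM_ hClose (M.elems handles)
  where
    step hs line = do
        let date = dateOf line
        h <- case M.lookup date hs of
               Just h  -> return h
               Nothing -> openFile (fileFor date) WriteMode
        B.hPutStrLn h line
        return (M.insert date h hs)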
I tried this approach previously and it seemed unacceptably slow. I assumed the slowness was simply because file operations are slow, but I'll include my code here (cleaned up to take some of your previous comments into account) in case there is something subtle I'm missing that is slowing it down (B, M, and myRead are as above):
dates' file nRows = do
    (cols, rows) <- myRead file
    dateIx <- M.lookup cols $ B.pack "\"Date\""
    let writeDate row = B.appendFile (dataDir ++ fmt date) row
          where
            date = (B.split ',' row) !! dateIx
            fmt  = B.unpack . B.map (\x -> if x == '-' then '_' else x)
                            . B.takeWhile (/= ' ')
    oldFiles <- getDirectoryContents dataDir
    mapM_ (\f -> catch (removeFile $ dataDir ++ f) $ const $ return ()) oldFiles
    mapM_ writeDate $ take nRows rows
This code takes over 20 minutes to process 100MB on my machine.
No wonder, as this opens and closes the file on every row. The operating system will be kept quite busy that way! In some sense, you are outsourcing the row-collecting M.Map to the OS... Of course, you want to open the files once and dispatch the rows to the different open handles. Here is a version (untested) which either does the read-all-then-write approach (group'n'write) or keeps the output files open simultaneously (group'n'write2). Note also that there is no need for an M.Map to find the "Date" keyword in the CSV header (it even hurts performance), though the effect is completely negligible.

import Control.Monad (foldM)
import Data.List (foldl')
import qualified Data.List
import qualified Data.Map as M
import qualified Data.ByteString.Char8 as B
import System.Environment (getArgs)
import System.IO

main = do
    args <- getArgs
    case args of
        ["dates", file, nRows] -> dates file (read nRows)

dates file nRows =
    B.readFile file >>= group'n'write . sugarWithDates . take nRows . B.lines

sugarWithDates (header:rows) = map (\r -> ((B.split ',' r) !! dateIx, r)) rows
  where
    Just dateIx = Data.List.lookup (B.pack "\"Date\"") $ zip (B.split ',' header) [0..]

formatDate = B.unpack
           . B.map (\x -> if x == '-' then '_' else x)
           . B.takeWhile (/= ' ')

date2filename = (dataDir ++) . formatDate

-- read everything, group the rows per date in a Map, then write each file
group'n'write = mapM_ writeDate . M.toList . foldl' addDate M.empty
  where
    addDate mp (date, row) = M.insertWith (\_ old -> row : old) date [row] mp
    writeDate (date, rows) = B.writeFile (date2filename date) $ B.unlines (reverse rows)

-- open each output file on first sight and dispatch rows to the open handles
group'n'write2 xs = foldM addDate M.empty xs >>= mapM_ hClose . M.elems
  where
    addDate mp (date, row) = do
        (fp, mp') <- case M.lookup date mp of
            Just fp -> return (fp, mp)
            Nothing -> do
                fp <- openFile (date2filename date) WriteMode
                return (fp, M.insert date fp mp)
        B.hPutStrLn fp row   -- B.lines stripped the newline, so put it back
        return mp'

The thing that bugs me is that one cannot separate group'n'write2 = write2 . group, where group is a pure function. I think some kind of lazy writeFile could allow this.
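For comparison, the strict variant does separate cleanly into a pure grouping step and a writing step; a small sketch (untested; the names group and write are mine, with B, M and date2filename as above):

group :: [(B.ByteString, B.ByteString)] -> M.Map B.ByteString [B.ByteString]
group = M.map reverse . M.fromListWith (++) . map (\(date, row) -> (date, [row]))
        -- fromListWith prepends later rows, so reverse restores the input order

write :: M.Map B.ByteString [B.ByteString] -> IO ()
write = mapM_ (\(date, rows) -> B.writeFile (date2filename date) (B.unlines rows))
      . M.toList

-- group'n'write behaves like  write . group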
thanks for your help,

No problem. :)
Regards, apfelmus