
Hello,
When using addDate in foldM like below, you certainly don't want to search the cols for the string "Date" again and again, every time addDate is called. The index of the "Date" field is a number determined once, when parsing the header; that and only that number has to be plugged in here.
Good catch.
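For concreteness, a minimal sketch of computing that index a single time from the header line (standalone, with hypothetical names; B is Data.ByteString.Lazy.Char8 as in the code below):

    import Data.List (elemIndex)
    import qualified Data.ByteString.Lazy.Char8 as B

    -- Locate the "Date" column once, at header-parse time; every
    -- later row access then uses this number directly.
    dateIndex :: B.ByteString -> Maybe Int
    dateIndex header = elemIndex (B.pack "\"Date\"") (B.split ',' header)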
The main thing in the code that makes me feel very, very ill is the fact that the code is quite "impure" (many, many dos). The next line promptly bites back:
res <- foldM addDate M.empty $ take nRows rows
Did you notice that this call to addDate makes its callee getCol live in the IO monad? From the use of M.lookup in getColId, I think you intended getCol :: _ -> Maybe _, didn't you? M.lookup recently got the more general type M.lookup :: Monad m => _ -> m a, so it happily lives in IO.
I was aware of this and counted on the lookup causing the program to stop if the column didn't exist.
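To spell out that behaviour (a small sketch with a hypothetical map, relying on the Monad-general M.lookup mentioned above):

    import qualified Data.ByteString.Lazy.Char8 as B
    import qualified Data.Map as M

    mp :: M.Map B.ByteString Int
    mp = M.fromList [(B.pack "\"Date\"", 0)]

    asMaybe :: Maybe Int
    asMaybe = M.lookup (B.pack "\"Open\"") mp   -- Nothing: key is missing

    inIO :: IO Int
    inIO = M.lookup (B.pack "\"Open\"") mp      -- 'fail' in IO: program aborts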
The following line does unnecessary work: myRead splits a row to get access to the date, but now you join it again without having changed any field. It would be wiser to split out the date but keep an intact copy of the line, so that you can pass it here without a join. This will reduce the memory footprint.
Another good observation which I missed.
Your current solution reads the input and "calculates" all output files before writing them to disk in a final step. This means that the contents of the output files have to be kept in memory, so you need at least a constant * 100MB of memory. I don't know how ByteString interacts with garbage collection, but it may well be that keeping the first line (your "cols") in memory also keeps the entire input file contents alive, which means an additional constant * 100MB. It is likely that both can share the same memory once the code quirks mentioned above are resolved.
I intentionally chose this design to minimize the amount of file access, which seems to be quite slow (see below). After fixing the two slips you pointed out, my code works as expected, processing 100MB in about 1 minute using around 550MB of heap. Here is the good version (where B is Data.ByteString.Lazy.Char8 and M is Data.Map):

    myRead file = do
        v <- B.readFile file
        let (cols' : rows) = B.lines v
            cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty $
                       zip (B.split ',' cols') [0 ..]
        return (cols, rows)

    dates file nRows = do
        (cols, rows) <- myRead file
        dateIx <- M.lookup (B.pack "\"Date\"") cols
        let addDate mp row = M.insert date (row:old) mp
                where date = (B.split ',' row) !! dateIx
                      old  = M.findWithDefault [] date mp
            res = foldl addDate M.empty $ take nRows rows
        mapM_ writeDate $ M.toList res
        where
          fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x)
                         . B.takeWhile (/= ' ')
          writeDate (date, rows) = B.writeFile (dataDir ++ fmt date) (B.unlines rows)
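If the retention-via-sharing worry raised above turns out to be real, one cheap hedge is to copy the header fields into fresh storage so that the column map cannot pin the chunks of the input file. A sketch (unmeasured; B.copy is the O(n) un-sharing copy from Data.ByteString.Lazy.Char8):

    myRead file = do
        v <- B.readFile file
        let (cols' : rows) = B.lines v
            -- copy each header field so 'cols' holds no substring of 'v'
            keys = map B.copy (B.split ',' cols')
            cols = M.fromList (zip keys [0 ..])
        return (cols, rows)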
A better solution would be to begin output before the whole input is read, thus making things lazier. This can be done the following way: from the input, construct a lazy list of (date, line) pairs. Then let foldM thread a map from dates to the corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file, and things should only require memory for the different dates, besides buffering.
I tried this approach previously and it seems to be unacceptably slow. I thought the slowness was just due to the fact that file operations are slow, but I'll include my code here (cleaned up to take some of your previous comments into account) just in case there is something subtle I'm missing which is slowing down the code (B, M, and myRead are as above):

    dates' file nRows = do
        (cols, rows) <- myRead file
        dateIx <- M.lookup (B.pack "\"Date\"") cols
        let writeDate row = B.appendFile (dataDir ++ fmt date) (row `B.snoc` '\n')
                where date = (B.split ',' row) !! dateIx
                      fmt  = B.unpack . B.map (\x -> if x == '-' then '_' else x)
                                      . B.takeWhile (/= ' ')
        oldFiles <- getDirectoryContents dataDir
        mapM_ (\f -> catch (removeFile $ dataDir ++ f) $ const $ return ()) oldFiles
        mapM_ writeDate $ take nRows rows

This code takes over 20 minutes to process 100MB on my machine.
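For what it's worth, the appendFile in that loop opens and closes the output file once per line, which is a plausible cause of the slowdown, rather than file I/O as such. A sketch of the handle-threading variant the comment above describes, which opens each output file only once (untested; myRead, dataDir and fmt assumed available as defined above):

    import qualified Data.ByteString.Lazy.Char8 as B
    import qualified Data.Map as M
    import Control.Monad (foldM)
    import System.IO (IOMode(WriteMode), openFile, hClose)

    dates'' file nRows = do
        (cols, rows) <- myRead file
        dateIx <- M.lookup (B.pack "\"Date\"") cols
        let step handles row = do
                let date = (B.split ',' row) !! dateIx
                h <- case M.lookup date handles of
                       Just h  -> return h                            -- already open
                       Nothing -> openFile (dataDir ++ fmt date) WriteMode
                B.hPut h (row `B.snoc` '\n')
                return (M.insert date h handles)
        handles <- foldM step M.empty (take nRows rows)
        mapM_ hClose (M.elems handles)

One caveat: this keeps one file descriptor open per distinct date, so with very many dates a bounded handle cache would be needed; for a few hundred dates it should be fine.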
In a setting without IO, the task corresponds to the "Optimization Problem" discussed at length in September on this list. The problem here is that writeFile currently cannot be interleaved lazily; this has to be simulated with appendFile. We can read files lazily, but we cannot output them lazily. Can this be remedied? Can there be a version of writeFile which is, in a sense, dual to getContents?
Wouldn't this require blocking IO?

Thanks for your help,
Jeff