
Hello,
When using addDate in foldM like below, you certainly don't want to search the cols for the string "Date" again and again, every time addDate is called. The index of the "Date" field is a number determined once, when parsing the header; that and only that number has to be plugged in here.
Good catch.
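For concreteness, a minimal sketch of computing that index a single time from the header line (standalone, with hypothetical names; B is Data.ByteString.Lazy.Char8 as in the code below):

    import Data.List (elemIndex)
    import qualified Data.ByteString.Lazy.Char8 as B

    -- Locate the "Date" column once, at header-parse time; every
    -- later row access then uses this number directly.
    dateIndex :: B.ByteString -> Maybe Int
    dateIndex header = elemIndex (B.pack "\"Date\"") (B.split ',' header)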
The main thing in the code that makes me feel very, very ill is the fact that the code is quite "impure" (many, many dos). The next line promptly bites back:
res <- foldM addDate M.empty $ take nRows rows
Did you notice that this call to addDate makes its callee getCol live in the IO monad? From the use of M.lookup in getColId, I think you intended getCol :: _ -> Maybe _, didn't you? M.lookup recently got the more general type M.lookup :: Monad m => _ -> m a, so it happily lives in IO.
I was aware of this and counted on the lookup causing the program to stop if the column didn't exist.
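To spell out that behaviour (a small sketch with a hypothetical map, relying on the Monad-general M.lookup mentioned above):

    import qualified Data.ByteString.Lazy.Char8 as B
    import qualified Data.Map as M

    mp :: M.Map B.ByteString Int
    mp = M.fromList [(B.pack "\"Date\"", 0)]

    asMaybe :: Maybe Int
    asMaybe = M.lookup (B.pack "\"Open\"") mp   -- Nothing: key is missing

    inIO :: IO Int
    inIO = M.lookup (B.pack "\"Open\"") mp      -- 'fail' in IO: program aborts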
The following line does unnecessary work: myRead splits a row to get access to the date, but now you join it again without having changed any field. It would be wiser to split out the date but keep an intact copy of the line, so that you can pass it here without a join. This will reduce the memory footprint.
Another good observation which I missed.
Your current solution reads the input and "calculates" all output files before writing them to disk in a final step. This means that the contents of the output files have to be kept in memory, so you need at least a constant * 100MB of memory. I don't know how ByteString interacts with garbage collection, but it may well be that keeping the first line (your "cols") in memory also keeps the entire input file contents alive, which means an additional constant * 100MB. It is likely that both can share the same memory once the code quirks mentioned above are resolved.
I intentionally chose this design to minimize the amount of file access, which seems to be quite slow (see below). After fixing the two slips you pointed out, my code works as expected, processing 100MB in about 1 minute using around 550MB of heap. Here is the good version (where B is Data.ByteString.Lazy.Char8 and M is Data.Map):

    myRead file = do
        v <- B.readFile file
        let (cols' : rows) = B.lines v
            cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty $
                       zip (B.split ',' cols') [0 ..]
        return (cols, rows)

    dates file nRows = do
        (cols, rows) <- myRead file
        dateIx <- M.lookup (B.pack "\"Date\"") cols
        let addDate mp row = M.insert date (row:old) mp
                where date = (B.split ',' row) !! dateIx
                      old  = M.findWithDefault [] date mp
            res = foldl addDate M.empty $ take nRows rows
        mapM_ writeDate $ M.toList res
        where
          fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x)
                         . B.takeWhile (/= ' ')
          writeDate (date, rows) = B.writeFile (dataDir ++ fmt date) (B.unlines rows)
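If the retention-via-sharing worry raised above turns out to be real, one cheap hedge is to copy the header fields into fresh storage so that the column map cannot pin the chunks of the input file. A sketch (unmeasured; B.copy is the O(n) un-sharing copy from Data.ByteString.Lazy.Char8):

    myRead file = do
        v <- B.readFile file
        let (cols' : rows) = B.lines v
            -- copy each header field so 'cols' holds no substring of 'v'
            keys = map B.copy (B.split ',' cols')
            cols = M.fromList (zip keys [0 ..])
        return (cols, rows)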
A better solution would be to begin output before the whole input is read, thus making things lazier. This can be done the following way: from the input, construct a lazy list of (date, line) pairs. Then let foldM thread a map from dates to the corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file, and things should only require memory for the different dates, besides buffering.
I tried this approach previously and it seems to be unacceptably slow. I thought the slowness was just due to the fact that file operations are slow, but I'll include my code here (cleaned up to take some of your previous comments into account) just in case there is something subtle I'm missing which is slowing down the code (B, M, and myRead are as above):

    dates' file nRows = do
        (cols, rows) <- myRead file
        dateIx <- M.lookup (B.pack "\"Date\"") cols
        let writeDate row = B.appendFile (dataDir ++ fmt date) (row `B.snoc` '\n')
                where date = (B.split ',' row) !! dateIx
                      fmt  = B.unpack . B.map (\x -> if x == '-' then '_' else x)
                                      . B.takeWhile (/= ' ')
        oldFiles <- getDirectoryContents dataDir
        mapM_ (\f -> catch (removeFile $ dataDir ++ f) $ const $ return ()) oldFiles
        mapM_ writeDate $ take nRows rows

This code takes over 20 minutes to process 100MB on my machine.
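For what it's worth, the appendFile in that loop opens and closes the output file once per line, which is a plausible cause of the slowdown, rather than file I/O as such. A sketch of the handle-threading variant the comment above describes, which opens each output file only once (untested; myRead, dataDir and fmt assumed available as defined above):

    import qualified Data.ByteString.Lazy.Char8 as B
    import qualified Data.Map as M
    import Control.Monad (foldM)
    import System.IO (IOMode(WriteMode), openFile, hClose)

    dates'' file nRows = do
        (cols, rows) <- myRead file
        dateIx <- M.lookup (B.pack "\"Date\"") cols
        let step handles row = do
                let date = (B.split ',' row) !! dateIx
                h <- case M.lookup date handles of
                       Just h  -> return h                            -- already open
                       Nothing -> openFile (dataDir ++ fmt date) WriteMode
                B.hPut h (row `B.snoc` '\n')
                return (M.insert date h handles)
        handles <- foldM step M.empty (take nRows rows)
        mapM_ hClose (M.elems handles)

One caveat: this keeps one file descriptor open per distinct date, so with very many dates a bounded handle cache would be needed; for a few hundred dates it should be fine.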
In a setting without IO, the task corresponds to the "Optimization Problem" discussed at length in September on this list. The problem here is that writeFile currently cannot be interleaved lazily; this has to be simulated with appendFile. We can read files lazily, but we cannot output them lazily. Can this be remedied? Can there be a version of writeFile which is, in a sense, dual to getContents?
Wouldn't this require blocking IO?

Thanks for your help,
Jeff