
Hello, I have been trying to do some CSV-style processing. My code works fine for small input (up to 10MB), but performs poorly for moderate to large input (it can't seem to finish 100MB of input with 700MB heap space). I have gone through several optimization passes with profiler help, and now I am hoping someone else can point out some other approaches to improving the code's performance (both space and time). The code breaks a large file into smaller files all of whose entries have the same date. thanks, Jeff ------------------------ module Main where import Debug.Trace import Control.Monad import Data.List import qualified Data.ByteString.Lazy.Char8 as B import qualified Data.Map as M import System.Environment (getArgs) dataDir = "dataH/" myRead file = do v <- B.readFile file let (cols' : rows) = map (B.split ',') $ B.lines v let cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty (zip cols' [0 ..]) return (cols, rows) getColId cols col = M.lookup col cols getCol cols col row = do i <- getColId cols col return $! row!!i dates file nRows = do (cols, rows) <- myRead file let addDate mp row | mp `seq` row `seq` False = undefined | otherwise = do date <- getCol cols (B.pack "\"Date\"") row let old = M.findWithDefault [] date mp return $ M.insert date (row:old) mp res <- foldM addDate M.empty $ take nRows rows mapM_ writeDate $ M.toList res where fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ') writeDate (date,rows) = B.writeFile (dataDir++fmt date) (B.unlines $ map (B.join (B.pack ",")) rows) main = do args <- getArgs case args of ["dates",file,nRows] -> dates file (read nRows)

I have been trying to do some CSV-style processing. My code works fine for small input (up to 10MB), but performs poorly for moderate to large input (it can't seem to finish 100MB of input with 700MB heap space). I have gone through several optimization passes with profiler help, and now I am hoping someone else can point out some other approaches to improving the code's performance (both space and time).
The code breaks a large file into smaller files all of whose entries have the same date.
First of all, for this problem and 100 MB input, you have to think carefully about what you do. I'll point out three quirks in your code and afterwards discuss how a better solution looks like.
module Main where
import Debug.Trace import Control.Monad import Data.List import qualified Data.ByteString.Lazy.Char8 as B import qualified Data.Map as M import System.Environment (getArgs)
dataDir = "dataH/"
myRead file = do v <- B.readFile file let (cols' : rows) = map (B.split ',') $ B.lines v let cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty (zip cols' [0 ..]) return (cols, rows)
getColId cols col = M.lookup col cols
getCol cols col row = do i <- getColId cols col return $! row!!i
dates file nRows = do (cols, rows) <- myRead file let addDate mp row | mp `seq` row `seq` False = undefined | otherwise = do
When using addDate in foldM like below, you certainly don't want to search the cols for the string "Date" again and again everytime addDate is called. The index of the "Date" field is a number determined when parsing the header. That and only that number has to be plugged in here. Thus the next line should read let date = row !! datefieldindex instead of
date <- getCol cols (B.pack "\"Date\"") row
let old = M.findWithDefault [] date mp return $ M.insert date (row:old) mp
The main thing in the code that makes me feel very very ill is the fact that the code is quite "impure" (many many dos). The next line promptly bites back:
res <- foldM addDate M.empty $ take nRows rows
Did you notice this appeal to addDate makes its callee getCol live in the IO-Monad? From the use of M.lookup in getColId, I think you intended to have getCol :: _ -> Maybe _, do you? M.lookup recently got the more general type M.lookup :: Monad m => _ -> m a, so it happily lives in IO. I strongly suggest that you restructure your code and restrict IO to one place only: main = do .. input <- B.readFile file let outs = busywork input mapM_ [writeFile name contents | (name,contents) <- outs] where busywork does the work and is purely functional.
mapM_ writeDate $ M.toList res where fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ') writeDate (date,rows) = B.writeFile (dataDir++fmt date)
The following line does unnecessary work: myRead splits a row to get access to the date, but now you join it without having changed any field. It would be wiser to split for the date but to keep an intact copy of the line so that you can pass it here without join. This will reduce memory footprint.
(B.unlines $ map (B.join (B.pack ",")) rows)
main = do args <- getArgs case args of ["dates",file,nRows] -> dates file (read nRows)
To summarize, the code is not very clean and several things slipped in, just as one would expect from an imperative style. The key is to separate concerns, which means here: IO will just do very dumb in and output, fetching the index of the "Date" from the header is handled separately, grouping the lines by date is to be separated from the to-be-output-contents of the lines. Now, we'll think about how to solve the task in reasonable time and space. Your current solutions reads the input and "calculates" all output files before writing them to disk in a final step. This means that the contents of the output files has to be kept in memory. Thus you need least a constant * 100MB of memory. I don't know how ByteString interacts with garbage collection, but it may well be that by keeping the first line (you "cols") in memory, the entire input file contents is also kept which means an additional constant * 100 MB. It is likely that both can be shared if one resolves the code quirks mentioned above. A better solution would be to begin output before the the whole input is read, thus making things more lazy. This can be done the following way: from the input, construct a lazy list of (date,line) pairs. Then, let foldM thread a map from dates to corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file and things should only require memory for the different dates, besides buffering. In a setting without IO, the task corresponds to the "Optimization Problem" discussed at length in September on this list. The problem here is that writeFile currently cannot be interleaved lazily, this has to be simulated with appendFile. We can read files lazily but we cannot output them lazily. Can this be remedied? Can there be a version of writeFile which is, in a sense, dual to getContents? Regards, apfelmus

Hello,
When using addDate in foldM like below, you certainly don't want to search the cols for the string "Date" again and again everytime addDate is called. The index of the "Date" field is a number determined when parsing the header. That and only that number has to be plugged in here.
Good catch.
The main thing in the code that makes me feel very very ill is the fact that the code is quite "impure" (many many dos). The next line promptly bites back:
res <- foldM addDate M.empty $ take nRows rows
Did you notice this appeal to addDate makes its callee getCol live in the IO-Monad? From the use of M.lookup in getColId, I think you intended to have getCol :: _ -> Maybe _, do you? M.lookup recently got the more general type M.lookup :: Monad m => _ -> m a, so it happily lives in IO.
I was aware of this and counted on the lookup causing the program to stop if the column didn't exist.
The following line does unnecessary work: myRead splits a row to get access to the date, but now you join it without having changed any field. It would be wiser to split for the date but to keep an intact copy of the line so that you can pass it here without join. This will reduce memory footprint.
Another good observation which I missed.
Your current solutions reads the input and "calculates" all output files before writing them to disk in a final step. This means that the contents of the output files has to be kept in memory. Thus you need least a constant * 100MB of memory. I don't know how ByteString interacts with garbage collection, but it may well be that by keeping the first line (you "cols") in memory, the entire input file contents is also kept which means an additional constant * 100 MB. It is likely that both can be shared if one resolves the code quirks mentioned above.
I intentionally chose this design to minimize the amount of file access which seems to be quite slow (see below). After fixing the two slips you pointed out, my code works as expected, processing 100MB in about 1 minute using around 550MB of heap. Here is the good version (where B is Data.ByteString.Lazy.Char8 and M is Data.Map): myRead file = do v <- B.readFile file let (cols' : rows) = B.lines v cols = foldl' (\mp (k,v) -> M.insert k v mp) M.empty $ zip (B.split ',' cols') [0 ..] return (cols, rows) dates file nRows = do (cols, rows) <- myRead file dateIx <- M.lookup (B.pack "\"Date\"") cols let addDate mp row = M.insert date (row:old) mp where date = (B.split ',' row)!!dateIx old = M.findWithDefault [] date mp res = foldl addDate M.empty $ take nRows rows mapM_ writeDate $ M.toList res where fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ') writeDate (date,rows) = B.writeFile (dataDir++fmt date) (B.unlines rows)
A better solution would be to begin output before the the whole input is read, thus making things more lazy. This can be done the following way: from the input, construct a lazy list of (date,line) pairs. Then, let foldM thread a map from dates to corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file and things should only require memory for the different dates, besides buffering.
I tried this approach previously and it seems to be unacceptably slow. I thought the slowness was just due to the fact that file operations are slow, but I'll include my code here (cleaned up to take some of your previous comments into account) just in case there is something subtle I'm missing which is slowing down the code (B, M, and myRead are as above): dates' file nRows = do (cols, rows) <- myRead file dateIx <- M.lookup cols $ B.pack "\"Date\"" let writeDate row = B.appendFile (dataDir++fmt date) row where date = (B.split ',' row)!!dateIx fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ') oldFiles <- getDirectoryContents dataDir mapM_ (\f -> catch (removeFile $ dataDir++f) $ const $ return ()) oldFiles mapM_ writeDate $ take nRows rows This code takes over 20 minutes to process 100MB on my machine.
In a setting without IO, the task corresponds to the "Optimization Problem" discussed at length in September on this list. The problem here is that writeFile currently cannot be interleaved lazily, this has to be simulated with appendFile. We can read files lazily but we cannot output them lazily. Can this be remedied? Can there be a version of writeFile which is, in a sense, dual to getContents?
Wouldn't this require blocking IO? thanks for your help, Jeff

A better solution would be to begin output before the the whole input is read, thus making things more lazy. This can be done the following way: from the input, construct a lazy list of (date,line) pairs. Then, let foldM thread a map from dates to corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file and things should only require memory for the different dates, besides buffering.
I tried this approach previously and it seems to be unacceptably slow. I thought the slowness was just due to the fact that file operations are slow, but I'll include my code here (cleaned up to take some of your previous comments into account) just in case there is something subtle I'm missing which is slowing down the code (B, M, and myRead are as above):
dates' file nRows = do (cols, rows) <- myRead file dateIx <- M.lookup cols $ B.pack "\"Date\"" let writeDate row = B.appendFile (dataDir++fmt date) row where date = (B.split ',' row)!!dateIx fmt = B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ') oldFiles <- getDirectoryContents dataDir mapM_ (\f -> catch (removeFile $ dataDir++f) $ const $ return ()) oldFiles mapM_ writeDate $ take nRows rows
This code takes over 20 minutes to process 100MB on my machine.
No wonder, as this opens and closes the file on every row. The operating system will be kept quite busy that way! In some sense, your are outsourcing the row collecting M.Map to the OS... Of course, you want to open the files once and dispatch the rows to the different open handles. Here is a version (untested) which either does the read all then write approach (group'n'write) or opens the output files simultaneously (group'n'write2). Note also that there is no need to use M.Map for finding the "Date" keyword in the CSV header (which even hurts performance) though the effects are completely negligible. main = do args <- getArgs case args of ["dates",file,nRows] -> dates file (read nRows) dates file nRows = B.readFile file >>= group'n'write . sugarWithDates . take nRows . B.lines sugarWithDates (header:rows) = map (\r -> (B.split ',' r) !! dateIx, r)) rows where Just dateIx = Data.List.lookup (B.pack "\"Date\"") $ zip (B.split "," header) [0..] formatDate = B.unpack . B.map (\x -> if x == '-' then '_' else x) . B.takeWhile (/= ' ') date2filename = (dataDir ++) . formatDate group'n'write = mapM_ writeDate . M.toList . foldl' addDate M.empty where addDate mp (date,row) = M.insertWith date (\new old -> row:old) [] mp writeDate (date,rows) = B.writeFile (date2filename date) $ B.unlines rows group'n'write2 = foldM addDate M.empty >>= mapM_ hClose . M.elems where addDate mp (date,row) = do (fp,mp) <- case M.lookup date mp of Just fp -> return (fp,mp) _ -> do fp <- openFile (date2filename date) WriteMode return (fp, M.insert date fp mp) hPut fp row return mp The thing that bugs me is that one cannot separate group'n'write2 = write2 . group where (group) is a pure function. I think some kind of lazy writeFile could allow this.
thanks for your help, No problem. :)
Regards, apfelmus

Hello, The (almost) point-free versions run faster than my "fast" imperative version and take up significantly less heap space-- even the version which reads everything and then writes takes up about 1/3 the heap space as my version. I get the impression that point-free style is a preventive measure against space leaks. thanks again, Jeff

The (almost) point-free versions run faster than my "fast" imperative version and take up significantly less heap space-- even the version which reads everything and then writes takes up about 1/3 the heap space as my version.
That was not intended, though I'm very pleased :-D
I get the impression that point-free style is a preventive measure against space leaks.
Ah, good point, I didn't think about it that way. Point-less makes a bit sure that old values are not leaking around as they are not passed through the "pipe". Yet, I'm a bit astonished. I thought that when compiling with -O2, cosmetic changes should become negligible. Perhaps the strict foldl' has an effect? Regards, afpelmus

Hello,
Yet, I'm a bit astonished. I thought that when compiling with -O2, cosmetic changes should become negligible. Perhaps the strict foldl' has an effect?
Perhaps... but I doubt that is the main reason. At the moment I have no idea why there is such a discrepancy between the heap usages... A big part of why the solutions you crafted work so efficiently is that they take advantage of the fact that the rows will be written out exactly as they are read in. I wanted to see if a more general code could maintain the same efficiency. Here is some code to read in a file, write out a file, and do selections-- the idea is that CSV files are internally represented and manipulated as [[ByteString]]. readCSV file = do v <- B.readFile file return $ map (B.split ',') $ B.lines v writeCSV file tbl = do h <- openFile file WriteMode let writeRow = mapM_ (B.hPut h) . (++ [nl]) . intersperse comma mapM_ writeRow tbl hClose h where comma = B.singleton ',' nl = B.singleton '\n' select targs test (cols : rows) = map narrow (cols : passTest rows) where myFilter = map snd . filter fst passTest = myFilter . map (\row -> (runReader test (zip cols [0..], row), row)) narrow = myFilter . zip (map (`elem` targs) cols) col x = do (cols,row) <- ask let Just i = lookup (B.pack x) cols return $ row!!i This code runs reasonably fast-- around 13 seconds to read in a 120MB file (~750000 rows), select half the columns of around 22000 rows randomly distributed throughout the input table, and write a new CSV file. It takes around 90 seconds to just remove some columns from every row in the table and write a new file. So the slow part of the program is probably the writeCSV function. Do you think these times can be improved upon? -Jeff

jeff p wrote:
Hello,
Yet, I'm a bit astonished. I thought that when compiling with -O2, cosmetic changes should become negligible. Perhaps the strict foldl' has an effect?
Perhaps... but I doubt that is the main reason. At the moment I have no idea why there is such a discrepancy between the heap usages...
A big part of why the solutions you crafted work so efficiently is that they take advantage of the fact that the rows will be written out exactly as they are read in. I wanted to see if a more general code could maintain the same efficiency. Here is some code to read in a file, write out a file, and do selections-- the idea is that CSV files are internally represented and manipulated as [[ByteString]].
readCSV file = do v <- B.readFile file return $ map (B.split ',') $ B.lines v
Good, writeCSV writes out every row immediately after it got it. I eliminated (++ [nl]) in the hope of reducing the constant factor slightly. Using difference lists for that is nicer but here you go.
writeCSV file tbl = do h <- openFile file WriteMode mapM_ (writeRow h) tbl hClose h where comma = B.singleton ',' nl = B.singleton '\n' whriteRow h row = mapM_ (B.hPut h) (intersperse comma row) >> B.hPut h nl
Concerning select, one myFilter can be fused away and there is the "transpose trick" for filtering out the columns: columns get filtered once and for all and (map (`elem` tags)) only needs to be computed once. I don't know why the MonadReader is necessary, so I removed it
select targs test (cols : rows) = cols : filterCols (filterRows rows) where filterRows = filter (test cols) myFilter = map snd . filter fst filterCols = transpose . myFilter . zip colflags . transpose colflags = map (`elem` tags) cols
Concerning col, one should share the index i across different rows. The compiler is likely not to do a full laziness transformation as this bears the danger of introducing space leaks (out of the coder's control, that is).
col x cols = \row -> row !! i where Just i = lookup (B.pack x) $ zip cols [0..]
A possible test is then something like
test = (== B.pack "test") . col "COL"
This code runs reasonably fast-- around 13 seconds to read in a 120MB file (~750000 rows), select half the columns of around 22000 rows randomly distributed throughout the input table, and write a new CSV file. It takes around 90 seconds to just remove some columns from every row in the table and write a new file. So the slow part of the program is probably the writeCSV function. Do you think these times can be improved upon?
I hope so... Though the 13 seconds are disproportionally high (only 22000 rows to be written) compared to 90 seconds (750000 rows to be written). Regards, apfelmus

Hello,
Good, writeCSV writes out every row immediately after it got it. I eliminated (++ [nl]) in the hope of reducing the constant factor slightly. Using difference lists for that is nicer but here you go.
I'm not sure how you'd use difference lists here. Also, for some reason GHC runs slightly slower (compiled with -O) using sequencing instead of ++; the version with >> consistently (a very unrigorous consistence) takes between 5 and 8 seconds longer than the version with ++ (when filtering out some columns from every row and writing the new file). Is there some hidden cost in switching from the mapM_ to a direct sequence?
Concerning select, one myFilter can be fused away and there is the "transpose trick" for filtering out the columns: columns get filtered once and for all and (map (`elem` tags)) only needs to be computed once. I don't know why the MonadReader is necessary, so I removed it
You're right, no need for myFilter on the rows. I used the Reader because at first I thought it would make a nicer interface (I have since changed my mind about that) and it didn't seem to hurt performance. I think using transpose requires more work (and is slower) than filtering each row, particularly if the mask (map (`elem` tags)) is only computed once (shouldn't the compiler do that automatically since the expression is constant?). Transpose requires passing over the entire structure at least once (possibly more depending on how clever the compiler is); so that is a minimum of two complete passes over the structure just to transpose and untranspose. Filtering each row is just one pass over the structure (doing very little work assuming the mask is computed once). My (still very unrigorous) performance tests seem to bear this out where the transposed version consistently takes a few seconds longer.
I hope so... Though the 13 seconds are disproportionally high (only 22000 rows to be written) compared to 90 seconds (750000 rows to be written).
Yes, that does seem suspicious. Perhaps the cost of the testing the filter on each row accounts for this? I've attached my current fastest code if you, or anybody, is interested in taking a shot at tightening it further (or just interested in lightweight CSV-like processing). thanks, Jeff ---------- readCSV file = do v <- B.readFile file return $ map (mySplit ',') $ B.lines v writeCSV file tbl = do h <- openFile file WriteMode let writeRow = mapM_ (B.hPut h) . (++ [nl]) . intersperse comma mapM_ writeRow tbl hClose h where comma = B.singleton ',' nl = B.singleton '\n' select targs test (cols : rows) = map narrow (cols : filter (test cols) rows) where narrow = map snd . filter fst . zip mask mask = map (`elem` targs) cols col x cols row = row !! i where Just i = lookup x $ zip cols [0..] {-- A slightly smarter ByteString split to deal with quotes and remove whitespace. This could be optimized somewhat (e.g. folding killspace into mySplit) but the current performance seems to be as good as ByteString.split-- i.e. using ByteString.split versus mySplit doesn't affect running time. --} mySplit c bs = go False 0 bs where go isQuote i xs | i >= B.length xs = [killSpace xs] | isQuote && x0 == '\\' = go True (i+2) xs | x0 == '"' = go (not isQuote) (i+1) xs | not isQuote && x0 == c = killSpace x' : go False 0 (B.tail xs') | otherwise = go isQuote (i+1) xs where x0 = B.index xs i (x',xs') = B.splitAt i xs killSpace = B.dropWhile isSpace . dropEndWhile isSpace dropEndWhile test x | B.null x = x | otherwise = if test $ B.last x then dropEndWhile test $ B.init x else x

Time to write actual code and do some measurements :) The code attached at the end of the message gets compiled with -O2. Writing a sample test file happens with (writeTest #columns #rows) like in writeTest 4 500000 (~7 seconds, ~770MB heap (=:-o), 3MB test file). I assume the heap spaces from writeTest are everything added together as 'top' does not report any memory bursts. The following matrix represents the performance comparisons between four possibilities. The times get produced by calling (test #matrixrow (filtertest #matrixcol)) map transpose ++[nl] 2.76,2.87,2.83 2.72,2.88,2.96
2.72,2.92,2.87 2.82,2.79,2.88
No significant differences. In a test with more rows, >> seems to perform slightly better (as expected). Transpose is a bit better, too: writeTest 10 750000 (~24 seconds, ~2.8GB heap (=:-o), 15MB test file) map transpose ++[nl] 3.50,3.59,3.42 3.23,3.26,3.29,3.19
3.38,3.41,3.41 3.19,3.14,3.23
Looks like my measurements somewhat disagree with yours. Odd. But note that by discriminating the to be tested functionality on run-time, the compiler gets no chance to optimize things for the particular case. So in reality, (++[nl]) could trigger a good code transformation whereas (>>) does not. Also note that transpose is very lazy and is far cheaper than it looks. Somehow, the 2.8 and 3.5 seconds are not in proportion with respect to the inputs of 3MB and 15MB or the outputs of 590KB and 400KB (yes, the smaller input produces a larger output). Your 13 seconds versus 90 seconds makes this even more puzzling. But it looks like writing a CSV file is far more expensive than reading one. Maybe it's not a good idea to call hPut very often.
the mask (map (`elem` tags) cols) is only computed once (shouldn't the compiler do that automatically since the expression is constant?) [...]
col x cols row = row !! i where Just i = lookup x $ zip cols [0..]
One has to be careful, col x cols = \row -> row !! i where Just i = lookup x $ zip cols [0..] is different as it shares i across all rows. The compiler is likely not to do this easy transformation ("full laziness" transformation), for col because this can introduce space leaks. These are things the programmer should have control over, so no optimization here. See also http://haskell.org/haskellwiki/GHC/FAQ#When_can_I_rely_on_full_laziness.3F I think the reason given there is wrong, it's not about efficiency but about space leaks. The map showcase suggests that (map (`elem` tags) cols) is computed only once, though personally, I don't rely on that (yet). Regards, apfelmus
---------------------------------------- module CSV where
import qualified Data.ByteString.Lazy.Char8 as B import Data.List import System.IO
{------------------------------------------------------------------------------- Reading and writing CSV (comma separated value) files --------------------------------------------------------------------------------}
readCSV :: FilePath -> IO [[B.ByteString]] readCSV file = do v <- B.readFile file return $ map (B.split ',') $ B.lines v
writeCSV :: Int -> FilePath -> [[B.ByteString]] -> IO () writeCSV i file tbl = do h <- openFile file WriteMode mapM_ (writeRow i h) tbl hClose h
writeRow j = case j of 1 -> \h -> mapM_ (B.hPut h) . (++ [nl]) . intersperse comma 2 -> \h row -> (mapM_ (B.hPut h) $ intersperse comma row) >> B.hPut h nl where comma = B.singleton ',' nl = B.singleton '\n'
{------------------------------------------------------------------------------- Processing [[ByteString]] --------------------------------------------------------------------------------} select j targs test (cols : rows) = narrow $ cols : filter (test cols) rows where narrow = colmap j (map snd . filter fst . zip mask) mask = map (`elem` targs) cols
colmap :: Int -> (forall a . [a] -> [a]) -> [[a]] -> [[a]] colmap 1 f = map f colmap 2 f = transpose . f . transpose
col x cols = \row -> row !! i where Just i = lookup x $ zip cols [0..]
if12 = ((== B.pack "2") .) . col (B.pack "1") filtertest j = select j (map B.pack ["1","2","4"]) if12
test i f = readCSV "test.csv" >>= writeCSV i "result.csv" . f
{------------------------------------------------------------------------------- Test cases --------------------------------------------------------------------------------} rotated :: Int -> [[B.ByteString]] rotated n = map (take n) . iterate (drop n) . concat . repeat . map (B.pack . show) $ [1..(n+1)]
writeTest c r = writeCSV 1 "test.csv" . take r . rotated $ c

Hello apfelmus, Wednesday, October 18, 2006, 3:22:31 PM, you wrote:
smaller input produces a larger output). Your 13 seconds versus 90 seconds makes this even more puzzling. But it looks like writing a CSV file is far more expensive than reading one. Maybe it's not a good idea to call hPut very often.
of course :) it takes a lock on each operation. why you can't use pure function approach as i suggested several days ago? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

In fact avoiding space leaks was one of the motivations for our moving to an arrow framework for FRP (now called Yampa). Arrows amount to a point-free coding style, although with "arrow syntax" the cumbersomeness of programming in that style is largely alleviated. -Paul jeff p wrote:
Hello,
The (almost) point-free versions run faster than my "fast" imperative version and take up significantly less heap space-- even the version which reads everything and then writes takes up about 1/3 the heap space as my version.
I get the impression that point-free style is a preventive measure against space leaks.
thanks again, Jeff

Paul Hudak wrote:
In fact avoiding space leaks was one of the motivations for our moving to an arrow framework for FRP (now called Yampa). Arrows amount to a point-free coding style, although with "arrow syntax" the cumbersomeness of programming in that style is largely alleviated.
I think that's an entirely different thing. You changed representation of signal transformers from newtype SF a b = SF ([a] -> [b]) to data SF a b = SF (a -> (b, SF a b)) By enforcing a synchronous processing, you avoid leaking Signals. The latter cannot be isomorphic to a function type (Signal a -> Signal b) for an appropriate Signal, so this implies a point-free style as there is no way to hold stuff of type (Signal a) in variable bindings. This does not mean that there is no point-wise syntax for arrows, it just means that point-wiseness cannot be achieved via variables in the context of function application, i.e. via lambda abstraction. In fact, the main point about Arrows is not that they are an abstraction for computations but that they allow a point-wise syntactic sugar (which stems from their computational being, of course)! The optimization problem here uses (almost) one and the same representation (pure (a -> b), sometimes packed in (a -> IO b)) and point-free turns out to be space friendlier than point-wise. That's very puzzling and i think ghc -O2 should eliminate this. Regards, afpelmus PS: IMHO the semantics of (SF a b) need a real cleanup. (Time -> a) -> (Time -> b) is too crude, because these map transformers even cannot be made an instance of ArrowChoice. Also, Dirac-like peaks (that is Events) do not fit in.

I think that you misunderstood what I said. When we went from FRP to Yampa, we changed from using signals directly, i.e. Signal a, to using signal functions, i.e.:
SF a b = Signal a -> Signal b
When we did this, we made SF abstract, and we adopted the arrow framework to allow composing, etc. signal functions. This meant that you could not get your hands on Signals directly, which helped to prevent space leaks.
What you describe above is a change that we made in the /implementation/ of signal functions (specifically, from streams to continuations), which indeed is an entirely different thing.
You mean that only the fact that (Signal a -> Signal b) got abstract prevented space leaks? Can you give an example? That the implementation with continuations avoids space leaks is clear. The question is whether the old does when using the new primitives. In fact, this amounts to the question whether (inject) as defined in newtype SF' a b = SF' ([a] -> [b]) data SF a b = SF (a -> (b, SF a b)) inject :: SF a b -> SF' a b inject (SF f) (a:as) = let (b,g) = f a in b:inject g as preserves space and time complexity: inject (sf `o` sf') =same space &time= (inject sf) . (inject sf') and the same for all other operations you provide besides `o`. Regards, apfelmus

Hello apfelmus, Thursday, October 12, 2006, 4:42:14 PM, you wrote:
A better solution would be to begin output before the the whole input is read, thus making things more lazy. This can be done the following way: from the input, construct a lazy list of (date,line) pairs. Then, let foldM thread a map from dates to corresponding output file pointers through the list and, at the same time, use the file pointers to output the line in question via appendFile. This way, every line consumed is immediately dispatched to its corresponding output file and things should only require memory for the different dates, besides buffering.
In a setting without IO, the task corresponds to the "Optimization Problem" discussed at length in September on this list. The problem here is that writeFile currently cannot be interleaved lazily, this has to be simulated with appendFile. We can read files lazily but we cannot output them lazily. Can this be remedied? Can there be a version of writeFile which is, in a sense, dual to getContents?
this can be solved in other way. here is a program that reads stdin and puts to stdout lines starting with '>' and to stderr the rest. note that our main processing function is pure: main = do a <- getContents let b = map (processing stdout stderr) (lines a) mapM_ (\(file,line) -> hPutStrLn file line) b processing file1 file2 line = if ">" `isPrefixOf` line then (file1,line) else (file2,line) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Bulat, sorry, I just completely forgot to write down the answer for your post.
Can this be remedied? Can there be a version of writeFile which is, in a sense, dual to getContents?
this can be solved in other way. here is a program that reads stdin and puts to stdout lines starting with '>' and to stderr the rest. note that our main processing function is pure:
main = do a <- getContents let b = map (processing stdout stderr) (lines a) mapM_ (\(file,line) -> hPutStrLn file line) b
processing file1 file2 line = if ">" `isPrefixOf` line then (file1,line) else (file2,line)
(processing) does grouping, but only a part of it. The task of collecting the writes to the different files is still left to the operating system. Furthermore, this code will have a hard time to extend to a dynamical count of files. I imagined that there might be a way where (processing) already does the grouping work and trashes the order in which things were read in and a lazy write would reconstruct the order by laziness and interleave appropriate write calls. But I now think this is not possible. So any grouping function like group :: Input -> Data.Map Key [Value] trashes the order in which different data was read and there cannot be a reconstruction. Of course, if one uses a different grouping data structure which still keeps track of the order of data arrival, i.e. group :: Input -> Data.Magic.Groupy Key Value reconstruction becomes possible. Indeed, (IO a) can be used as "grouping data structure" by virtue of insert :: Key -> Value -> IO () insert = writeFile and this is too short to merit the boilerplate of an additional purely functional abstract data structure between the grouping and the file writing part. One simply does not gain additional clarity. Regards, apfelmus
participants (4)
-
apfelmus@quantentunnel.de
-
Bulat Ziganshin
-
jeff p
-
Paul Hudak