On Sun, Jul 20, 2008 at 7:25 AM, Chaddaï Fouché <chaddai.fouche@gmail.com> wrote:

> That's exactly what I thought. But even if I remove the only 'try' I use the
> memory consumption remains unchanged:

It's true, but in your case your output is almost the raw input data,
which means that even without a noxious "try", you still have the
whole file in memory. Well hopefully not with your latest code, which
I would really like to see.
 
Here is the part that actually changed:

-----------
split c str = let (p,ps) = aux str in (p:ps)
    where
      aux     [] = ([],[])
      aux (x:cs) = let (xs,xss) = aux cs in
                   if x == c then ([c],(xs:xss)) else ((x:xs),xss)

splitPred :: (Eq a) => (a -> Bool) -> [a] -> [[a]]
splitPred pr str = let (p,ps) = aux str in (p:ps)
    where
      aux     [] = ([],[])
      aux (x:cs) = let (xs,xss) = aux cs in
                   if pr x then ([],((x:xs):xss)) else ((x:xs),xss)

doOneFile :: String -> IO ()
doOneFile fname = do
  t1 <- getCurrentTime
  doesFileExist (fname ++ ".html") >>= \b -> if b then hPutStrLn stderr $ printf "File already processed, skipping: %s" fname else do
    src <- readFile fname
    out <- openFile (fname ++ ".html") WriteMode
    hSetBuffering out (BlockBuffering (Just 64000))
    hPutStrLn out "<html>"
    hPutStrLn out "<body bgcolor=\"black\">"
    hPutStrLn out "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">"
    hPutStrLn out "<span style=\"font-family: monospace; font-size: 13;\" ><span>"
    let extractData = \p -> case p of
                              Right x -> x
                              Left err -> (trace . show $ err) []
    let srcSplit = splitPred (`elem`"\n") src
    let parsed = concatMap (extractData . parse mainParser fname) srcSplit
    execStateT (hPrintHtml (St id)) (out,emptyStyle) -- wypisujemy pierwszy wiersz
    execStateT (mapM_ hPrintHtml parsed) (out,emptyStyle)
    hPutStrLn out "</span></span>"
    hPutStrLn out "</body>"
    hPutStrLn out "</html>"
    t2 <- getCurrentTime
    hPutStrLn stderr $ printf "File %s processed. It took %s. File size was %d characters." fname (show $ diffUTCTime t2 t1) (length src)
    hClose out
------

The whole file is also attached. You will find there another (worse) implementation of split and a little bit of code similar to thread pool stuff.

On Sun, Jul 20, 2008 at 8:17 AM, John Meacham <john@repetae.net> wrote:
On Sun, Jul 20, 2008 at 02:34:09AM +0400, Bulat Ziganshin wrote:
> i think that Parsec library should hold entire file in memory only when
> you use 'try' for whole file. otherwise it should omit data as
> proceeded

I do not believe that is the case, since the return type of runParser
"Either ParseError a" means that before you can extract the result of
the parse from the 'Right' branch, it must evaluate whether the result
is 'Left' or 'Right' meaning it needs to parse the whole input in order
to determine whether the parse was succesful.

It's true it has to parse the whole file, but it is not true it has to reside in the memory: only the results must be there. In this case, when the result is 1-1 transformation of input, it is true. But consider this program:
----
module Main where
import Text.ParserCombinators.Parsec

par = eof <|> (char 'a' >> par)

alst = take 200000000 (repeat 'a')

main = print (runParser par () "" alst)
----
It runs in constant memory:

$ ./partest.exe +RTS -sstderr
C:\cygwin\home\Metharius\killer\killerPy\ansi2html\partest.exe +RTS -sstderr
Right ()
  84,326,845,636 bytes allocated in the heap
      22,428,536 bytes copied during GC
           9,684 bytes maximum residency (1 sample(s))
          13,848 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 160845 collections,     0 parallel,  0.63s,  0.63s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time   54.31s  ( 54.55s elapsed)
  GC    time    0.63s  (  0.63s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   54.95s  ( 55.17s elapsed)

  %GC time       1.1%  (1.1% elapsed)

  Alloc rate    1,552,176,623 bytes per MUT second

  Productivity  98.8% of total user, 98.4% of total elapsed


Best regards
Christopher Skrzętnicki