ansi2html - one program, several issues

Hi all,

Recently I wrote a little program that takes a text file with ANSI color codes and outputs an HTML file. Example input and output files (also attached):

http://tener.videomed.com.pl/ansi2html/samo_pk_merusa.log
http://tener.videomed.com.pl/ansi2html/samo_pk_merusa.log.html

The good thing: it is linear in time. But this is not so surprising given the problem stated. The bad thing: it is also linear in space, and I have a feeling that it could run a lot faster. I decided to optimize the code, and now I have two things I don't understand and one bug:

1) Profiling shows that very simple functions are the source of great memory and time consumption. However, if I turn them off and simply print their input arguments instead, the overall time and memory consumption doesn't change, but then another function starts behaving badly. My guess: somehow the cost of the Parsec code is shifted into whatever function is using its output. Let's see:

doOneFile :: String -> IO ()
doOneFile fname = do
  t1 <- getCurrentTime
  doesFileExist (fname ++ ".html") >>= \b ->
   if b
    then hPutStrLn stderr $ printf "File already processed, skipping: %s" fname
    else do
      src <- readFile fname
      out <- openFile (fname ++ ".html") WriteMode
      let parsed = parse mainParser fname src
      hSetBuffering out (BlockBuffering (Just 64000))
      hPutStrLn out "<html>"
" hPutStrLn out "" hPutStrLn out "<span>" execStateT (hPrintHtml (St id)) (out,emptyStyle) -- wypisujemy pierwszy wiersz execStateT (mapM_ *hPrintHtml* (((either (\x -> (trace $ show x) [] ) (id) parsed)) :: [CharOrColor])) (out,emptyStyle) -- *[1]* -- mapM_ (hPutStr out . show) (((either (\x -> (trace $ show x) [] ) (id) ((parse mainParser fname src)))) :: [CharOrColor]) -- *[2]* hPutStrLn out "</span></span>" hPutStrLn out "</body>" hPutStrLn out "</html>" t2 <- getCurrentTime hPutStrLn stderr $ printf "File %s processed. It took %s. File size was %d characters." fname (show $ diffUTCTime t2 t1) (length src) hFlush out hClose out
When profiled:

COST CENTRE   MODULE   %time  %alloc
beginSpan     Main      32.7    39.7
hPrintHtml    Main      21.1    18.4
dataParser    Main      17.3    18.6
doOneFile     Main      10.4     7.4
colorParser   Main       8.7     8.8
justColor     Main       3.2     2.0
mainParser    Main       3.2     1.9
code0'        Main       1.1     1.1

OK, let's exchange [1] for [2]. Now:

doOneFile     Main      53.4    59.9
dataParser    Main      21.6    21.7
colorParser   Main      13.6    10.3
justColor     Main       3.8     2.3
mainParser    Main       3.6     2.2
code0'        Main       1.3     1.2

2) I tried to use both of my processor cores, but simply spawning a thread for every file quickly led to swapping. So instead I implemented a "thread pool" with a fixed number of threads:

-- begin code --
type Pool = Chan ()

takeFromPool :: Pool -> IO ()
takeFromPool p = readChan p >> return ()

fillPool :: Pool -> IO ()
fillPool p = writeChan p ()

makeThreadPool :: Int -> IO Pool
makeThreadPool num = do
  p <- newChan
  repeatNum num (fillPool p)
  return p

repeatNum :: Int -> IO () -> IO ()
repeatNum n act
  | n > 0     = act >> repeatNum (n-1) act
  | otherwise = return ()

sparkComp :: Pool -> IO () -> IO (MVar ())
sparkComp pool comp = do
  takeFromPool pool >>= evaluate
  mvar <- newEmptyMVar >>= evaluate
  forkIO $ (comp >> fillPool pool >> putMVar mvar ()) >>= evaluate -- core dumps when changed to forkOS
  return mvar

mapMPar :: (a -> IO ()) -> [a] -> Int -> IO ()
mapMPar comp lst numT = do
  tPool <- makeThreadPool numT
  mvars <- mapM (sparkComp tPool) (map comp lst)
  mapM_ takeMVar (mvars :: [MVar ()])
  return ()

numThreads = 2
-- end code --

Now, when compiled with -threaded and run with -N2, the program is no faster, but this may be a result of the -threaded switch itself. The real puzzle for me: without the -threaded switch, the code using mapMPar is 30% faster than the one using mapM_, even when numThreads = 1. I have no idea why.

3) When I changed my thread pool implementation to use forkOS instead of forkIO, it core dumped when run with -N2 on multiple input files. With -N1 it was OK. Unfortunately I was unable to reproduce this bug after I changed something unrelated to the code above. I'm working with GHC version 6.9.20080622 on Windows XP. Perhaps someone will succeed in reproducing it.

Do you have any ideas about this program?

Best regards
Christopher Skrzętnicki
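For reference, a minimal driver for the pool above might look like this (a sketch only; it assumes the doOneFile and numThreads from the listings above and takes input file names from the command line):

import System.Environment (getArgs)

main :: IO ()
main = do
  files <- getArgs            -- one input file per argument
  mapMPar doOneFile files numThreads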

2008/7/19 Krzysztof Skrzętnicki:

> Hi all
>
> 1) Profiling shows that very simple functions are the source of great memory and time consumption. However, if I turn them off and simply print their input arguments instead, the overall time and memory consumption doesn't change, but then another function starts behaving badly. My guess: somehow the cost of the Parsec code is shifted into whatever function is using its output. Let's see:
Are you using Parsec to parse the whole file? Then your problem is there: Parsec needs to read and process the whole file before it can give us any output, since it thinks it could have to give us an error instead, and it can't be sure of that before it has read the whole thing... In your case, your problem is such that you would prefer to treat the file as a stream, isn't it? There are some parser libraries that can produce output lazily (look at the polyparse flavour); another option would be to use Parsec only where you need it and just read and print the ordinary text, for example.

-- Jedaï
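A sketch of that last option, running Parsec once per line (lineParser and emit are stand-ins for a per-line mainParser and hPrintHtml; neither name is from the thread):

import System.IO (hPrint, stderr)
import Text.ParserCombinators.Parsec

-- Run the parser once per line and print as we go, so the already
-- processed prefix of the file can be garbage collected.
perLine :: Parser [a] -> (a -> IO ()) -> String -> IO ()
perLine lineParser emit src =
    mapM_ (either (hPrint stderr) (mapM_ emit) . parse lineParser "<line>")
          (lines src)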

I forgot to mention that the memory consumption is several times higher than the file size. On an 8.3 MB file:

532 MB total memory in use (4 MB lost due to fragmentation)

Having those 8 MB in memory is not the problem; 532 MB is another story. In general, the program consumes roughly 64 times more memory than the file size, and it scales linearly.
Best regards
Christopher Skrzętnicki

Hello Krzysztof,

Sunday, July 20, 2008, 12:49:54 AM, you wrote:

> I forgot to mention that the memory consumption is several times higher than the file size. On an 8.3 MB file:
>
> 532 MB total memory in use (4 MB lost due to fragmentation)
>
> Having those 8 MB in memory is not the problem; 532 MB is another story. In general, the program consumes roughly 64 times more memory than the file size, and it scales linearly.

On 32-bit computers, 36x memory requirements for storing large strings in memory are the rule; on 64-bit ones, 72x.
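(At that rate, the 8.3 MB file above costs roughly 8.3 million characters at 36 bytes each, around 300 MB for the String alone, before counting the parser's [CharOrColor] output; that puts the observed 532 MB in the expected range.)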
--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

2008/7/19 Krzysztof Skrzętnicki:

> I forgot to mention that the memory consumption is several times higher than the file size. On an 8.3 MB file:
>
> 532 MB total memory in use (4 MB lost due to fragmentation)
>
> Having those 8 MB in memory is not the problem; 532 MB is another story. In general, the program consumes roughly 64 times more memory than the file size, and it scales linearly.
You should be using ByteString, though this problem would be alleviated if you were consuming the file as a stream.

-- Jedaï
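A minimal sketch of the ByteString route (the file name and convertLine are placeholders for illustration, not code from the thread):

import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
    src <- B.readFile "input.log"       -- lazy: read in chunks on demand
    mapM_ (B.putStrLn . convertLine) (B.lines src)
  where
    convertLine = id  -- placeholder for the real ANSI-to-HTML rewriting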

On Sat, Jul 19, 2008 at 11:35 PM, Chaddaï Fouché wrote:
> You should be using ByteString, though this problem would be alleviated if you were consuming the file as a stream.
Since ANSI color codes don't contain characters like newline or space, I have simply split the input file into lines. Now the whole program behaves much better: GC time is below 10% and memory consumption dropped to 74 MB per thread. That's still a lot of memory though, and it certainly holds much more than one line of text.

Best regards
Christopher Skrzętnicki

Hello Krzysztof,

Sunday, July 20, 2008, 1:55:45 AM, you wrote:
> 532 MB total memory in use (4 MB lost due to fragmentation)
I think the Parsec library should hold the entire file in memory only when you use 'try' for the whole file; otherwise it should discard the data as it proceeds.

--
Best regards,
 Bulat                            mailto:Bulat.Ziganshin@gmail.com

On Sun, Jul 20, 2008 at 12:34 AM, Bulat Ziganshin wrote:

> Hello Krzysztof,
>
> Sunday, July 20, 2008, 1:55:45 AM, you wrote:
>
>> 532 MB total memory in use (4 MB lost due to fragmentation)
>
> I think the Parsec library should hold the entire file in memory only when you use 'try' for the whole file; otherwise it should discard the data as it proceeds.
That's exactly what I thought. But even if I remove the only 'try' I use, the memory consumption remains unchanged:

C:\cygwin\home\Metharius\killer\KillerPy\ansi2html\ansi2html_old.exe duzy.log +RTS -sstderr
File duzy.log processed. It took 5.046875s. File size was 4166578 characters.

   3,950,649,704 bytes allocated in the heap
     535,544,056 bytes copied during GC
     117,603,408 bytes maximum residency (9 sample(s))
       1,647,828 bytes maximum slop
             265 MB total memory in use (2 MB lost due to fragmentation)

  Generation 0:  7527 collections,     0 parallel,  0.86s,  0.86s elapsed
  Generation 1:     9 collections,     0 parallel,  0.80s,  0.81s elapsed

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    3.20s  (  3.63s elapsed)
  GC    time    1.66s  (  1.67s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.88s  (  5.30s elapsed)

  %GC time      34.0%  (31.6% elapsed)

  Alloc rate    1,227,386,315 bytes per MUT second

  Productivity  65.7% of total user, 60.5% of total elapsed

One more thing to note: with partial parsing there is no longer a difference between mapM_ and mapMPar.

Best regards
Christopher Skrzętnicki

2008/7/20 Krzysztof Skrzętnicki:
> That's exactly what I thought. But even if I remove the only 'try' I use, the memory consumption remains unchanged:
It's true, but in your case your output is almost the raw input data, which means that even without a noxious "try" you still have the whole file in memory. Well, hopefully not with your latest code, which I would really like to see.

-- Jedaï

On Sun, Jul 20, 2008 at 7:25 AM, Chaddaï Fouché wrote:

>> That's exactly what I thought. But even if I remove the only 'try' I use, the memory consumption remains unchanged:
>
> It's true, but in your case your output is almost the raw input data, which means that even without a noxious "try" you still have the whole file in memory. Well, hopefully not with your latest code, which I would really like to see.
Here is the part that actually changed:

-----------

split c str = let (p,ps) = aux str in (p:ps)
  where aux []     = ([],[])
        aux (x:cs) = let (xs,xss) = aux cs
                     in if x == c
                         then ([c],(xs:xss))
                         else ((x:xs),xss)

splitPred :: (Eq a) => (a -> Bool) -> [a] -> [[a]]
splitPred pr str = let (p,ps) = aux str in (p:ps)
  where aux []     = ([],[])
        aux (x:cs) = let (xs,xss) = aux cs
                     in if pr x
                         then ([],((x:xs):xss))
                         else ((x:xs),xss)

doOneFile :: String -> IO ()
doOneFile fname = do
  t1 <- getCurrentTime
  doesFileExist (fname ++ ".html") >>= \b ->
   if b
    then hPutStrLn stderr $ printf "File already processed, skipping: %s" fname
    else do
      src <- readFile fname
      out <- openFile (fname ++ ".html") WriteMode
      hSetBuffering out (BlockBuffering (Just 64000))
      hPutStrLn out "<html>"
      -- [the hPutStrLn calls emitting the <head>/<body> markup were
      --  eaten by the list archive]
      hPutStrLn out "<span>"
      let extractData = \p -> case p of
                                Right x  -> x
                                Left err -> (trace . show $ err) []
      let srcSplit = splitPred (`elem` "\n") src
      let parsed   = concatMap (extractData . parse mainParser fname) srcSplit
      execStateT (hPrintHtml (St id)) (out,emptyStyle) -- print the first line
      execStateT (mapM_ hPrintHtml parsed) (out,emptyStyle)
      hPutStrLn out "</span></span>"
      hPutStrLn out "</body>"
      hPutStrLn out "</html>"
      t2 <- getCurrentTime
      hPutStrLn stderr $ printf "File %s processed. It took %s. File size was %d characters." fname (show $ diffUTCTime t2 t1) (length src)
      hClose out
The whole file is also attached. You will find there another (worse) implementation of split and a little bit of code similar to the thread pool stuff.
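A quick check of splitPred's behaviour in ghci (the matched newline stays at the head of the following chunk, so no characters are lost):

ghci> splitPred (== '\n') "foo\nbar\n"
["foo","\nbar","\n"]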
On Sun, Jul 20, 2008 at 8:17 AM, John Meacham wrote:

> On Sun, Jul 20, 2008 at 02:34:09AM +0400, Bulat Ziganshin wrote:
>> I think the Parsec library should hold the entire file in memory only when you use 'try' for the whole file; otherwise it should discard the data as it proceeds.
>
> I do not believe that is the case, since the return type of runParser, "Either ParseError a", means that before you can extract the result of the parse from the 'Right' branch, it must evaluate whether the result is 'Left' or 'Right', meaning it needs to parse the whole input in order to determine whether the parse was successful.
It's true that it has to parse the whole file, but it is not true that the file has to reside in memory: only the results must be there. In this case, where the result is a 1-1 transformation of the input, that amounts to the same thing. But consider this program:

----
module Main where

import Text.ParserCombinators.Parsec

par = eof <|> (char 'a' >> par)

alst = take 200000000 (repeat 'a')

main = print (runParser par () "" alst)
----

It runs in constant memory:

$ ./partest.exe +RTS -sstderr
C:\cygwin\home\Metharius\killer\killerPy\ansi2html\partest.exe +RTS -sstderr
Right ()

  84,326,845,636 bytes allocated in the heap
      22,428,536 bytes copied during GC
           9,684 bytes maximum residency (1 sample(s))
          13,848 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 160845 collections,     0 parallel,  0.63s,  0.63s elapsed
  Generation 1:      1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time   54.31s  ( 54.55s elapsed)
  GC    time    0.63s  (  0.63s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   54.95s  ( 55.17s elapsed)

  %GC time       1.1%  (1.1% elapsed)

  Alloc rate    1,552,176,623 bytes per MUT second

  Productivity  98.8% of total user, 98.4% of total elapsed

Best regards
Christopher Skrzętnicki

I played with another approach: no parser library, just plain pattern matching. The idea was to create a function that matches all the different cases of codes. Since I already had most of the code, it was quite easy to do. The core function consists of cases like these:

parse ('\ESC':'[':'1':';':'4':'0':'m':rest) = modifyAndPrint (\x -> x { bgcol = light black  }) >> parse rest
parse ('\ESC':'[':'1':';':'4':'1':'m':rest) = modifyAndPrint (\x -> x { bgcol = light red    }) >> parse rest
parse ('\ESC':'[':'1':';':'4':'2':'m':rest) = modifyAndPrint (\x -> x { bgcol = light green  }) >> parse rest
parse ('\ESC':'[':'1':';':'4':'3':'m':rest) = modifyAndPrint (\x -> x { bgcol = light yellow }) >> parse rest

If you have read the old code you should recognize some parts of it here. It should consume a rather constant amount of memory. To my surprise, it consumed almost exactly the same amount of memory as the previous program. It turns out the problematic line was this one:
hPutStrLn stderr $ printf "File %s processed. It took %s. File size was %d characters." fname (show $ diffUTCTime t2 t1) (length src)

It computed the length of the input file with that final (length src). Needless to say, because "src" was actually the input file parsed previously, it was all hanging around in memory. Having removed that reference to src, both programs (the one that parses the input per line and the most recent one) run in constant memory (2 MB). This doesn't apply to the first program, which has to read the whole file before producing any output. One last note: the new program is also 2x faster, perhaps due to its very simple structure that is easy to optimize. It also now makes sense to use mapMPar, as it reduces the run time by 30%. The full code is in the attachments.

Best regards
Christopher Skrzętnicki
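One way to keep that status line without retaining src is to ask the handle for the file size instead of measuring the String. A sketch of such a fix (an assumption for illustration, not the code from the attachment):

import System.IO

-- Get the size from the filesystem, so no reference to the lazily read
-- contents has to survive until the final status message.
fileSize :: FilePath -> IO Integer
fileSize fname = do
  h  <- openFile fname ReadMode
  sz <- hFileSize h
  hClose h
  return sz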

On Sun, Jul 20, 2008 at 02:34:09AM +0400, Bulat Ziganshin wrote:

> I think the Parsec library should hold the entire file in memory only when you use 'try' for the whole file; otherwise it should discard the data as it proceeds.
I do not believe that is the case, since the return type of runParser, "Either ParseError a", means that before you can extract the result of the parse from the 'Right' branch, it must evaluate whether the result is 'Left' or 'Right', meaning it needs to parse the whole input in order to determine whether the parse was successful. This was the reason I made frisby's main parsing routine just be (roughly)

runPeg :: P a -> String -> a

so you have to do something explicit like

runPegMaybe :: P a -> String -> Maybe a
runPegMaybe p s = runPeg (fmap Just p // return Nothing) s

to force strictness in the parsing. Though perhaps parsec is doing something more clever. I do know it uses the one-token-lookahead trick to determine which branch to take on alternation, but I don't think that solves the issue with parsing the entire file..

John

-- John Meacham - ⑆repetae.net⑆john⑈

On Sunday 20 July 2008, John Meacham wrote:

> I do not believe that is the case, since the return type of runParser, "Either ParseError a", means that before you can extract the result of the parse from the 'Right' branch, it must evaluate whether the result is 'Left' or 'Right', meaning it needs to parse the whole input in order to determine whether the parse was successful.
>
> This was the reason I made frisby's main parsing routine just be (roughly)
>
>   runPeg :: P a -> String -> a
>
> so you have to do something explicit like
>
>   runPegMaybe :: P a -> String -> Maybe a
>   runPegMaybe p s = runPeg (fmap Just p // return Nothing) s
>
> to force strictness in the parsing.
>
> Though perhaps parsec is doing something more clever. I do know it uses the one-token-lookahead trick to determine which branch to take on alternation, but I don't think that solves the issue with parsing the entire file..
It doesn't stop it from parsing the entire file strictly. However, what it does do is prevent the parser from backtracking out of arbitrary amounts of lookahead. So, unless you use try (which allows for lookahead), when any token is consumed by the parser, it can be garbage collected (assuming the parser is the only thing pointing to the token stream). So, it consumes its input strictly, but with limited overhead (ideally using try only for some small bounded lookahead where it's needed).

By contrast, a naive parser combinator of the form:

  p = foo <|> bar    -- or p = try foo <|> bar in parsec

might read the entire file into memory parsing foo, without any of it being garbage collected until completion, in case foo fails and a backtrack to bar is required.

Of course, this all assumes that the input to the parser can both be lazily generated and discarded in pieces (so, not the case if reading an entire file into a strict byte string, for instance).

-- Dan
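To make the contrast concrete, a small assumed pair of parsers (illustrative, not from the thread's program): the first can release characters as soon as they are consumed, while the second must retain the whole input because 'try' may rewind to the start:

import Text.ParserCombinators.Parsec

-- Consumed characters can be garbage collected as parsing proceeds.
streaming :: Parser String
streaming = many (noneOf "\ESC")

-- 'try' keeps every consumed character so it can rewind and run the
-- second branch if "END" never appears.
retaining :: Parser String
retaining = try (manyTill anyChar (string "END")) <|> many anyChar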

Dan Doel wrote:
> On Sunday 20 July 2008, John Meacham wrote:
>> I do not believe that is the case, since the return type of runParser, "Either ParseError a", means that before you can extract the result of the parse from the 'Right' branch, it must evaluate whether the result is 'Left' or 'Right', meaning it needs to parse the whole input in order to determine whether the parse was successful.
> ...
> It doesn't stop it from parsing the entire file strictly. However, what it does do is prevent the parser from backtracking out of arbitrary amounts of lookahead. So, unless you use try (which allows for lookahead), when any token is consumed by the parser, it can be garbage collected (assuming the parser is the only thing pointing to the token stream). So, it consumes its input strictly, but with limited overhead (ideally using try only for some small bounded lookahead where it's needed).
So with Parsec, you can keep the *input* from filling up memory, but if you do, the *result* will still take up space (e.g. Right (value)). For a simple transformation where the output is a string similar to the input, it will be just as large, so not much space is actually saved (maybe a factor of 2: keeping just the output, not also the input), it seems.

-Isaac

On Sun, Jul 20, 2008 at 09:55:15AM -0400, Isaac Dupree wrote:

>> It doesn't stop it from parsing the entire file strictly. However, what it does do is prevent the parser from backtracking out of arbitrary amounts of lookahead. So, unless you use try (which allows for lookahead), when any token is consumed by the parser, it can be garbage collected (assuming the parser is the only thing pointing to the token stream). So, it consumes its input strictly, but with limited overhead (ideally using try only for some small bounded lookahead where it's needed).
>
> So with Parsec, you can keep the *input* from filling up memory, but if you do, the *result* will still take up space (e.g. Right (value)). For a simple transformation where the output is a string similar to the input, it will be just as large, so not much space is actually saved (maybe a factor of 2: keeping just the output, not also the input), it seems.
Yeah, this is my understanding. frisby combats this via 'irrefutable' parser combinators. An irrefutable combinator is one that always succeeds; a prime example is the 'many' combinator. Since 'many' consumes only as many of its arguments as it can and is perfectly fine consuming nothing, it inherently always succeeds, so the parser can immediately begin returning results (before consuming all of the input). Ironically, this means frisby often uses less space than other parsers, despite being based on PEGs, which are generally known for taking a lot of space. It is not too hard to ensure your parser is irrefutable; for instance, the parser for a simple language might be

many statement <> eof

however, the 'eof' makes the parser non-irrefutable. It is easy to gain that back by doing

many statement <> (eof // pure (error "unexpected data"))

frisby's static analysis realizes that (irrefutable // ...) and (... // irrefutable) are irrefutable.

John

-- John Meacham - ⑆repetae.net⑆john⑈
participants (6):

- Bulat Ziganshin
- Chaddaï Fouché
- Dan Doel
- Isaac Dupree
- John Meacham
- Krzysztof Skrzętnicki