
Dear Haskellers,

Another story from an (almost) happy Haskell user who finds himself overwhelmed by laziness/space leaks.

I'm trying to parse a large file (>600MB) with a single S-expression-like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers. Parser follows:
module Main where
import qualified Data.ByteString.Lazy.Char8 as B
import Text.ParserCombinators.Parsec
import Text.ParserCombinators.Parsec.Pos
import System.Environment
import System.Exit
import qualified Data.Map as M
import Lexer
type XdlParser a = GenParser Token XdlState a
-- Parser state
type XdlState = Counts
type Counts = M.Map Count Integer
data Count = ListCount | SymbolCount deriving (Eq, Ord, Show)
emptyXdlState :: XdlState
emptyXdlState = M.empty
incCount :: Count -> (Counts -> Counts)
incCount c = M.insertWith' (+) c 1
-- handling tokens
myToken :: (Token -> Maybe a) -> XdlParser a
myToken test = token showTok posTok testTok
  where
    showTok = show
    posTok  = const (initialPos "")
    testTok = test
-- Syntax of expressions
data Exp = Sym !B.ByteString
         | List ![Exp]
         deriving (Eq, Show)
expr = list <|> symbol
rparen = myToken $ \t -> case t of
    RParen -> Just ()
    _      -> Nothing
lparen = myToken $ \t -> case t of
    LParen -> Just ()
    _      -> Nothing
name = myToken $ \t -> case t of
    Name n -> Just n
    _      -> Nothing
list = do
    updateState $ incCount ListCount
    lparen
    xs <- many1 expr
    rparen
    return () -- return $! (List xs)
symbol = do
    updateState $ incCount SymbolCount
    name >> return () -- Sym `fmap` name
-- Top level parser
top :: XdlParser XdlState
top = do
    l <- many1 list
    eof
    getState
main = do
    args <- getArgs
    case args of
        [fname] -> do
            text <- B.readFile fname
            let result = runParser top emptyXdlState fname (tokenize text)
            putStrLn $ either show show result
        _ -> putStrLn "usage: parse filename" >> exitFailure
And the Lexer:
module Lexer (Token(..), tokenize) where
import qualified Data.ByteString.Lazy.Char8 as B
import Control.Monad
import Data.Char
import Data.List
import System.Environment
import System.Exit
data Token = LParen | RParen | Name B.ByteString deriving (Ord, Eq, Show)
type Input = B.ByteString
-- Processor returns Nothing if it can't process the Input
type Processor = Input -> Maybe ([Token], Input)
-- Tokenize ends the list when all processors return Nothing
tokenize :: Input -> [Token]
tokenize = concat . unfoldr processAll
  where
    processors = [doSpaces, doComment, doParens, doName]
    processAll :: Processor
    processAll bs =
        if B.null bs
            then Nothing
            else foldr mminus Nothing $ map ($ bs) processors
    mminus a@(Just _) _ = a
    mminus Nothing    b = b
doSpaces :: Processor
doSpaces bs = if B.null sp then Nothing else Just ([], nsp)
  where
    (sp, nsp) = B.span isSpace bs
doComment :: Processor
doComment bs =
    if B.pack "# " `B.isPrefixOf` bs
        then Just ([], B.dropWhile (/= '\n') bs)
        else Nothing
doParens :: Processor
doParens bs = case B.head bs of
    '(' -> Just ([LParen], B.tail bs)
    ')' -> Just ([RParen], B.tail bs)
    _   -> Nothing
doName :: Processor
doName bs = if B.null nsp then Nothing else Just ([Name nsp], sp)
  where
    (nsp, sp) = B.span (not . isRest) bs
    isRest c = isSpace c || c == ')' || c == '('
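As a quick sanity check of the lexer (a made-up input, not from the original post), a tiny S-expression tokenizes as expected:

    ghci> tokenize (B.pack "(a b)")
    [LParen,Name "a",Name "b",RParen]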
Regards,
-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee, Gadu: 111851, Jabber: kokr@jabberpl.org
"Simplicity is the ultimate sophistication" -- Leonardo da Vinci

On Mon, Mar 3, 2008 at 2:23 AM, Krzysztof Kościuszkiewicz wrote:
Dear Haskellers,
Another story from an (almost) happy Haskell user who finds himself overwhelmed by laziness/space leaks.
I'm trying to parse a large file (>600MB) with a single S-expression-like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers.
Well, I haven't read this through, but superficially, it looks like you're expecting the data structure to be constructed lazily. But...
-- Syntax of expressions
data Exp = Sym !B.ByteString
         | List ![Exp]
         deriving (Eq, Show)
It is declared as strict, so it's not going to be constructed lazily...

Luke
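To make the point concrete, here is a minimal standalone sketch of strict constructor fields (illustration only, not code from the thread): forcing a constructor with a strict field to WHNF also forces the field, so the tree cannot be left as an unevaluated skeleton.

    -- Toy demo of strict vs. lazy constructor fields.
    data LazyBox   = LazyBox   Int   -- lazy field: a thunk may sit inside
    data StrictBox = StrictBox !Int  -- strict field: forced on construction

    main :: IO ()
    main = do
        -- Forcing LazyBox to WHNF does not touch the field: this prints.
        case LazyBox undefined of
            LazyBox _ -> putStrLn "lazy field left unevaluated"
        -- Forcing StrictBox to WHNF forces the field: this throws.
        case StrictBox undefined of
            StrictBox _ -> putStrLn "never reached"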

Krzysztof Kościuszkiewicz wrote:
Another story from an (almost) happy Haskell user who finds himself overwhelmed by laziness/space leaks.
I'm trying to parse a large file (>600MB) with a single S-expression-like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers.
Note that Parsec has to parse the whole file before it can decide whether to return a result (Right _) or an error (Left _). GHC would have to be quite smart to eliminate the creation of the expression tree entirely.

The polyparse library (http://www.cs.york.ac.uk/fp/polyparse/) offers some lazy parsers; maybe one of those fits your needs. Text.ParserCombinators.Poly.StateLazy is the obvious candidate.

HTH,

Bertram
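A toy contrast of the two behaviours (my own sketch, not polyparse's actual API): a parser that returns Either must consume all of its input before it can commit to Right, while a lazy parser can stream its result as input is consumed.

    import Data.Char (isDigit)

    -- Strict flavour: must inspect the whole input before returning Right.
    parseAllDigits :: String -> Either String String
    parseAllDigits s
        | all isDigit s = Right s       -- 'all' traverses everything first
        | otherwise     = Left "non-digit found"

    -- Lazy flavour: output becomes available as input is consumed.
    parseDigitsLazily :: String -> String
    parseDigitsLazily = takeWhile isDigit

    main :: IO ()
    main = do
        print (take 5 (parseDigitsLazily ('1' : cycle "23")))  -- "12323": streams
        -- print (parseAllDigits ('1' : cycle "23"))  -- would never return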

On Mon, Mar 03, 2008 at 05:20:09AM +0100, Bertram Felgenhauer wrote:
Another story from an (almost) happy Haskell user who finds himself overwhelmed by laziness/space leaks.
I'm trying to parse a large file (>600MB) with a single S-expression-like structure. With the help of ByteStrings I'm down to 4min processing time in constant space. However, when I try to wrap the parse results in a data structure, the heap blows up - even though I never actually inspect the structure being built! This bugs me, so I come here looking for answers.
The polyparse library (http://www.cs.york.ac.uk/fp/polyparse/) offers some lazy parsers; maybe one of those fits your needs. Text.ParserCombinators.Poly.StateLazy is the obvious candidate.
I have tried both Poly.StateLazy and Poly.State and they work quite well - at least the space leak is eliminated. Now evaluation of the parser state blows the stack... The code is at http://hpaste.org/6310

Thanks in advance,
-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee, Gadu: 111851, Jabber: kokr@jabberpl.org
"Simplicity is the ultimate sophistication" -- Leonardo da Vinci

On Wed, Mar 12, 2008 at 12:12 PM, Krzysztof Kościuszkiewicz wrote:
I have tried both Poly.StateLazy and Poly.State and they work quite well - at least the space leak is eliminated. Now evaluation of the parser state blows the stack...
The code is at http://hpaste.org/6310
Thanks in advance,
The stack blows up when a bunch of unevaluated thunks build up, and you try to evaluate them. One way to determine where those thunks are getting built is to use GHC's "retainer" profiling. Retainer sets will show you the "call stack" that is holding on to memory. That can give you a clue where these thunks are being created.

To get finer-grained results, annotate your code with {-# SCC "..." #-} pragmas. Then you can filter the retainer profile by those annotations. That will help you determine where in a given function the thunks are being created.

If you need help with profiling basics, feel free to ask.

Justin
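As a concrete sketch (the placement is hypothetical, and the flags are the GHC 6.8-era profiling options), annotating incCount from the parser above and producing a retainer graph might look like:

    incCount :: Count -> (Counts -> Counts)
    incCount c = {-# SCC "incCount" #-} M.insertWith' (+) c 1

    $ ghc -O2 -prof -auto-all --make parse.hs
    $ ./parse input.xdl +RTS -hr -RTS
    $ hp2ps -c parse.hp   # renders the heap profile parse.hp to parse.ps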

On Wed, Mar 12, 2008 at 12:34:38PM -0700, Justin Bailey wrote:
The stack blows up when a bunch of unevaluated thunks build up, and you try to evaluate them. One way to determine where those thunks are getting built is to use GHC's "retainer" profiling. Retainer sets will show you the "call stack" that is holding on to memory. That can give you a clue where these thunks are being created. To get finer-grained results, annotate your code with {-# SCC "..." #-} pragmas. Then you can filter the retainer profile by those annotations. That will help you determine where in a given function the thunks are being created.
If you need help with profiling basics, feel free to ask.
I'm not entirely sure if I understand retainer profiling correctly... So please clarify if you spot any obvious blunders.

Retainers are thunks or objects on the stack that keep references to live objects. All retainers of an object are called the object's retainer set. Now when one makes a profiling run, say with ./jobname +RTS -p -hr, the graph references retainer sets from jobname.prof. My understanding is that it is the total size of all objects retained by the retainer sets that is being plotted, correct?

About decoding the sets from jobname.prof - for example in

    SET 2 = {<...>}
    SET 16 = {<...>, <...>}
    SET 18 = {<...>, <...>}

{...} means it's a set, and <cc0, cc1, ..., ccN> is the retainer cost centre (ccN) and hierarchy of parent cost centres up to the "top level" (cc0)? My understanding is that SET 18 above refers to objects that are retained by exactly the two specified cost centres, right?

Finally, what is the MAIN.SYSTEM retainer?

On Thu, Mar 13, 2008 at 4:50 PM, Krzysztof Kościuszkiewicz wrote:
Retainers are thunks or objects on the stack that keep references to live objects. All retainers of an object are called the object's retainer set. Now when one makes a profiling run, say with ./jobname +RTS -p -hr, the graph references retainer sets from jobname.prof. My understanding is that it is the total size of all objects retained by the retainer sets that is being plotted, correct?
Yes, all retainer sets are being profiled. However, you can FILTER the retainer sets profiled to those containing certain cost-centres. This is a key point because it allows you to "divide-and-conquer" when tracking down a retainer leak. That is, if you filter to a certain cost-centre and the retainer graph is flat, you know that cost-centre is not involved.

For example, if you have a cost-centre annotation like {-# SCC "leaky" #-} in your code, you can filter the retainer set like this:

    Leaky.exe +RTS -hr -hCleaky -RTS

Review the documentation for other options.
About decoding the sets from jobname.prof - for example in

    SET 2 = {<...>}
    SET 16 = {<...>, <...>}
    SET 18 = {<...>, <...>}

{...} means it's a set, and <cc0, cc1, ..., ccN> is the retainer cost centre (ccN) and hierarchy of parent cost centres up to the "top level" (cc0)? My understanding is that SET 18 above refers to objects that are retained by exactly the two specified cost centres, right?
The docs say "An object B retains object A if (i) B is a retainer object and (ii) object A can be reached by recursively following pointers starting from object B, but not meeting any other retainer objects on the way. Each live object is retained by one or more retainer objects, collectively called its retainer set ..."

That says to me that SET 18 above is the set of all objects which are retained by those two "call stacks", and only those call stacks. The individual <..> items aren't "call stacks", but I think they refer to where the retaining object (B in the paragraph) was itself retained, so they are like call stacks. My intuition is very fuzzy here.
Finally, what is the MAIN.SYSTEM retainer?
I think that is "everything else" - any object created in the runtime system that is not directly attributable to something being profiled. Maybe it is objects from libraries that were not compiled with profiling? I imagine objects created by the GHC primitives would fall in this category too.

Since someone else found your space leak, does the retainer profiling advice point to it? I'd like to know if it is actually accurate or not! I've only applied it in some very limited situations.

Justin

Krzysztof Kościuszkiewicz wrote:
I have tried both Poly.StateLazy and Poly.State and they work quite well - at least the space leak is eliminated. Now evaluation of the parser state blows the stack...
The code is at http://hpaste.org/6310
Apparently, stUpdate is too lazy. I'd define

    stUpdate' :: (s -> s) -> Parser s t ()
    stUpdate' f = stUpdate f >> stGet >>= (`seq` return ())

and try using stUpdate' instead of stUpdate in incCount.

HTH,

Bertram
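For comparison, the same idea expressed with Parsec's standard state primitives (a sketch: getState and setState are part of Text.ParserCombinators.Parsec, but this updateState' helper is hypothetical) forces the new state before storing it:

    import Text.ParserCombinators.Parsec

    -- Hypothetical strict variant of Parsec's updateState: force the new
    -- state to WHNF before storing it, so chains of (f (f (f s))) thunks
    -- never pile up between uses of the state.
    updateState' :: (st -> st) -> GenParser tok st ()
    updateState' f = do
        s <- getState
        setState $! f s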

On Thu, Mar 13, 2008 at 05:52:05PM +0100, Bertram Felgenhauer wrote:
... Now evaluation of the parser state blows the stack...
The code is at http://hpaste.org/6310
Apparently, stUpdate is too lazy. I'd define
stUpdate' :: (s -> s) -> Parser s t ()
stUpdate' f = stUpdate f >> stGet >>= (`seq` return ())
and try using stUpdate' instead of stUpdate in incCount.
Yes, that solves the stack issue. Thanks!

-- 
Krzysztof Kościuszkiewicz
Skype: dr.vee, Gadu: 111851, Jabber: kokr@jabberpl.org
"Simplicity is the ultimate sophistication" -- Leonardo da Vinci
participants (5)
- Bertram Felgenhauer
- Justin Bailey
- Krzysztof Kościuszkiewicz
- Krzysztof Kościuszkiewicz
- Luke Palmer