Make lines stricter to fix a space leak

Proposal: A stricter implementation of lines. Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC. The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line. Discussion period: Three weeks, until 15th October (because of ICFP). Ticket: http://hackage.haskell.org/trac/ghc/ticket/4334 Cheers, Daniel

On Fri, Sep 24, 2010 at 09:21:19PM +0200, Daniel Fischer wrote:
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
I think this changes the definition from one that currently has a space leak with GHC, to one which necessarily must have a space leak (as all of l must be held in memory while we look for a newline). IIRC GHC's GC does have an optimisation which I believe would apply here, where it reduces essentially (snd (x, y)) in the heap to y, but that that optimisation isn't always good enough (it has a threshold for how deep it is willing to reduce, and a small loop like this will generates pairs and selectors faster than the GC is willing to reduce them). I thought we had a ticket about that, but I can't find it now. If that was fixed then we could have constant space usage. Thanks Ian

On Saturday 25 September 2010 02:55:24, Ian Lynagh wrote:
On Fri, Sep 24, 2010 at 09:21:19PM +0200, Daniel Fischer wrote:
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
I think this changes the definition from one that currently has a space leak with GHC, to one which necessarily must have a space leak (as all of l must be held in memory while we look for a newline).
No, with the proposed implementation, e.g. counting line lengths runs in constant space: ./newTest 50000000 2 +RTS -s 50000000 50000001 50000002 18,070,120,752 bytes allocated in the heap 1,721,193,144 bytes copied during GC 34,164 bytes maximum residency (1642 sample(s)) 45,204 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 32825 collections, 0 parallel, 12.62s, 12.98s elapsed Generation 1: 1642 collections, 0 parallel, 0.52s, 0.59s elapsed vs. ./origTest 50000000 2 +RTS -s 50000000 50000001 50000002 18,070,120,940 bytes allocated in the heap 3,966,182,576 bytes copied during GC 446,080,496 bytes maximum residency (23 sample(s)) 10,059,688 bytes maximum slop 872 MB total memory in use (7 MB lost due to fragmentation) Generation 0: 34444 collections, 0 parallel, 13.81s, 24.92s elapsed Generation 1: 23 collections, 0 parallel, 6.55s, 27.90s elapsed Since break (== '\n') immediately constructs a pair, the pattern match also succeeds immediately after the first character has been scanned and thus the first line is availabe as it is scanned. If nothing else holds a reference to it, it can be garbage-collected incrementally as it is delivered and consumed.
IIRC GHC's GC does have an optimisation which I believe would apply here, where it reduces essentially (snd (x, y)) in the heap to y, but that that optimisation isn't always good enough (it has a threshold for how deep it is willing to reduce, and a small loop like this will generates pairs and selectors faster than the GC is willing to reduce them).
I thought we had a ticket about that, but I can't find it now.
If that was fixed then we could have constant space usage.
If that was fixed, that'd be much better of course, since the tuple space leak is a common problem. However, it's long known and not yet fixed for all cases, so it's probably hard to fix completely. In the meantime, patching things locally looks like the best option to me.
Thanks Ian
Cheers, Daniel

On Sat, Sep 25, 2010 at 03:45:15AM +0200, Daniel Fischer wrote:
On Saturday 25 September 2010 02:55:24, Ian Lynagh wrote:
On Fri, Sep 24, 2010 at 09:21:19PM +0200, Daniel Fischer wrote:
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
I think this changes the definition from one that currently has a space leak with GHC, to one which necessarily must have a space leak (as all of l must be held in memory while we look for a newline).
No, with the proposed implementation, e.g. counting line lengths runs in constant space:
Ah, yes, you are right, sorry. I don't have a strong opinion, then. We ought to put the old definition into the H98 and H2010 packages, though. Hmm, but currently Prelude always comes from base. Thanks Ian

On Saturday 25 September 2010 13:30:11, Ian Lynagh wrote:
We ought to put the old definition into the H98 and H2010 packages, though. Hmm, but currently Prelude always comes from base.
Hmm, that is a problem. I suppose one could argue for interpreting "In this chapter the entire Haskell Prelude is given. It constitutes a specification for the Prelude." somewhat laxer, but that smells of weaseling out. On the other hand, "GHC's implementation of array takes the value of an array slot from the last (index,value) pair in the list, and does no checking for duplicates. The reason for this is efficiency, pure and simple." , so there's a precedence for #ifdef __GLASGOW_HASKELL__ non-leaking implementation for GHC #else report implementation #endif It's not ideal, but the space leak may be a serious issue (cf. http://www.haskell.org/pipermail/haskell-cafe/2010-September/083177.html where Henning Thielemann wrote "I got several space leaks of this kind in the past. They are very annoying. They are especially annoying if input comes from the outside world, where people can attack them to crash your program because of memory exhaustion.").
Thanks Ian
Cheers, Daniel

On 25/09/10 12:30, Ian Lynagh wrote:
On Sat, Sep 25, 2010 at 03:45:15AM +0200, Daniel Fischer wrote:
On Saturday 25 September 2010 02:55:24, Ian Lynagh wrote:
On Fri, Sep 24, 2010 at 09:21:19PM +0200, Daniel Fischer wrote:
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
I think this changes the definition from one that currently has a space leak with GHC, to one which necessarily must have a space leak (as all of l must be held in memory while we look for a newline).
No, with the proposed implementation, e.g. counting line lengths runs in constant space:
Ah, yes, you are right, sorry. I don't have a strong opinion, then.
We ought to put the old definition into the H98 and H2010 packages, though. Hmm, but currently Prelude always comes from base.
Actually the haskell2010 package has its own Prelude, which could be different from the one in base if we wish. Cheers, Simon

On Sun, Sep 26, 2010 at 11:33:09AM +0100, Simon Marlow wrote:
On 25/09/10 12:30, Ian Lynagh wrote:
We ought to put the old definition into the H98 and H2010 packages, though. Hmm, but currently Prelude always comes from base.
Actually the haskell2010 package has its own Prelude, which could be different from the one in base if we wish.
Well, that'll teach me to only look at H98 and extrapolate. So we could give H98 its own Prelude too, hide the haskell98 package, and not have it linked by default. That would be consistent with defaulting to the H2010 (or H2011, as this'll be in 7.2) language too. This would also free us up to make other Prelude tidyups, e.g. to make Prelude.catch be the modern version. Thanks Ian

On 25/09/10 01:55, Ian Lynagh wrote:
On Fri, Sep 24, 2010 at 09:21:19PM +0200, Daniel Fischer wrote:
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
I think this changes the definition from one that currently has a space leak with GHC, to one which necessarily must have a space leak (as all of l must be held in memory while we look for a newline).
IIRC GHC's GC does have an optimisation which I believe would apply here, where it reduces essentially (snd (x, y)) in the heap to y, but that that optimisation isn't always good enough (it has a threshold for how deep it is willing to reduce, and a small loop like this will generates pairs and selectors faster than the GC is willing to reduce them).
There were two depth limits in the GC. The first affected chains of the form snd (x, snd (x, snd (x, ...))) this was fixed a while ago. The other one is: snd $ snd $ snd $ ... which still has a depth limit (16) to avoid unbounded recursion in the GC. This second form is less common I think than the other one, which cropped up quite a lot.
I thought we had a ticket about that, but I can't find it now.
Feel free to make a ticket if you like. I'd be inclined to wait and see if anyone actually runs into it in practice. The other problem in this area is that the simplifier tends to transform code such that selector thunks aren't recognisable any more, so the GC optimisation doesn't apply. I think we have a ticket for this one (but as I'm on a plane right now I can't find it). Cheers, Simon
If that was fixed then we could have constant space usage.
Thanks Ian
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

Simon Marlow wrote: [snip]
The other problem in this area is that the simplifier tends to transform code such that selector thunks aren't recognisable any more, so the GC optimisation doesn't apply. I think we have a ticket for this one (but as I'm on a plane right now I can't find it).
This is the ticket: http://hackage.haskell.org/trac/ghc/ticket/2607 HTH, Bertram

Am 24.09.2010 21:21, schrieb Daniel Fischer:
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
Discussion period: Three weeks, until 15th October (because of ICFP).
You make interesting proposals. Changing a let to a case! I wonder if a generic version costs performance? lines = breaksBy (== "\n") (or "linesBy" or "splitBy") Cheers Christian
Cheers, Daniel

On Monday 27 September 2010 10:54:56, Christian Maeder wrote:
I wonder if a generic version costs performance?
lines = breaksBy (== "\n") (or "linesBy" or "splitBy")
Cheers Christian
Gut feeling said it shouldn't and benchmarking supports that. So, is that a generic enough operation to add it to the Data.List API? If yes, the best name has to be found. On the one hand, splitBy or breaksBy sound nicer than linesBy, because generically, it has nothing to do with lines. On the other hand, neither break nor split[At] remove the separators while lines does. Also, there's linesBy in Data.List.Split (http://hackage.haskell.org/packages/archive/split/0.1.2.1/doc/html/Data- List-Split.html) which does exactly that. But Data.List.Split.linesBy is faster (for reasonably short lines). However, it dies a horrible death (Stack space overflow: current size 67108864 bytes.) for very long lines and is then much slower if you give it enough stack to complete. So I would say, put the generic version into Data.List as linesBy. I think that deserves its own proposal.

Daniel Fischer wrote:
Proposal: A stricter implementation of lines. Reason: The current implementation causes a space leak Ticket: http://hackage.haskell.org/trac/ghc/ticket/4334
I propose the following combinator approach as an alternative: lines :: String -> [String] lines = map (takeWhile (/= '\n')) . takeWhile (not . null) . iterate (drop 1 . dropWhile (/= '\n')) GHC fuses that into a tight loop. So in addition to solving the space leak, it is faster and uses less heap than both the existing implementation and Daniel's proposal. It is also cleaner and easier to read, in my opinion. Thanks, Yitz

On Monday 27 September 2010 19:40:52, Yitzchak Gale wrote:
Daniel Fischer wrote:
Proposal: A stricter implementation of lines. Reason: The current implementation causes a space leak Ticket: http://hackage.haskell.org/trac/ghc/ticket/4334
I propose the following combinator approach as an alternative:
lines :: String -> [String] lines = map (takeWhile (/= '\n')) . takeWhile (not . null) . iterate (drop 1 . dropWhile (/= '\n'))
GHC fuses that into a tight loop.
Not here. Neither 6.12.3 nor 6.13.20100917 did that. Here, that uses even more memory than the current lines. The core is ewww, you get functions for (/= '\n') [two] and (not . null), apart from that it's almost verbatim the Haskell code (split in chunks at the compositions). Pity. Would've been nice to fix the leak without changing the semantics (and getting faster to boot).
So in addition to solving the space leak, it is faster and uses less heap than both the existing implementation and Daniel's proposal.
I assume you tested on a 64-bit platform with lots of registers? For us poor souls stuck on 32-bits, things look apparently *very* different. Even using Data.List.Stream instead of the Prelude functions didn't help much.
It is also cleaner and easier to read, in my opinion.
Thanks, Yitz
Cheers, Daniel

On Monday 27 September 2010 20:36:36, Daniel Fischer wrote:
Pity. Would've been nice to fix the leak without changing the semantics
I think we can. :D lines :: String -> [String] lines "" = [] lines s = uncurry (:) $ case break (== '\n') s of (l,s') -> (l, case s' of [] -> [] _:s'' -> lines s'') seems to do it. Although the Core looks as though it might leak, Rec { NewLines.lines :: GHC.Base.String -> [GHC.Base.String] GblId [Arity 1 NoCafRefs Str: DmdType S] NewLines.lines = \ (ds_diq :: [GHC.Types.Char]) -> case ds_diq of wild_B1 { [] -> GHC.Types.[] @ GHC.Base.String; : ipv_siy ipv1_siz -> let { p_sjz :: ([GHC.Types.Char], [[GHC.Types.Char]]) LclId [Str: DmdType] p_sjz = case GHC.List.$wbreak @ GHC.Types.Char lvl_rjC wild_B1 of _ { (# ww1_aj2, ww2_aj3 #) -> (ww1_aj2, case ww2_aj3 of _ { [] -> GHC.Types.[] @ [GHC.Types.Char]; : _ s''_adi -> NewLines.lines s''_adi }) } } in GHC.Types.: @ [GHC.Types.Char] (case p_sjz of _ { (x_aiM, _) -> x_aiM }) (case p_sjz of _ { (_, y_aiS) -> y_aiS }) -- this looks like the leak Wadler described, -- but it behaves differently } end Rec } , it ran in constant memory in my tests. Allocation behaviour is very close to that of the originally proposed version (as far as I can tell identical to the current implementation of lines), reported maximum residency usually identical, sometimes a few dozen bytes difference in either direction. Speed seems also indistinguishable (both are faster than the current on long lines due to less GC, no measurable difference for short lines). The strictness properties are identical to those of the current implementation. Would anybody care to check that the above runs in constant memory not only on my computer? If that's verified, I would propose the above strictness-preserving rather than my original proposal. Cheers, Daniel

Daniel Fischer schrieb:
On Monday 27 September 2010 22:18:40, Daniel Fischer wrote:
Although the Core looks as though it might leak,
And with 6.8.3 it does leak indeed, but no more with 6.10.* and 6.12.*. All produced identical Core, so it's probably a change in the garbage collector.
How can we be sure it runs in constant memory in future GHC versions? Or in other Haskell compilers? Those fixes to memory leaks look very fragile to me.

On Monday 04 October 2010 19:32:54, Henning Thielemann wrote:
Daniel Fischer schrieb:
On Monday 27 September 2010 22:18:40, Daniel Fischer wrote:
Although the Core looks as though it might leak,
And with 6.8.3 it does leak indeed, but no more with 6.10.* and 6.12.*. All produced identical Core, so it's probably a change in the garbage collector.
How can we be sure it runs in constant memory in future GHC versions?
We can't.
Or in other Haskell compilers?
Nor can we here. If it doesn't work for other compilers, #ifdef __GLASGOW_HASKELL__ would be a workaround.
Those fixes to memory leaks look very fragile to me.
Yes, they are fragile. However, since (to my surprise) uncurry isn't available in Data.List, I've tried with a special function for uncurry (:), cons :: (a,[a]) -> [a] cons (x,xs) = x : xs vlines :: String -> [String] vlines "" = [] vlines s = cons $ case break (== '\n') s of (l, s') -> (l, case s' of [] -> [] _:s'' -> vlines s'') , which gives slightly less fragile looking Core ( case GHC.List.$wbreak @ GHC.Types.Char lvl_rjf wild_B1 of _ { (# ww1_aiR, ww2_aiS #) -> GHC.Types.: @ [GHC.Types.Char] ww1_aiR (case ww2_aiS of _ { [] -> GHC.Types.[] @ [GHC.Types.Char]; : _ s''_adv -> VLines.vlines s''_adv }) --the second argument of (:) doesn't contain a reference to the first anymore). Well, 6.8.3 does also leak with that, but that also leaks with a naked case. So yeah, it's a fragile fix, but IMO better than leaving the space leak unattended because we haven't a perfect solution.

On Mon, 4 Oct 2010, Daniel Fischer wrote:
On Monday 04 October 2010 19:32:54, Henning Thielemann wrote:
Those fixes to memory leaks look very fragile to me.
Yes, they are fragile.
My concern is, whether any fix is really a fix. I have seen several clever solutions for my memory leaks, that start to leak again in the greater context where I observed the leak the first time.

On Monday 04 October 2010 21:01:33, Henning Thielemann wrote:
On Mon, 4 Oct 2010, Daniel Fischer wrote:
On Monday 04 October 2010 19:32:54, Henning Thielemann wrote:
Those fixes to memory leaks look very fragile to me.
Yes, they are fragile.
My concern is, whether any fix is really a fix. I have seen several clever solutions for my memory leaks, that start to leak again in the greater context where I observed the leak the first time.
Might happen here too. I have no idea for a guaranteed solution.

On Monday 04 October 2010 20:19:39, Daniel Fischer wrote:
Yes, they are fragile. However, since (to my surprise) uncurry isn't available in Data.List, I've tried with a special function for uncurry (:),
cons :: (a,[a]) -> [a] cons (x,xs) = x : xs
vlines :: String -> [String] vlines "" = [] vlines s = cons $ case break (== '\n') s of (l, s') -> (l, case s' of [] -> [] _:s'' -> vlines s'')
, which gives slightly less fragile looking Core (
--the second argument of (:) doesn't contain a reference to the first anymore).
Arrgh. And that is again as strict as the naked case, cons needs a lazy pattern, which means it gets the same core as uncurry (:). Still, it doesn't leak on current GHCs, so for the time being, it's a fix (not on JHC though, that leaks with all implementations I tried). If anybody can offer a more stable fix of the leak, I'd be happy.

On 24 September 2010 15:21, Daniel Fischer
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
Discussion period: Three weeks, until 15th October (because of ICFP).
My gut instinct is that we should not make lines stricter. Generally the list library is as lazy as possible, except when there are compelling reasons otherwise. I am not yet convinced that this proposal meets that standard. Have we thought about the opposite issue, that there may be programs that rely on the current non-strict version for correctness or memory behaviour? We currently have an open proposal on making intersperse less strict because it caused practical problems (unexpected memory behaviour). There's also an issue of consistency with other functions in the list library. The lines function is similar in many ways to group / groupBy, and with the proposed change the strictness properties of these functions would be inconsistent. Such inconsistency has a cost in terms of programmers being able to predict and reason about strictness and space use (simply because if its inconsistent then it is harder to remember). The strictness of lines is pretty subtle. I think it would help our discussion if people reading this thread look at the strictness properties given in the ticket. Duncan

On Monday 04 October 2010 20:15:07, Duncan Coutts wrote:
On 24 September 2010 15:21, Daniel Fischer
wrote: Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
Discussion period: Three weeks, until 15th October (because of ICFP).
My gut instinct is that we should not make lines stricter.
I didn't see a way without changing the strictness properties at first. Now there *seems* to be one, although I have doubts concerning its reliability to fix the leak. It's certainly more delicate than the stricter version.
Generally the list library is as lazy as possible, except when there are compelling reasons otherwise. I am not yet convinced that this proposal meets that standard.
Have we thought about the opposite issue, that there may be programs that rely on the current non-strict version for correctness or memory behaviour?
The possibility that some programmes rely on the current behaviour certainly exists. Though, the only change in behaviour is when the first Char on a line is bottom, so I don't expect it would affect many programmes. The memory behaviour can only be worse (by more than a few bytes) when the first Char on a line is the result of a space-demanding computation for the stricter version and not at all (as far as I can see) for the new suggestion. I'd be surprised if that affected any programme negatively.
We currently have an open proposal on making intersperse less strict because it caused practical problems (unexpected memory behaviour).
There's also an issue of consistency with other functions in the list library. The lines function is similar in many ways to group / groupBy, and with the proposed change the strictness properties of these functions would be inconsistent. Such inconsistency has a cost in terms of programmers being able to predict and reason about strictness and space use (simply because if its inconsistent then it is harder to remember).
Agreed. That's why I favour the new suggestion (until someone comes forth with a reliable way to fix the leak or someone finds that it doesn't fix the leak in real use cases - it worked in my tests, but those were fairly simple).
The strictness of lines is pretty subtle. I think it would help our discussion if people reading this thread look at the strictness properties given in the ticket.
Duncan
participants (10)
-
Bertram Felgenhauer
-
Christian Maeder
-
Daniel Fischer
-
Duncan Coutts
-
Felipe Lessa
-
Henning Thielemann
-
Henning Thielemann
-
Ian Lynagh
-
Simon Marlow
-
Yitzchak Gale