
On Saturday 25 September 2010 02:55:24, Ian Lynagh wrote:
On Fri, Sep 24, 2010 at 09:21:19PM +0200, Daniel Fischer wrote:
Proposal: A stricter implementation of lines.
Reason: The current implementation causes a space leak (cf. http://homepages.inf.ed.ac.uk/wadler/papers/leak/leak.ps), at least in GHC.
The proposed implementation fixes the leak at the small cost of being stricter if the first _|_ in the String is the first character of a line.
I think this changes the definition from one that currently has a space leak with GHC, to one which necessarily must have a space leak (as all of l must be held in memory while we look for a newline).
No, with the proposed implementation, e.g. counting line lengths runs in constant space: ./newTest 50000000 2 +RTS -s 50000000 50000001 50000002 18,070,120,752 bytes allocated in the heap 1,721,193,144 bytes copied during GC 34,164 bytes maximum residency (1642 sample(s)) 45,204 bytes maximum slop 2 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 32825 collections, 0 parallel, 12.62s, 12.98s elapsed Generation 1: 1642 collections, 0 parallel, 0.52s, 0.59s elapsed vs. ./origTest 50000000 2 +RTS -s 50000000 50000001 50000002 18,070,120,940 bytes allocated in the heap 3,966,182,576 bytes copied during GC 446,080,496 bytes maximum residency (23 sample(s)) 10,059,688 bytes maximum slop 872 MB total memory in use (7 MB lost due to fragmentation) Generation 0: 34444 collections, 0 parallel, 13.81s, 24.92s elapsed Generation 1: 23 collections, 0 parallel, 6.55s, 27.90s elapsed Since break (== '\n') immediately constructs a pair, the pattern match also succeeds immediately after the first character has been scanned and thus the first line is availabe as it is scanned. If nothing else holds a reference to it, it can be garbage-collected incrementally as it is delivered and consumed.
IIRC GHC's GC does have an optimisation which I believe would apply here, where it reduces essentially (snd (x, y)) in the heap to y, but that that optimisation isn't always good enough (it has a threshold for how deep it is willing to reduce, and a small loop like this will generates pairs and selectors faster than the GC is willing to reduce them).
I thought we had a ticket about that, but I can't find it now.
If that was fixed then we could have constant space usage.
If that was fixed, that'd be much better of course, since the tuple space leak is a common problem. However, it's long known and not yet fixed for all cases, so it's probably hard to fix completely. In the meantime, patching things locally looks like the best option to me.
Thanks Ian
Cheers, Daniel