
On Wednesday 08 September 2010 06:44:32, Bryan O'Sullivan wrote:
I have a Replace.hs benchmark in the main text repo, just to be sure we're talking about the same thing.
Grabbed the darcs repo, compiled with the darcs version and also with the installed text package, both exhibit the same behaviour - what I reported before.
Factoring out the time spent on I/O, with GHC HEAD, my replace code takes twice the space and time of that in the stringsearch package.
That's very good. Twice the space is a given for mostly ASCII text, twice the time is okay, I think. My timings are quite different, but that's probably because 6.12.3's inliner doesn't give the full fusion benefit, so it'll improve automatically with the next GHC release.
Given that the space involved is just 121KB maximum residency while processing a 124MB file, I'm not concerned about it.
I wouldn't either.
And the time required isn't a bad place to start from, I think.
By the way, as this implies, I can't reproduce your space behaviour at all.
That's surprising. Have you made sure to replace a pattern which does not occur in the text? Can you reproduce the behaviour with a) Data.List.intersperse instead of the lazier version now used, b) ghc-6.12.* instead of HEAD? Anyway, I would've thought that with split pat src | null pat = emptyError "split" | isSingleton pat = splitBy (== head pat) src | otherwise = go 0 (indices pat src) src where go _ [] cs = [cs] go !i (x:xs) cs = let h :*: t = splitAtWord (x-i) cs in h : go (x+l) xs (dropWords l t) l = foldlChunks (\a (T.Text _ _ b) -> a + fromIntegral b) 0 pat you can't start returning chunks before it's known whether the list of indices is empty, so split would have O(index of pattern) space behaviour. If HEAD manages to make the chunks available before they are complete (before it's known how long they will be), it's even awesomer than I'd have dared to hope. Okay, so I'll have to try HEAD.
I can now say more. Looking at Data.Text.Lazy.replace,
replace s d = intercalate d . split s
, I also got a space leak with that for BS.Lazy's intercalate and
stringsearch's split.
How did you observe the space leak? Looking at -hT charted with hp2ps shows me nothing, and the program executes in constant space regardless of input size.
top and +RTS -s. I've attached the -hT graph of a run on a 74.3MB file with a pattern that does not occur. It shows exactly the behaviour I expected from the code of split, pretty constant growth of the heap until about twice the file size is reached, then fast and pretty constant shrinking as the text is output. The graphs are much more interesting if you do replacements of patterns without large gaps between occurrences.