
Recently I stumbled upon a rather non-obvious behavior in what seems
like a fairly common use a vector[1]. It appears that, despite the claim
of the documentation, dropWhile can result in copies, resulting in a 1MB
vector ballooning to several gigabytes within a few seconds. This has
been reproduced with ghc 7.4.1 and 7.0.3 with -O and -prof.
My knowledge of the internals of vector is quite limited, so I'll leave
potential explanations to those with better understanding; below I've
attached some relevant discussion from #haskell. Let me know if there's
anything at all I could do to help.
Cheers,
- Ben
[1] http://hackage.haskell.org/trac/ghc/ticket/5893
Test case:
import qualified Data.Vector.Unboxed as V
type Time = Int
spansPhotons :: V.Vector Time -> [(Time,Time)] -> [V.Vector Time]
spansPhotons ts spans = map (\(start,_)->V.takeWhile (

See http://trac.haskell.org/vector/ticket/78. I'll fix this, but for now you can you fst (span f) instead of dropWhile f, span doesn't copy. Sorry about this. Roman On 24/02/2012, at 05:46, Ben Gamari wrote:
Recently I stumbled upon a rather non-obvious behavior in what seems like a fairly common use a vector[1]. It appears that, despite the claim of the documentation, dropWhile can result in copies, resulting in a 1MB vector ballooning to several gigabytes within a few seconds. This has been reproduced with ghc 7.4.1 and 7.0.3 with -O and -prof.
My knowledge of the internals of vector is quite limited, so I'll leave potential explanations to those with better understanding; below I've attached some relevant discussion from #haskell. Let me know if there's anything at all I could do to help.
Cheers,
- Ben
[1] http://hackage.haskell.org/trac/ghc/ticket/5893
Test case:
import qualified Data.Vector.Unboxed as V
type Time = Int
spansPhotons :: V.Vector Time -> [(Time,Time)] -> [V.Vector Time] spansPhotons ts spans = map (\(start,_)->V.takeWhile (
main = do let times = V.generate 1000000 (*100) spans = [(10000*i, 10000*i+5000) | i <- [1..10000]] let bursts = spansPhotons times spans print $ sum $ map V.length bursts print $ map (const 1) $ bursts
00:16 < rwbarton> there the problem is that the bursts aren't sharing the same memory 00:17 < rwbarton> like they would if you wrote a similar program with ByteString 00:17 < rwbarton> the thing is that the Vector implementation of dropWhile is good for the stream fusion situation where you have a pipeline of operations 00:17 < quintessence> Yeah, I don't think dropWhile can participate in stream fusion and not copy 00:18 < rwbarton> but not good when you want to dropWhile many different times on the same input 00:18 < rwbarton> right 00:18 < quintessence> because when you have dropWhile p . map f the thing you want to share with doesn't exist 00:18 < rwbarton> so I don't know what the resolution would be 00:18 < bgamari> ouch 00:18 < rwbarton> besides giving some more manual control over stream fusion perhaps 00:19 < bgamari> Yep 00:20 < bgamari> That is a tricky one 00:20 < dmwit> dropWhileFusable + dropWhileCopy 00:20 < dmwit> err... 00:21 < dmwit> dropWhileFusable + dropWhileDoesn'tCopy, I guess 00:21 < quintessence> Could you implement dropWhile the non-fusing way and then have a RULE for dropWhile p . unstream = unstream . S.dropWhile p? 00:21 < rwbarton> at the least you could implement dropWhile' the non-fusing way; maybe that already exists somewhere in Data.Vector 00:21 < tibbe> Isn't there a function to explicitly copy? Like ByteString's clone? 00:21 < rwbarton> there is 00:21 < rwbarton> but the goal here is to not copy 00:22 < tibbe> I'm joining this discussion late I'm afraid :) 00:22 < tibbe> it should be possible to write rules that only fire when fusion doesn't happen 00:22 < rwbarton> the issue is a program like http://hpaste.org/64262#a64263 ends up allocating a bunch of nonoverlapping vectors 00:22 < tibbe> they could do sharing, but the performance model might turn out to be confusing 00:22 < rwbarton> when it could share them (and the documentation would have you believe it does) 00:23 < tibbe> how would you document dropWhile? Might share the underlying vector? 00:23 < bgamari> tibbe: the performance model is already a bit confusing ;) 00:23 < tibbe> :) 00:24 < bgamari> I'm surprised I'm the first person to encounter this 00:25 < bgamari> The bug report is http://hackage.haskell.org/trac/ghc/ticket/5893#comment:1 00:29 < bgamari> rwbarton, Do you mind if I quote you in a message to libraries@?
participants (2)
-
Ben Gamari
-
Roman Leshchinskiy