I am not sure if there is a way to write this code in the IO monad that can reduce this overhead.
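One option that might help (a sketch on my part, not measured against this benchmark): defer each recursive step with unsafeInterleaveIO, so the tail of the list is only produced when the consumer demands it, restoring the streaming behaviour of the pure version. The usual lazy-IO caveats apply, since effects then run on demand:

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)

-- Lazily rebuild the list in IO: each tail is a deferred IO action,
-- run only when the consumer actually demands that part of the list,
-- so the whole result is never forced into memory up front.
fLazy :: [a] -> IO [a]
fLazy []       = return []
fLazy (x : xs) = do
  rest <- unsafeInterleaveIO (fLazy xs)
  return (x : rest)
```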
-harendra
On 14 May 2016 at 17:10, Harendra Kumar <harendra.kumar@gmail.com> wrote:
>
> You are right about the way the IO code is generated because of the ordering semantics. The IO version builds the entire list recursively before returning it, while the pure code builds it lazily. I wrote very simple versions to keep things clear:
>
> Pure version:
>
> f [] = []
> f (x : xs) = x : f xs
>
>
> time 11.08 ms (10.86 ms .. 11.34 ms)
> Measured for a million elements in the list
>
> 104,041,264 bytes allocated in the heap
> 16,728 bytes copied during GC
> 35,992 bytes maximum residency (1 sample(s))
> 21,352 bytes maximum slop
> 1 MB total memory in use (0 MB lost due to fragmentation)
>
>
> IO version:
> f [] = return []
> f (x : xs) = do
>   rest <- f xs
>   return $ x : rest
>
> time 79.66 ms (75.49 ms .. 82.55 ms)
>
> 208,654,560 bytes allocated in the heap
> 121,045,336 bytes copied during GC
> 27,679,344 bytes maximum residency (8 sample(s))
> 383,376 bytes maximum slop
> 66 MB total memory in use (0 MB lost due to fragmentation)
>
> Even though this is a small program that does little work, and therefore magnifies even small differences, I am not sure that the recursive vs. lazy building of the list completely explains a slowdown on the order of 7.5x. I am wondering whether there is anything here worth investigating and improving further.
>
> -harendra
>
> On 12 May 2016 at 05:41, Dan Doel <dan.doel@gmail.com> wrote:
> >
> > On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar
> > <harendra.kumar@gmail.com> wrote:
> > > Thanks Dan, that helped. I did notice and suspect the update frame and the
> > > unboxed tuple, but given my limited knowledge of GHC/Core/STG/Cmm I was
> > > not sure what was going on. In fact, I thought that the intermediate tuple
> > > should get optimized out, since it is required only because of the RealWorld
> > > token, which is not real. But perhaps it is difficult to see that at this
> > > level?
> >
> > The token exists as far as the STG level is concerned, I think,
> > because that is the only thing ensuring that things happen in the
> > right order. And the closure must be built to have properly formed
> > STG. So optimizing away the closure building would have to happen at a
> > level lower than STG, and I guess there is no such optimization. I'm
> > not sure how easy it would be to do.
> >
> > > What you are saying may be true for the current implementation but in theory
> > > can we eliminate the intermediate closure?
> >
> > I don't know. I'm a bit skeptical that building this one closure is
> > the only thing causing your code to be a factor of 5 slower. For
> > instance, another difference in the core is that the recursive call
> > corresponding to the result s2 happens before allocating the
> > additional closure. That is the case statement that unpacks the
> > unboxed tuple. And the whole loop happens this way, so it is actually
> > building a structure corresponding to the entire output list in memory
> > rather eagerly.
> >
> > By contrast, your pure function is able to act in a streaming fashion,
> > if consumed properly, where only enough of the result is built to keep
> > driving the rest of the program. It probably runs in constant space,
> > while your IO-based loop has a footprint linear in the size of the
> > input list (in addition to having slightly more overhead per character
> > because of the one extra thunk), because it is a more eager program.
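> > For instance (my illustration, not code from the original message), the
> > pure version even handles an infinite input, because only the demanded
> > prefix of the result is ever constructed:

```haskell
-- The pure identity-style traversal from earlier in the thread.
f :: [a] -> [a]
f []       = []
f (x : xs) = x : f xs

-- Only the first five cons cells of the result are ever built, even
-- though the input list is infinite.
streamDemo :: [Int]
streamDemo = take 5 (f [1 ..])   -- [1,2,3,4,5]
```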
> >
> > And having an asymptotically larger memory footprint is more likely to
> > explain a 5x slowdown than allocating one extra thunk for each list
> > concatenation, I think. (Of course, it could also be some other
> > factor, as well.)
> >
> > You should probably run with +RTS -s (compiling with -rtsopts), and
> > see if the IO version has a much larger maximum residency.
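> > Something along these lines (the module name Main.hs is just an
> > example); this is a command fragment, not meant as a script:

```shell
# Compile with RTS options enabled, then run with GC/residency statistics.
ghc -O2 -rtsopts Main.hs
./Main +RTS -s
```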
> >
> > Anyhow, this is fundamental to writing the algorithm in IO: it is a
> > sequence of steps that must happen in order, the number of steps is
> > proportional to the length of the input list, and part of those steps
> > is putting stuff in memory for the results.
> >
> > > So why is that? Why can't we generate (++) inline similar to (:)? How do we
> > > make this decision? Is there a theoretical reason for this or this is just
> > > an implementation artifact?
> >
> > The difference between these two is that (++) is a function call, and
> > (:) is a constructor. Constructors are a special case, and don't need
> > to have all the provisions that functions in general do. The best way
> > to learn what the differences are is probably to read the paper about
> > the STG machine.
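> > Concretely, (++) is itself defined by recursion (essentially as in the
> > Prelude; renamed here to avoid a clash), so every use is a call into a
> > loop at runtime, whereas each (:) compiles to a direct allocation:

```haskell
-- (++) is an ordinary recursive function: each use has to enter this
-- loop at runtime. (:) by contrast is a constructor, which the code
-- generator can allocate inline without a call.
append :: [a] -> [a] -> [a]
append []       ys = ys
append (x : xs) ys = x : append xs ys
```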
> >
> > Hope this is a more fruitful lead, but it may be that there's not a
> > lot that can be done, without doing some baroque things to your
> > algorithm, because of the necessity of its using IO.
> >
> > -- Dan