I am not sure if there is a way to write this code in the IO monad that can reduce this overhead.
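One option that might help (a sketch on my part, not measured against this benchmark): defer each recursive step with unsafeInterleaveIO, so the tail of the list is only produced when the consumer demands it, restoring the streaming behaviour of the pure version. The usual lazy-IO caveats apply, since effects then run on demand:

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)

-- Lazily rebuild the list in IO: each tail is a deferred IO action,
-- run only when the consumer actually demands that part of the list,
-- so the whole result is never forced into memory up front.
fLazy :: [a] -> IO [a]
fLazy []       = return []
fLazy (x : xs) = do
  rest <- unsafeInterleaveIO (fLazy xs)
  return (x : rest)
```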
-harendra
On 14 May 2016 at 17:10, Harendra Kumar <harendra.kumar@gmail.com> wrote:
>
> You are right about the way the IO code is generated because of the ordering semantics. The IO version builds the entire list recursively before returning it, while the pure code builds it lazily. I wrote very simple versions to keep things clear:
>
> Pure version:
>
> f [] = []
> f (x : xs) = x : f xs
>
>
> time 11.08 ms (10.86 ms .. 11.34 ms)
> Measured for a million elements in the list
>
> 104,041,264 bytes allocated in the heap
> 16,728 bytes copied during GC
> 35,992 bytes maximum residency (1 sample(s))
> 21,352 bytes maximum slop
> 1 MB total memory in use (0 MB lost due to fragmentation)
>
>
> IO version:
> f [] = return []
> f (x : xs) = do
>   rest <- f xs
>   return $ x : rest
>
> time 79.66 ms (75.49 ms .. 82.55 ms)
>
> 208,654,560 bytes allocated in the heap
> 121,045,336 bytes copied during GC
> 27,679,344 bytes maximum residency (8 sample(s))
> 383,376 bytes maximum slop
> 66 MB total memory in use (0 MB lost due to fragmentation)
>
> Even though this is a small program that does little work, and therefore magnifies even small differences, I am not sure that the recursive vs. lazy building of the list completely explains a slowdown on the order of 7.5x. I am wondering whether there is anything here worth investigating and improving further.
>
> -harendra
>
> On 12 May 2016 at 05:41, Dan Doel <dan.doel@gmail.com> wrote:
> >
> > On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar
> > <harendra.kumar@gmail.com> wrote:
> > > Thanks Dan, that helped. I did notice and suspect the update frame and the
> > > unboxed tuple, but given my limited knowledge of GHC/Core/STG/Cmm I was
> > > not sure what was going on. In fact, I thought that the intermediate tuple
> > > should get optimized out, since it is required only because of the RealWorld
> > > token, which is not real. But perhaps it is difficult to see that at this
> > > level?
> >
> > The token exists as far as the STG level is concerned, I think,
> > because that is the only thing ensuring that things happen in the
> > right order. And the closure must be built to have properly formed
> > STG. So optimizing away the closure building would have to happen at a
> > level lower than STG, and I guess there is no such optimization. I'm
> > not sure how easy it would be to do.
> >
> > > What you are saying may be true for the current implementation but in theory
> > > can we eliminate the intermediate closure?
> >
> > I don't know. I'm a bit skeptical that building this one closure is
> > the only thing causing your code to be a factor of 5 slower. For
> > instance, another difference in the core is that the recursive call
> > corresponding to the result s2 happens before allocating the
> > additional closure. That is the case statement that unpacks the
> > unboxed tuple. And the whole loop happens this way, so it is actually
> > building a structure corresponding to the entire output list in memory
> > rather eagerly.
> >
> > By contrast, your pure function is able to act in a streaming fashion,
> > if consumed properly, where only enough of the result is built to keep
> > driving the rest of the program. It probably runs in constant space,
> > while your IO-based loop has a footprint linear in the size of the
> > input list (in addition to having slightly more overhead per character
> > because of the one extra thunk), because it is a more eager program.
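> > For instance (my illustration, not code from the original message), the
> > pure version even handles an infinite input, because only the demanded
> > prefix of the result is ever constructed:

```haskell
-- The pure identity-style traversal from earlier in the thread.
f :: [a] -> [a]
f []       = []
f (x : xs) = x : f xs

-- Only the first five cons cells of the result are ever built, even
-- though the input list is infinite.
streamDemo :: [Int]
streamDemo = take 5 (f [1 ..])   -- [1,2,3,4,5]
```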
> >
> > And having an asymptotically larger memory footprint is more likely to
> > explain a 5x slowdown than allocating one extra thunk for each list
> > concatenation, I think. (Of course, it could also be some other
> > factor, as well.)
> >
> > You should probably run with +RTS -s (compiling with -rtsopts), and
> > see if the IO version has a much larger maximum residency.
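> > Something along these lines (the module name Main.hs is just an
> > example); this is a command fragment, not meant as a script:

```shell
# Compile with RTS options enabled, then run with GC/residency statistics.
ghc -O2 -rtsopts Main.hs
./Main +RTS -s
```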
> >
> > Anyhow, this is fundamental to writing the algorithm in IO: it is a
> > sequence of steps that must happen in order, the number of steps is
> > proportional to the length of the input list, and part of those steps
> > is putting stuff in memory for the results.
> >
> > > So why is that? Why can't we generate (++) inline similar to (:)? How do we
> > > make this decision? Is there a theoretical reason for this or this is just
> > > an implementation artifact?
> >
> > The difference between these two is that (++) is a function call, and
> > (:) is a constructor. Constructors are a special case, and don't need
> > to have all the provisions that functions in general do. The best way
> > to learn what the differences are is probably to read the paper about
> > the STG machine.
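> > Concretely, (++) is itself defined by recursion (essentially as in the
> > Prelude; renamed here to avoid a clash), so every use is a call into a
> > loop at runtime, whereas each (:) compiles to a direct allocation:

```haskell
-- (++) is an ordinary recursive function: each use has to enter this
-- loop at runtime. (:) by contrast is a constructor, which the code
-- generator can allocate inline without a call.
append :: [a] -> [a] -> [a]
append []       ys = ys
append (x : xs) ys = x : append xs ys
```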
> >
> > Hope this is a more fruitful lead, but it may be that there's not a
> > lot that can be done, without doing some baroque things to your
> > algorithm, because of the necessity of its using IO.
> >
> > -- Dan