time 79.66 ms (75.49 ms .. 82.55 ms)
Even though this is a small program that does very little, and therefore magnifies even small differences, I am not sure that the recursive vs. lazy building of the list completely explains a slowdown on the order of 7.5x. I am wondering if there is anything here worth investigating and improving further.
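For anyone following along, the two shapes being compared look roughly like this (a minimal sketch with made-up names; the real program is more involved):

```haskell
import Data.Char (toUpper)

-- Pure, lazy version: the result list is built on demand, one cons
-- cell at a time, as the consumer forces it.
upLazy :: String -> String
upLazy = map toUpper

-- IO version: the recursive call runs to completion before each cons
-- cell is returned, so the whole output is materialized up front.
upIO :: String -> IO String
upIO []     = return []
upIO (c:cs) = do
  rest <- upIO cs
  return (toUpper c : rest)

main :: IO ()
main = do
  putStrLn (upLazy "hello")
  putStrLn =<< upIO "hello"
```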
-harendra
On 12 May 2016 at 05:41, Dan Doel <dan.doel@gmail.com> wrote:
>
> On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar <harendra.kumar@gmail.com> wrote:
> > Thanks Dan, that helped. I did notice and suspect the update frame and the
> > unboxed tuple, but given my limited knowledge of GHC Core/STG/Cmm I was
> > not sure what was going on. In fact I thought that the intermediate tuple
> > should get optimized out, since it is required only because of the RealWorld
> > token, which is not a real value. But perhaps it is difficult to see that at
> > this level?
>
> The token exists as far as the STG level is concerned, I think,
> because that is the only thing ensuring that things happen in the
> right order. And the closure must be built to have properly formed
> STG. So optimizing away the closure building would have to happen at a
> level lower than STG, and I guess there is no such optimization. I'm
> not sure how easy it would be to do.
>
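[For readers unfamiliar with the token Dan mentions: in GHC, IO is represented as a state-passing function over a `State# RealWorld` token, and bind simply threads that token through. A hand-rolled bind at that level looks roughly like this; a sketch of the representation, not production code:

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.IO (IO (..))

-- The token returned by the first action is fed to the second; this
-- data dependency is what forces the two effects into order. The
-- token has no runtime representation of its own, but at the STG
-- level it is still an argument, so closures mentioning it still
-- get built.
bindIO' :: IO a -> (a -> IO b) -> IO b
bindIO' (IO m) k =
  IO (\s -> case m s of
              (# s', a #) -> case k a of IO n -> n s')

main :: IO ()
main = putStr "hello, " `bindIO'` \_ -> putStrLn "world"
```
]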
> > What you are saying may be true for the current implementation but in theory
> > can we eliminate the intermediate closure?
>
> I don't know. I'm a bit skeptical that building this one closure is
> the only thing causing your code to be a factor of 5 slower. For
> instance, another difference in the core is that the recursive call
> corresponding to the result s2 happens before allocating the
> additional closure. That is the case statement that unpacks the
> unboxed tuple. And the whole loop proceeds this way, so it is
> eagerly building a structure in memory corresponding to the entire
> output list.
>
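[One way to see this eagerness directly: give both versions an input whose spine blows up after two elements. The lazy version can still deliver the prefix, while the IO loop fails before producing anything, because its recursion walks the entire spine first. A sketch with made-up names:

```haskell
import Control.Exception (SomeException, try)
import Data.Char (toUpper)

upIO :: String -> IO String
upIO []     = return []
upIO (c:cs) = do
  rest <- upIO cs          -- recurse to the end of the input first
  return (toUpper c : rest)

main :: IO ()
main = do
  let input = "ab" ++ undefined  -- spine is defined for two elements only
  -- Lazy: only two cons cells are demanded, so the undefined tail is
  -- never touched.
  putStrLn (take 2 (map toUpper input))
  -- Eager IO loop: hits the undefined tail even though the caller
  -- only wants two characters.
  r <- try (upIO input) :: IO (Either SomeException String)
  putStrLn (either (const "upIO forced the whole spine") (take 2) r)
```
]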
> By contrast, your pure function is able to act in a streaming fashion,
> if consumed properly, where only enough of the result is built to keep
> driving the rest of the program. It probably runs in constant space,
> while your IO-based loop has a footprint linear in the size of the
> input list (in addition to having slightly more overhead per character
> because of the one extra thunk), because it is a more eager program.
>
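[The streaming behaviour Dan describes is easy to check: the pure version happily consumes an infinite input, because only the demanded prefix of the result is ever built, whereas an IO loop that traverses its whole input first could never return:

```haskell
import Data.Char (toUpper)

main :: IO ()
main =
  -- cycle "ab" is infinite; take 6 demands only six cons cells of the
  -- mapped result, so the program terminates, in constant space.
  putStrLn (take 6 (map toUpper (cycle "ab")))
```
]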
> And having an asymptotically larger memory footprint is more likely to
> explain a 5x slowdown than allocating one extra thunk for each list
> concatenation, I think. (Of course, it could also be some other
> factor, as well.)
>
> You should probably run with +RTS -s (compiling with -rtsopts), and
> see if the IO version has a much larger maximum residency.
>
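[Concretely, that comparison could look something like this (a sketch; the functions here are stand-ins for the real ones):

```haskell
-- Compile:  ghc -O2 -rtsopts Residency.hs
-- Run:      ./Residency +RTS -s      (and ./Residency io +RTS -s)
-- then compare the "maximum residency" line between the two modes.
import Data.Char (toUpper)
import System.Environment (getArgs)

upIO :: String -> IO String
upIO []     = return []
upIO (c:cs) = do
  rest <- upIO cs
  return (toUpper c : rest)

main :: IO ()
main = do
  args <- getArgs
  let input = replicate 1000000 'x'
  case args of
    ["io"] -> do
      -- The whole result list is live at once: residency grows
      -- linearly with the input.
      r <- upIO input
      print (length r)
    _ -> do
      -- Streaming: cells are consumed as soon as they are produced,
      -- so residency stays roughly constant.
      print (length (map toUpper input))
```
]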
> Anyhow, this is fundamental to writing the algorithm using IO. It's an
> algorithm that's a sequence of steps that happen in order, the number
> of steps is proportional to the length of the input list, and part of those steps is
> putting stuff in memory for the results.
>
> > So why is that? Why can't we generate (++) inline similar to (:)? How do we
> > make this decision? Is there a theoretical reason for this or this is just
> > an implementation artifact?
>
> The difference between these two is that (++) is a function call, and
> (:) is a constructor. Constructors are a special case, and don't need
> to have all the provisions that functions in general do. The best way
> to learn what the differences are is probably to read the paper about
> the STG machine.
>
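[To make the contrast concrete: `(:)` is a constructor, which compiles to a direct heap allocation of a cons cell, while `(++)` is an ordinary recursive function. A clone of its definition (mirroring the one in GHC's base library):

```haskell
-- An ordinary function, defined by recursion on its first argument,
-- so each use is a genuine call, not a single allocation the way a
-- bare (:) is.
append :: [a] -> [a] -> [a]
append []     ys = ys
append (x:xs) ys = x : append xs ys

main :: IO ()
main = print (append [1, 2] [3 :: Int])
```
]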
> Hope this is a more fruitful lead, but it may be that there's not a
> lot that can be done, without doing some baroque things to your
> algorithm, because of the necessity of its using IO.
>
> -- Dan