
On Tue, May 10, 2016 at 4:45 AM, Harendra Kumar
Thanks Dan, that helped. I did notice and suspect the update frame and the unboxed tuple but given my limited knowledge about ghc/core/stg/cmm I was not sure what is going on. In fact I thought that the intermediate tuple should get optimized out since it is required only because of the realworld token which is not real. But it might be difficult to see that at this level?
The token exists as far as the STG level is concerned, I think, because that is the only thing ensuring that things happen in the right order. And the closure must be built to have properly formed STG. So optimizing away the closure building would have to happen at a level lower than STG, and I guess there is no such optimization. I'm not sure how easy it would be to do.
What you are saying may be true for the current implementation but in theory can we eliminate the intermediate closure?
I don't know. I'm a bit skeptical that building this one closure is the only thing causing your code to be a factor of 5 slower. For instance, another difference in the core is that the recursive call corresponding to the result s2 happens before allocating the additional closure. That is the case statement that unpacks the unboxed tuple. And the whole loop happens this way, so it is actually building a structure corresponding to the entire output list in memory rather eagerly. By contrast, your pure function is able to act in a streaming fashion, if consumed properly, where only enough of the result is built to keep driving the rest of the program. It probably runs in constant space, while your IO-based loop has a footprint linear in the size of the input list (in addition to having slightly more overhead per character because of the one extra thunk), because it is a more eager program. And having an asymptotically larger memory footprint is more likely to explain a 5x slowdown than allocating one extra thunk for each list concatenation, I think. (Of course, it could also be some other factor, as well.) You should probably run with +RTS -s (compiling with -rtsopts), and see if the IO version has a much larger maximum residency. Anyhow, this is fundamental to writing the algorithm using IO. It's an algorithm that's a sequence of steps that happen in order, the number of steps is proportional to the input list, and part of those steps is putting stuff in memory for the results.
So why is that? Why can't we generate (++) inline similar to (:)? How do we make this decision? Is there a theoretical reason for this or this is just an implementation artifact?
The difference between these two is that (++) is a function call, and (:) is a constructor. Constructors are a special case, and don't need to have all the provisions that functions in general do. The best way to learn what the differences are is probably to read the paper about the STG machine. Hope this is a more fruitful lead, but it may be that there's not a lot that can be done, without doing some baroque things to your algorithm, because of the necessity of its using IO. -- Dan