Question on optimization

Dear Cafe,

In code of the form

    case as ++ bs of
      []   -> ...
      x:xs -> ...

under what circumstances is the overhead of cons-cell creation for the
elements of `as` that would happen in an eager language avoided? Always?
Is the answer the same if `f` is defined by simple list recursion and we
call `f (as ++ bs)`? What if `f` is defined in a different module (or
comes from the Prelude)?

I know that questions of this type can be answered in specific cases by
looking at the (intermediate) code generated by the compiler, but more
generally, are there some succinct principles that everyday programmers
can hang on their cubicle walls that summarize a lot of what they can
expect from the compiler in terms of optimizations?

Todd Wilson

On Tue, Dec 15, 2020 at 02:05:13PM -0800, Todd Wilson wrote:
> In code of the form
>
>     case as ++ bs of
>       []   -> ...
>       x:xs -> ...
>
> under what circumstances is the overhead of cons-cell creation for the
> elements of `as` that would happen in an eager language avoided?
In Haskell, even without optimisation, I'd expect the above to run in constant space and time regardless of the length of `as`.
> Always? Is the answer the same if `f` is defined by simple list
> recursion and we call `f (as ++ bs)`? What if `f` is defined in a
> different module (or comes from the Prelude)?
For example, ghci with no optimisation:

    λ> let f n = [n..n*n]
    λ> case f 1000000 ++ f 10 of { [] -> "empty"; x:xs -> show x }
    "1000000"
    (0.01 secs, 69,680 bytes)
    λ> case f 10000000000000 ++ f 10 of { [] -> "empty"; x:xs -> show x }
    "10000000000000"
    (0.01 secs, 75,384 bytes)
    λ> let x = (1 : undefined) ++ [2]
    (0.01 secs, 63,200 bytes)
    λ> case x of { [] -> "empty"; x:xs -> show x }
    "1"
    (0.01 secs, 64,808 bytes)

So at least with "++" the result is expected to be lazy in the tail of
`as`. If you interpose some list-generating function:

    f :: [a] -> [b]

called as `f (as ++ bs)`, the result rather depends on how strict `f` is
in the tail of its input. If `f` can return an initial segment of its
output from just an initial segment of its input, then the same sort of
constant space/time behaviour would be expected from `f`.
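To make the dependence on `f` concrete (an example of my own, not from
the original question): `map` consumes only as much of its input as its
consumer demands, while `reverse` must reach the end of the list before
producing anything, and so forces the entire spine of `as ++ bs`:

    λ> take 1 (map (*2) ([1] ++ undefined))
    [2]
    λ> take 1 (reverse ([1] ++ undefined))
    *** Exception: Prelude.undefined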
> I know that questions of this type can be answered in specific cases
> by looking at the (intermediate) code generated by the compiler, but
> more generally, are there some succinct principles that everyday
> programmers can hang on their cubicle walls that summarize a lot of
> what they can expect from the compiler in terms of optimizations?
This feels more like a strict/lazy question to me than optimised vs.
unoptimised. Haskell promises lazy evaluation even when unoptimised.

    λ> head $ map (+1) $ (1 : undefined) ++ [2]
    2

Here, `map` is but one example of an `f` that is lazy in the tail of its
input:

    map _ []     = []
    map f (x:xs) = f x : map f xs

Generally, any list-to-list function that returns incrementally
consumable output given an infinite list, or a usable initial segment
from a list with a divergent term, is one that necessarily only
evaluates some finite initial segment of its input, and is not sensitive
to what lies beyond.

A naive mental model for (++) is then:

    []     ++ bs = bs
    (a:as) ++ bs = a : (as ++ bs)

which is easily seen to be lazy in `as`.

-- Viktor.
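P.S. A runnable check of that mental model (a snippet of my own;
`append` is a local clone of (++), and the name is mine, so it does not
clash with the Prelude):

    -- Save as e.g. Append.hs and run with `runghc Append.hs`.
    append :: [a] -> [a] -> [a]
    append []     bs = bs
    append (a:as) bs = a : append as bs

    main :: IO ()
    main =
        -- Only the first cons of the left argument is ever demanded,
        -- so the divergent tail is never evaluated; this prints 1.
        print (head (append (1 : undefined) [2 :: Int]))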

Thanks, Viktor, for your response, but I think you misunderstood my question:
On Tue, Dec 15, 2020 at 3:14 PM Viktor Dukhovni wrote:
> On Tue, Dec 15, 2020 at 02:05:13PM -0800, Todd Wilson wrote:
>
>> In code of the form
>>
>>     case as ++ bs of
>>       []   -> ...
>>       x:xs -> ...
>>
>> under what circumstances is the overhead of cons-cell creation for
>> the elements of `as` that would happen in an eager language avoided?
>
> In Haskell, even without optimisation, I'd expect the above to run in
> constant space and time regardless of the length of `as`.
Of course this is constant time and space, but that wasn't what I was
asking. Suppose HNF(as) = a : ys. Then the `case` will cause the
definition of (++) to unfold once and produce

    a : (ys ++ bs)

taking the second branch with x = a and xs = ys ++ bs, the latter a
thunk. My question was whether a cons-cell containing `a` is created in
the heap or not -- that's the "overhead" I was asking about. In this
case, it seems that the compiler could easily avoid this overhead by
inlining the definition of (++) and doing some basic simplification to
arrive at
    case as of
      []   -> case bs of
                []   -> ... -- 1st branch above
                x:xs -> ... -- 2nd branch above
      x:ys -> let xs = ys ++ bs in ... -- 2nd branch above
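Whether this simplification actually happens in any specific instance
can be checked by dumping GHC's simplified Core (the module and function
names below are just for illustration):

    -- Probe.hs: compile with
    --
    --     ghc -O2 -ddump-simpl -dsuppress-all Probe.hs
    --
    -- and inspect the Core for `firstOfAppend` to see whether the
    -- case-of-case transformation has eliminated the intermediate
    -- cons cell from (++).
    module Probe (firstOfAppend) where

    firstOfAppend :: [Int] -> [Int] -> Int
    firstOfAppend as bs =
        case as ++ bs of
          []    -> 0
          x : _ -> x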
My follow-up questions were about whether this overhead would also be
avoided in `f (as ++ bs)` if `f` were defined in the same module by list
recursion, or in another module or the Prelude (say `f = length`). Here
is a similar, but more sophisticated example involving flattening of
nested lists, coded three ways:
    data Nest a = Nil | Cons a (Nest a) | Nest (Nest a) (Nest a)
    flatten :: Nest a -> [a]
    flatten Nil          = []
    flatten (Cons x n)   = x : flatten n
    flatten (Nest n1 n2) = flatten n1 ++ flatten n2
    flatten2 n = fl [n]
      where
        fl :: [Nest a] -> [a]
        fl []                = []
        fl (Nil : ns)        = fl ns
        fl (Cons x n : ns)   = x : fl (n:ns)
        fl (Nest n1 n2 : ns) = fl (n1:n2:ns)
    flatten3 n = fl n []
      where
        fl :: Nest a -> [Nest a] -> [a]
        fl Nil []          = []
        fl Nil (n:ns)      = fl n ns
        fl (Cons x n) ns   = x : fl n ns
        fl (Nest n1 n2) ns = fl n1 (n2:ns)
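As a sanity check, all three agree on, for example, this arbitrary
sample value (the value is mine, purely for illustration):

    example :: Nest Int
    example = Cons 1 (Nest (Cons 2 Nil) (Cons 3 (Nest Nil (Cons 4 Nil))))

    -- λ> flatten example
    -- [1,2,3,4]
    -- λ> flatten2 example
    -- [1,2,3,4]
    -- λ> flatten3 example
    -- [1,2,3,4]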
On the surface, the difference between these three functions is how many cons-cells they create, in decreasing order. In `flatten`, elements are also copied during (++) operations a number of times equal to their depth in the original nested list; the other two each copy the original elements once. But does (or can) the compiler simplify `flatten2` more or less to `flatten3`? It seems to hinge on how nested patterns are handled. Again, I wish I had some general principles about what kind of compiler optimizations are applied, so that I could answer such questions in the abstract and write code in an easier-to-understand but seemingly more expensive way knowing that the extra expense was going to be compiled away. --Todd

On Wed, Dec 16, 2020 at 05:04:53PM -0800, Todd Wilson wrote:
> Thanks, Viktor, for your response, but I think you misunderstood my
> question:
Indeed I read much too little into it. Sorry about that.
> Here is a similar, but more sophisticated example involving flattening
> of nested lists, coded three ways:
>
>     data Nest a = Nil | Cons a (Nest a) | Nest (Nest a) (Nest a)
>
>     flatten :: Nest a -> [a]
>     flatten Nil          = []
>     flatten (Cons x n)   = x : flatten n
>     flatten (Nest n1 n2) = flatten n1 ++ flatten n2
I would have gone with a right fold:

    flatten = foldr k []
      where
        k Nil        a = a
        k (Cons x n) a = x : foldr k a n
        k (Nest x y) a = foldr k (foldr k a y) x

In fact, I'd be rather inclined to do this indirectly by instead
defining an instance of Foldable for this structure, and then just using
that to do the flattening:

    instance Foldable Nest where
        foldr _ z Nil        = z
        foldr f z (Cons x n) = f x (foldr f z n)
        foldr f z (Nest x y) = foldr f (foldr f z y) x

    flatten = foldr (:) []
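With that instance in place, `flatten` coincides with
Data.Foldable.toList (whose default definition is exactly
`foldr (:) []`), and the other folds come along for free; a quick ghci
check of my own:

    λ> import Data.Foldable (toList)
    λ> toList (Cons 1 (Nest (Cons 2 Nil) (Cons 3 Nil)))
    [1,2,3]
    λ> sum (Cons 1 (Nest (Cons 2 Nil) (Cons 3 Nil)))
    6
    λ> length (Cons 1 (Nest (Cons 2 Nil) (Cons 3 Nil)))
    3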
> Again, I wish I had some general principles about what kind of
> compiler optimizations are applied, so that I could answer such
> questions in the abstract and write code in an easier-to-understand
> but seemingly more expensive way knowing that the extra expense was
> going to be compiled away.
I don't have an answer for your original question, but I think you'll
find that flattening foldable/traversable data structures is often
optimally done with some sort of right fold, which avoids the costs of
(++) by producing the first element directly, and then lazily the rest
of the structure. The Traversable instance in this case would be:

    {-# LANGUAGE DeriveFunctor #-}

    data Nest a = Nil
                | Cons a (Nest a)
                | Nest (Nest a) (Nest a)
        deriving (Show, Functor)

    instance Traversable Nest where
        traverse f Nil        = pure Nil
        traverse f (Cons x n) = Cons <$> f x <*> traverse f n
        traverse f (Nest x y) = Nest <$> traverse f x <*> traverse f y

A complete demo program is below my signature, which produces the
expected output:

    [1,2,3,10,4,20]
    Just (Cons 2 (Cons 3 (Nest (Cons 4 (Cons 11 Nil)) (Cons 5 (Cons 21 Nil)))))

I don't know whether indirection via "foldr = go" with `go` recursive
and `foldr` "INLINABLE" and ditto for `traverse` would create further
opportunities for optimisation when the instances are in a separate
module. Someone else might clear that up, or you could benchmark and
see. IIRC, recursive functions don't directly inline without such
tweaks, but the tweaks might not help much.

-- Viktor.

{-# LANGUAGE DeriveFunctor #-}
module Main (main) where

data Nest a = Nil
            | Cons a (Nest a)
            | Nest (Nest a) (Nest a)
    deriving (Show, Functor)

instance Foldable Nest where
    foldr _ z Nil        = z
    foldr f z (Cons x n) = f x (foldr f z n)
    foldr f z (Nest x y) = foldr f (foldr f z y) x

instance Traversable Nest where
    traverse f Nil        = pure Nil
    traverse f (Cons x n) = Cons <$> f x <*> traverse f n
    traverse f (Nest x y) = Nest <$> traverse f x <*> traverse f y

flatten :: Nest a -> [a]
flatten = foldr (:) []

nested :: Nest Int
nested = Cons 1 $ Cons 2 $
         Nest (Cons 3 $ Cons 10 Nil)
              (Cons 4 $ Cons 20 Nil)

main :: IO ()
main = do
    print $ flatten nested
    print $ traverse (\i -> Just (i + 1)) nested

On Thu, Dec 17, 2020 at 12:15:19AM -0500, Viktor Dukhovni wrote:
> I don't know whether indirection via "foldr = go" with `go` recursive
> and `foldr` "INLINABLE" and ditto for `traverse` would create further
> opportunities for optimisation when the instances are in a separate
> module. Someone else might clear that up, or you could benchmark and
> see. IIRC, recursive functions don't directly inline without such
> tweaks, but the tweaks might not help much.
Naive experiments seem to suggest that inlining via a `go` indirection
that captures `f` is somewhat helpful, but the real win is using a right
fold, vs. your original flatten implementation. Here's a comparison with
a somewhat branchy deep structure.

Flatten via Foldable:
---------------------

       2,281,024 bytes allocated in the heap
           3,312 bytes copied during GC
          44,408 bytes maximum residency (1 sample(s))
          25,224 bytes maximum slop
              17 MiB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
  Gen  0         0 colls,     0 par    0.000s   0.000s     0.0000s    0.0000s
  Gen  1         1 colls,     0 par    0.000s   0.000s     0.0005s    0.0005s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.002s  (  0.002s elapsed)
  GC      time    0.000s  (  0.000s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.003s  (  0.003s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,234,320,346 bytes per MUT second

  Productivity  68.0% of total user, 68.0% of total elapsed

The posted flatten (1000x more allocations! 200x runtime):
----------------------------------------------------------

   2,651,027,056 bytes allocated in the heap
      11,386,968 bytes copied during GC
         164,424 bytes maximum residency (2 sample(s))
          29,320 bytes maximum slop
              27 MiB total memory in use (0 MB lost due to fragmentation)

                                   Tot time (elapsed)  Avg pause  Max pause
  Gen  0       104 colls,     0 par    0.015s   0.015s     0.0001s    0.0004s
  Gen  1         2 colls,     0 par    0.001s   0.001s     0.0007s    0.0010s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    0.715s  (  0.715s elapsed)
  GC      time    0.016s  (  0.016s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    0.732s  (  0.732s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    3,705,952,327 bytes per MUT second

  Productivity  97.7% of total user, 97.7% of total elapsed

-- Viktor.

{-# LANGUAGE DeriveFunctor #-}
module Nest ( Nest(..) ) where

data Nest a = Nil
            | Cons a (Nest a)
            | Nest (Nest a) (Nest a)
    deriving (Show, Functor)

instance Foldable Nest where
    {-# INLINE foldr #-}
    foldr f = go
      where
        go z Nil        = z
        go z (Cons x n) = f x (go z n)
        go z (Nest x y) = go (go z y) x

instance Traversable Nest where
    {-# INLINE traverse #-}
    traverse f = go
      where
        go Nil        = pure Nil
        go (Cons x n) = Cons <$> f x <*> go n
        go (Nest x y) = Nest <$> go x <*> go y
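The driver that produced these numbers was not posted; a plausible
stand-in (the structure's shape and depth are guesses of mine, not the
exact ones used) would be something like:

    -- Main.hs: build with `ghc -O2 -rtsopts Main.hs` and run with
    -- `./Main +RTS -s` to get statistics like those above.
    module Main (main) where

    import Nest

    -- A complete binary tree of Nest nodes with a few elements per
    -- leaf: branchy and deep.
    deepNest :: Int -> Nest Int
    deepNest 0 = Cons 1 (Cons 2 (Cons 3 Nil))
    deepNest d = Nest (deepNest (d - 1)) (deepNest (d - 1))

    flattenViaFoldable :: Nest a -> [a]
    flattenViaFoldable = foldr (:) []

    main :: IO ()
    main = print (sum (flattenViaFoldable (deepNest 16)))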