Time performance with inlining and space performance with newtype?

Hi, I'm in search of clarifications and descriptions about code optimisations that are recommended in GHC docs, regarding inlining and newtype vs data. 1. Inlining small and often used functions See https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/pragmas.html , Section "7.22.6.1. INLINE pragma". It says: "GHC tries to inline functions/values that are 'small enough', thus avoiding the call overhead and possibly exposing other more-wonderful optimisations." I understand the reason for function inlining with other languages like C, as it avoids the construction of new stack frames and coordinating stack pointers for each function call. In Haskell, rather than call stack we've got evaluation stacks. Here are my questions: what exactly is the call overhead of a non-inlned Haskell function? What additional runtime work needs to happen when calling to a non-inlined function? Specifically what other GHC optimisations are possible only if function inlining happens first? Can anyone provide a simple example where 1) a function is not inlined, versus 2) when that function is inlined, which demonstrates two things: 1) a GHC report of the additional optimisations that were applied because of inlining and 2) shorter program runtime as a result of inlining. 2. Calculating the memory costs of a data type As Don Stewart nicely explains on SO http://stackoverflow.com/a/5889784/1526266 , we should always use newtype when we have only a single constructor which has a single field. His example is this: data Book = Book Int Int newtype Book = Book (Int, Int) With newtype, the `Book` construct is guaranteed to be erased at compile time, to have the same representation as `(Int,Int)`. How to I measure the memory space costs of having `Book Int Int` hanging around at runtime, as opposed to just `(Int,Int)`? How many bytes does it cost to store `Book 4 5`? How many bytes does it cost to store `(4,5)` ? Can I ask GHC to tell me how many bytes the storage of each object would require? Or more generally, would looking at GHC Core for all my data structures inform me of the memory space costs for each of them? Thanks -- Rob

I can only answer the questions to the best of my understanding, here goes:
On 9 October 2015 at 17:00, Rob Stewart
Here are my questions: what exactly is the call overhead of a non-inlned Haskell function? What additional runtime work needs to happen when calling to a non-inlined function? Specifically what other GHC optimisations are possible only if function inlining happens first? Can anyone provide a simple example where 1) a function is not inlined, versus 2) when that function is inlined, which demonstrates two things: 1) a GHC report of the additional optimisations that were applied because of inlining and 2) shorter program runtime as a result of inlining.
An example of optimizations becoming available after inlining:
listAdd1 = map (+1)
listAdd2 = map (+2)
listAdd3 = listAdd2 . listAdd1
-- From the RHS of listAdd3
listAdd2 . listAdd1
== map (+2) . map (+1)
== map ((+2) . (+1)) -- List fusion. Also works for types other
than lists (see the talk [1] for more)
== map (\x -> (+2) ((+1) x)) -- Inline the composition dot-operator (.)
== map (\x -> (+2) (x + 1))
== map (\x -> x + 3)
== map (+3)
-- The order and the final result might not be the same, but it will be
operationally equivalent.
-- This prevents having to iterate over the whole list twice.
On 9 October 2015 at 17:00, Rob Stewart
As Don Stewart nicely explains on SO http://stackoverflow.com/a/5889784/1526266 , we should always use newtype when we have only a single constructor which has a single field. His example is this:
data Book = Book Int Int newtype Book = Book (Int, Int)
With newtype, the `Book` construct is guaranteed to be erased at compile time, to have the same representation as `(Int,Int)`. How to I measure the memory space costs of having `Book Int Int` hanging around at runtime, as opposed to just `(Int,Int)`? How many bytes does it cost to store `Book 4 5`? How many bytes does it cost to store `(4,5)` ? Can I ask GHC to tell me how many bytes the storage of each object would require? Or more generally, would looking at GHC Core for all my data structures inform me of the memory space costs for each of them?
I don't think it's possible to look into GHC's memory. Simon Marlow's book [2] explains laziness very well, but even he (he wrote the GHC runtime) says that there is no way to see the actual structure of memory usage. [1]: http://www.techcast.com/events/bigtechday6/odeon-1130/?q=odeon-1130 [2]: http://amzn.com/1449335942 -- Regards Sumit Sahrawat

On 10/09/2015 01:30 PM, Rob Stewart wrote:
Hi,
I'm in search of clarifications and descriptions about code optimisations that are recommended in GHC docs, regarding inlining and newtype vs data.
1. Inlining small and often used functions
See https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/pragmas.html , Section "7.22.6.1. INLINE pragma". It says:
"GHC tries to inline functions/values that are 'small enough', thus avoiding the call overhead and possibly exposing other more-wonderful optimisations."
I understand the reason for function inlining with other languages like C, as it avoids the construction of new stack frames and coordinating stack pointers for each function call. In Haskell, rather than call stack we've got evaluation stacks.
I am not an expert, but I can try to share my understanding of the issue. For one, you have the possibility for other optimisations. Consider: addThree :: Int -> Int -> Int addThree a b c = a + b + c This is compiled to the following core (simplified): addThree = \ a b c -> (+) $dNum_Int ((+) $dNum_Int a b) c (+) is a record selector for the Num dictionary. By inlining (+) and $dNum_Int we get this core: add_Int = \ a b -> case a of I# a# -> case b of I# b# -> I# (a# +# b#) addThree = \ a b c -> add_Int (add_Int a b) c where Int is defined as data Int = I# Int# and (+#) implements primitive (C-like) addition and has type Int# -> Int# -> Int#. add_Int is again a small function and we inline it: addThree = \ a b c -> add_Int (case a of I# a# -> case b of I# b# -> I# (a# +# b#)) c So now we can see the code allocates an Int (the I# constructor) only to pattern match on it immediately again. If we inline the second occurrence of add_Int, we get this: addThree = \ a b c -> case (case a of I# a# -> case b of I# b# -> I# (a# +# b#)) of I# a' -> case c of I# c# -> I# (a' +# c#) Now we can not inline further, but the simplifier can apply transformations it couldn't do previously. For example it can apply the case-of-case transformation: addThree = \ a b c -> case a of I# a# -> case b of I# b# -> case I# (a# +# b#) of I# a' -> case c of I# c# -> I# (a' +# c#) Now it sees it can eliminate the I# constructor: addThree = \ a b c -> case a of I# a# -> case b of I# b# -> case a# +# b# of a' -> case c of I# c# -> I# (a' +# c#) I think GHC now floats the addition to its usage site, so we get: addThree = \ a b c -> case a of I# a# -> case b of I# b# -> case c of I# c# -> I# ((a# +# b#) +# c#) So inlining did not only save some function calls, but we also avoided to allocate a boxed Int value.
Here are my questions: what exactly is the call overhead of a non-inlned Haskell function? What additional runtime work needs to happen when calling to a non-inlined function? Specifically what other GHC optimisations are possible only if function inlining happens first? Can anyone provide a simple example where 1) a function is not inlined, versus 2) when that function is inlined, which demonstrates two things: 1) a GHC report of the additional optimisations that were applied because of inlining and 2) shorter program runtime as a result of inlining.
As I tried to show in the above example, its not about the call overhead, but additional possible simplifications. Inlining a function can make polymorphic arguments monomorphic and provides the possibility to inline other small functions. It can further expose combinations of functions used for list- or stream-fusion (many fusing functions are implemented as unstream . streamedOp . stream with a rule stream . unsteam = id). Without inlining, fusion would not work at all.
2. Calculating the memory costs of a data type
As Don Stewart nicely explains on SO http://stackoverflow.com/a/5889784/1526266 , we should always use newtype when we have only a single constructor which has a single field. His example is this:
data Book = Book Int Int newtype Book = Book (Int, Int)
With newtype, the `Book` construct is guaranteed to be erased at compile time, to have the same representation as `(Int,Int)`. How to I measure the memory space costs of having `Book Int Int` hanging around at runtime, as opposed to just `(Int,Int)`? How many bytes does it cost to store `Book 4 5`? How many bytes does it cost to store `(4,5)` ? Can I ask GHC to tell me how many bytes the storage of each object would require? Or more generally, would looking at GHC Core for all my data structures inform me of the memory space costs for each of them?
There is no difference in your example. Consider instead data BookD = BookD Int newtype BookN = BookN Int BookD consumes 2 pointers (info pointer and pointer to the boxed Int) while BookN has only the size of a boxed Int (a pointer and a machine word for the Int# stored in the boxed Int). So by using a newtype you save 2 pointers and an additional indirection. It also changes the semantics of the code: Because a newtype is "free", it has to be strict in its one and only field. So if you evaluate BookD undefined to WHNF, you will get no exception while evaluating BookN undefined will throw an exception. Back to your example: Here we have either a constructor with two fields, so we need three pointers to store the object (one for the info pointer), or we have a tuple, which also contains two pointers, so it is again three pointers large. The only difference is the newtype can reuse the already existing info table for the (,) data type while the Book data type needs its own info table, so we save a few bytes of generated code (if at all, I do not know if GHC reuses info tables in such a situation).
Thanks
You're welcome.
-- Rob _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

Thank you Jonas, both answers were very nicely explained!
On 9 October 2015 at 14:48, Jonas Scholl
On 10/09/2015 01:30 PM, Rob Stewart wrote:
Hi,
I'm in search of clarifications and descriptions about code optimisations that are recommended in GHC docs, regarding inlining and newtype vs data.
1. Inlining small and often used functions
See https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/pragmas.html , Section "7.22.6.1. INLINE pragma". It says:
"GHC tries to inline functions/values that are 'small enough', thus avoiding the call overhead and possibly exposing other more-wonderful optimisations."
I understand the reason for function inlining with other languages like C, as it avoids the construction of new stack frames and coordinating stack pointers for each function call. In Haskell, rather than call stack we've got evaluation stacks.
I am not an expert, but I can try to share my understanding of the issue.
For one, you have the possibility for other optimisations. Consider:
addThree :: Int -> Int -> Int addThree a b c = a + b + c
This is compiled to the following core (simplified):
addThree = \ a b c -> (+) $dNum_Int ((+) $dNum_Int a b) c
(+) is a record selector for the Num dictionary. By inlining (+) and $dNum_Int we get this core:
add_Int = \ a b -> case a of I# a# -> case b of I# b# -> I# (a# +# b#)
addThree = \ a b c -> add_Int (add_Int a b) c
where Int is defined as data Int = I# Int#
and (+#) implements primitive (C-like) addition and has type Int# -> Int# -> Int#.
add_Int is again a small function and we inline it:
addThree = \ a b c -> add_Int (case a of I# a# -> case b of I# b# -> I# (a# +# b#)) c
So now we can see the code allocates an Int (the I# constructor) only to pattern match on it immediately again. If we inline the second occurrence of add_Int, we get this:
addThree = \ a b c -> case (case a of I# a# -> case b of I# b# -> I# (a# +# b#)) of I# a' -> case c of I# c# -> I# (a' +# c#)
Now we can not inline further, but the simplifier can apply transformations it couldn't do previously. For example it can apply the case-of-case transformation:
addThree = \ a b c -> case a of I# a# -> case b of I# b# -> case I# (a# +# b#) of I# a' -> case c of I# c# -> I# (a' +# c#)
Now it sees it can eliminate the I# constructor:
addThree = \ a b c -> case a of I# a# -> case b of I# b# -> case a# +# b# of a' -> case c of I# c# -> I# (a' +# c#)
I think GHC now floats the addition to its usage site, so we get:
addThree = \ a b c -> case a of I# a# -> case b of I# b# -> case c of I# c# -> I# ((a# +# b#) +# c#)
So inlining did not only save some function calls, but we also avoided to allocate a boxed Int value.
Here are my questions: what exactly is the call overhead of a non-inlned Haskell function? What additional runtime work needs to happen when calling to a non-inlined function? Specifically what other GHC optimisations are possible only if function inlining happens first? Can anyone provide a simple example where 1) a function is not inlined, versus 2) when that function is inlined, which demonstrates two things: 1) a GHC report of the additional optimisations that were applied because of inlining and 2) shorter program runtime as a result of inlining.
As I tried to show in the above example, its not about the call overhead, but additional possible simplifications. Inlining a function can make polymorphic arguments monomorphic and provides the possibility to inline other small functions. It can further expose combinations of functions used for list- or stream-fusion (many fusing functions are implemented as unstream . streamedOp . stream with a rule stream . unsteam = id). Without inlining, fusion would not work at all.
2. Calculating the memory costs of a data type
As Don Stewart nicely explains on SO http://stackoverflow.com/a/5889784/1526266 , we should always use newtype when we have only a single constructor which has a single field. His example is this:
data Book = Book Int Int newtype Book = Book (Int, Int)
With newtype, the `Book` construct is guaranteed to be erased at compile time, to have the same representation as `(Int,Int)`. How to I measure the memory space costs of having `Book Int Int` hanging around at runtime, as opposed to just `(Int,Int)`? How many bytes does it cost to store `Book 4 5`? How many bytes does it cost to store `(4,5)` ? Can I ask GHC to tell me how many bytes the storage of each object would require? Or more generally, would looking at GHC Core for all my data structures inform me of the memory space costs for each of them?
There is no difference in your example. Consider instead
data BookD = BookD Int newtype BookN = BookN Int
BookD consumes 2 pointers (info pointer and pointer to the boxed Int) while BookN has only the size of a boxed Int (a pointer and a machine word for the Int# stored in the boxed Int). So by using a newtype you save 2 pointers and an additional indirection. It also changes the semantics of the code: Because a newtype is "free", it has to be strict in its one and only field. So if you evaluate BookD undefined to WHNF, you will get no exception while evaluating BookN undefined will throw an exception.
Back to your example: Here we have either a constructor with two fields, so we need three pointers to store the object (one for the info pointer), or we have a tuple, which also contains two pointers, so it is again three pointers large. The only difference is the newtype can reuse the already existing info table for the (,) data type while the Book data type needs its own info table, so we save a few bytes of generated code (if at all, I do not know if GHC reuses info tables in such a situation).
Thanks
You're welcome.
-- Rob _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

To add third case:
1. which optimising GHC passes does SPECIALIZE enable? Taking the
example from the GHC docs
https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/pragmas.html...
:
hammeredLookup :: Ord key => [(key, value)] -> key -> value
{-# SPECIALIZE hammeredLookup :: [(Widget, value)] -> Widget -> value #-}
This means that the type class dictionary for Ord will not be passed
as a (hidden) argument in Core to `hammeredLookup` when Widget is the
key. The GHC docs about overloading
https://wiki.haskell.org/Performance/Overloading says "The presence of
overloading can prevent many other optimisations, such as Strictness,
from kicking in". So we lose the strictness GHC optimiser in the
absence of SPECIALIZE rules on overloaded functions. Other than
strictness analysis, what other GHC optimisations are being missed
when overloading is not dealt with using monomorphism? And what are
the specific runtime overheads of passing type class constraints as
dictionaries to overloaded functions at runtime?
--
Rob
On 9 October 2015 at 12:30, Rob Stewart
Hi,
I'm in search of clarifications and descriptions about code optimisations that are recommended in GHC docs, regarding inlining and newtype vs data.
1. Inlining small and often used functions
See https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/pragmas.html , Section "7.22.6.1. INLINE pragma". It says:
"GHC tries to inline functions/values that are 'small enough', thus avoiding the call overhead and possibly exposing other more-wonderful optimisations."
I understand the reason for function inlining with other languages like C, as it avoids the construction of new stack frames and coordinating stack pointers for each function call. In Haskell, rather than call stack we've got evaluation stacks.
Here are my questions: what exactly is the call overhead of a non-inlned Haskell function? What additional runtime work needs to happen when calling to a non-inlined function? Specifically what other GHC optimisations are possible only if function inlining happens first? Can anyone provide a simple example where 1) a function is not inlined, versus 2) when that function is inlined, which demonstrates two things: 1) a GHC report of the additional optimisations that were applied because of inlining and 2) shorter program runtime as a result of inlining.
2. Calculating the memory costs of a data type
As Don Stewart nicely explains on SO http://stackoverflow.com/a/5889784/1526266 , we should always use newtype when we have only a single constructor which has a single field. His example is this:
data Book = Book Int Int newtype Book = Book (Int, Int)
With newtype, the `Book` construct is guaranteed to be erased at compile time, to have the same representation as `(Int,Int)`. How to I measure the memory space costs of having `Book Int Int` hanging around at runtime, as opposed to just `(Int,Int)`? How many bytes does it cost to store `Book 4 5`? How many bytes does it cost to store `(4,5)` ? Can I ask GHC to tell me how many bytes the storage of each object would require? Or more generally, would looking at GHC Core for all my data structures inform me of the memory space costs for each of them?
Thanks
-- Rob

On 2015-10-09 07:30 AM, Rob Stewart wrote:
Can I ask GHC to tell me how many bytes the storage of each object would require? Or more generally, would looking at GHC Core for all my data structures inform me of the memory space costs for each of them?
Cmm (-ddump-opt-cmm) is where you see the actual numbers directly. But if you already know the stuff in https://github.com/takenobu-hs/haskell-ghc-illustrated then you can just look at the Haskell level and predict.
participants (4)
-
Albert Y. C. Lai
-
Jonas Scholl
-
Rob Stewart
-
Sumit Sahrawat, Maths & Computing, IIT (BHU)