
Stephen Blackheath wrote:
LAZY EVALUATION (AS IMPLEMENTED BY GHC)
Thank you for this marvelous explanation! I do have a few comments, however, in particular concerning nomenclature and mental model.
A value can either be a literal value, or it can be a thunk. For example, 1 + 1 is a thunk, and assuming the compiler hasn't optimized it, when it's demanded, the thunk is replaced in memory by a literal value '2', and the original thunk can be garbage collected.
The standard nomenclature is *expression*: both 1 + 1 and 2 are expressions. The difference is that 2 is in *normal form*, whereas 1 + 1 is not fully evaluated yet. The standard model for evaluating expressions in Haskell is called *graph reduction*; you can find some preliminary material in the wikibook: http://en.wikibooks.org/wiki/Haskell/Graph_reduction

"Thunk" traditionally refers to the machine representation of an unevaluated expression, usually as a pointer, and usually in the context of discussing low-level details like indirections and overwriting. I don't use the concept of a "thunk" when reasoning about lazy evaluation because it carries a lot of unnecessary mental baggage; for example, a thunk must be a pointer that points to something. This is fine when discussing how GHC implements lazy evaluation, but it's not necessary when discussing how lazy evaluation itself works. Also, "thunk" doesn't distinguish between the different normal forms. In Haskell, the relevant notions are *normal form* and *weak head normal form* (WHNF).
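A minimal sketch of the two notions (names and messages are my own): the pair below is in weak head normal form, because the outermost constructor (,) is exposed, but not in normal form, because one component is still unevaluated.

```haskell
main :: IO ()
main = do
    let pair = (1 :: Int, error "never demanded" :: Int)
    -- seq evaluates its first argument to WHNF only, so exposing
    -- the (,) constructor does not touch the components:
    pair `seq` putStrLn "pair is in WHNF"
    -- Demanding the first component is also fine; the erroneous
    -- second component is never forced:
    print (fst pair)
```

Reducing pair all the way to normal form would force both components and hit the error, which is exactly the distinction between the two forms.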
Now let's say we have a list 'let xs = [1..1000000]' which (due to previous IO) has become fully evaluated and therefore takes up memory. Later we get the expression 'length xs'. This is a thunk, and this thunk contains a reference to xs, thus preventing it being garbage collected. Its value is just an integer, but until it's evaluated, it takes up a large amount of space because of the reference to xs. This is an example of a space leak.
It is much clearer here to say that xs is the expression

    xs = 1 : 2 : 3 : 4 : 5 : ... : 1000000 : []

Clearly, this takes a lot of space to write down, and so does

    n = length (1 : 2 : 3 : ... : 1000000 : [])

Compare this to the expression

    [1..1000000] = enumFromTo 1 1000000

which takes just a few bytes to write down. Of course, the normal form of n is 1000000, which does not take much space. Hence, it is desirable to evaluate n to WHNF as soon as possible, unless it is completely unneeded.
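A sketch of the remedy described above: forcing n to WHNF with seq turns it into the plain integer 1000000, after which (assuming nothing else refers to xs) the million-element list can be garbage collected instead of being retained by the unevaluated length expression.

```haskell
main :: IO ()
main = do
    let xs = [1..1000000] :: [Int]
        n  = length xs      -- a small expression, but it retains xs
    -- Force n to WHNF now, so the large list need not be kept
    -- around just to compute its length later:
    n `seq` putStrLn "n evaluated to WHNF"
    print n                 -- prints 1000000
```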
Another example: 'length [1..1000000]' is a little different. It's a thunk (length ..) referring to a thunk ([1..1000000]). It takes up only a tiny bit of memory, since [1..1000000] is really just a recipe to create a list, not actually a list.
This is where the notion of "thunk" becomes unwieldy. The expression [1..1000000] is actually a list, just not in normal form.
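To illustrate the point, here is a small sketch: the expression [1..1000000] can be pattern matched and consumed like any other list; evaluation proceeds only as far as the demand requires.

```haskell
main :: IO ()
main = do
    let xs = [1..1000000] :: [Int]
    -- Matching on xs forces only the first (:) cell:
    case xs of
        (y:_) -> print y          -- prints 1
        []    -> putStrLn "empty"
    -- take demands only three more cells:
    print (take 3 xs)             -- prints [1,2,3]
```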
The space behaviour of this expression is good, because [1..1000000] will be lazily created as 'length' consumes it, and the numbers are discarded as it goes.
It would be more precise to say that "the space usage while evaluating the expression to normal form is good".
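As a sketch of that good space behaviour: enumFromTo produces list cells one at a time as length demands them, and each cell can be discarded as soon as it has been counted, so evaluating the whole expression to normal form runs in constant space.

```haskell
main :: IO ()
main = print (length (enumFromTo 1 (1000000 :: Int)))  -- prints 1000000
```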
Now we come to constructors. A newtype has no effect on laziness, e.g.
newtype MyNumber = MyNumber Int
This is strict and exactly equivalent to an Int. However, constructors created with data, the list constructor : and tuples (x,y) are all lazy.
The standard terminology is that data types are only evaluated to weak head normal form, i.e. only to the outermost constructor. Concerning newtypes, there is a subtle difference between a newtype

    newtype MyNumber = MyNumber Int

and a data type with a strict field

    data MyNumber = MyNumber !Int

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
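One way to observe that subtle difference (a sketch; the names N, D, fromN, fromD are my own): a newtype constructor does not exist at runtime, so matching on it forces nothing, whereas matching on a data constructor forces the value to WHNF.

```haskell
import Control.Exception (SomeException, evaluate, try)

newtype N = N Int     -- newtype: no runtime constructor
data    D = D !Int    -- data type with a strict field

-- Matching on a newtype constructor is free; this returns "ok"
-- even for an undefined argument:
fromN :: N -> String
fromN n = case n of N _ -> "ok"

-- Matching on a data constructor forces the value, so an
-- undefined argument raises an error here:
fromD :: D -> String
fromD d = case d of D _ -> "ok"

main :: IO ()
main = do
    putStrLn (fromN undefined)    -- prints "ok"
    r <- try (evaluate (fromD undefined))
            :: IO (Either SomeException String)
    putStrLn (either (const "D match forced the argument") id r)
```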