
On Mon, Jul 12, 2010 at 01:50:01PM +0100, Simon Marlow wrote:
> Right. I like the idea of packages being able to declare re-exported modules, indeed I considered doing this in GHC (when we needed base3) but decided not to mainly because we would still need PackageImports, and once you have PackageImports you can implement re-exports using that, albeit with more boilerplate. The advantage of direct re-exports, as you say, is that it avoids a conflict when the module is imported. However, since some of the modules would conflict anyway, there didn't seem to be much advantage in avoiding the conflict in some cases but not others.
Well, one main use case is that I can do -phaskell98 and -phaskell2010 at the same time, so I can easily make the default jhc behavior the union of the two languages. Another motivation is that I want to be able to provide packages like ghc-base-3 and ghc-base-4: simple shims that present interfaces compatible with versions of ghc. Those ghc bases also export some haskell98 modules, so being able to do -phaskell98 -pghc-base-3 at the same time is important. There is also some philosophy behind it: I don't like coupling things that don't need to be coupled, such as the interface a library exports and the place that interface is implemented.
I think the ability for a package to define an 'interface' built up from other re-exported modules will be very useful going forward. I am not sure how hard something like that would be to implement in ghc, but it may be worth it in the future.
> Isn't that what PackageImports lets you do?
Hmm.. maybe. Jhc doesn't have PackageImports; the idea is that you re-export modules rather than importing them from a specific package and exporting their interface. So it sort of shifts the effort from the interface makers to the implementors: the Haskell 98 Prelude will actually have to live in a module called Compat.Haskell98.Prelude and just be re-exported by the 'haskell98' and 'haskell2010' packages. In one sense 'Compat.Haskell98' sort of acts as a package import, but it isn't actually tied to a package; it just refers to whatever Compat.Haskell98.Prelude is in scope. I also have a large aversion to allowing package names to infect source files carte blanche, as it blurs the line between the language itself and the environment the language runs in. It may be necessary at some point, but I'll put it off until a case comes up that makes it abundantly clear that it is needed.
In the past I just had the Boehm GC and the 'cross your fingers and hope static analysis can catch everything' option. But as of 0.7.4 I have a real GC that I hope to make the new default in the next major release (enabled with -fjgc on the command line).
It is an accurate GC implemented in portable C. I am using something based on the paper 'Accurate Garbage Collection in an Uncooperative Environment' [1], though the technique was discovered independently. I always pass the GC parameter as the first argument to functions, which is mapped to the same register, so I effectively get a dedicated register without having to declare a fixed global register variable in gcc. Plus I can omit the parameter in leaf functions that don't allocate, freeing up the register for normal use. I compile with -mregparm=2 on i386, so the first two arguments to a function are passed in registers.
I found that an independent shadow stack is actually faster than the linked version described in the paper (though I still pass around a pointer to the top of the stack as described). My theory is that taking the address of a stack-allocated object inhibits certain gcc optimizations.
The underlying allocator is based on Bonwick's slab allocator [2], which works quite well for a Haskell runtime: I have a slab for each type, so a slab of cons cells, a slab of size-3 tuples, and so forth.
[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.5570
[2] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.143.4374
> Interesting. This is quite a different set of decisions to the way GHC does things, so I look forward to comparing some benchmarks. My intuition is that if allocation is anything more complicated than "reg += size; if (reg > lim) { ... }" then you have a lot of work to do to make up for the overhead, but let's see!
Indeed, in certain places I quite consciously made different choices than ghc, not necessarily because I thought they were better, but because I knew they would open up a new design space for me to explore. In places like the RTS I felt much more comfortable being radically different due to my embedded/operating-systems background, though a lot of the middle end is pretty much analogous to what ghc does (jhc core is based on a PTS instead of System F). My Grin has also diverged from Boquist's GRIN quite a bit: modern architectures, and my desire to target languages where I may not control the native code generator, have dictated a different set of optimizations. However, its essential character, that it is a proper, pure monad, remains intact.

Yeah, I didn't realize how important the allocator was until I started benchmarking; spending time cutting the cost of marking garbage in half didn't help nearly as much as shaving a few cycles off the allocator. The fast path of the allocator is actually very fast: each object type has its own allocator and free list, so allocation is pretty much just pulling an object off the free list. It is already the appropriate size, and its constant fields are pre-initialized since they aren't cleared during garbage collection. (A heuristic sometimes claims full pages back to the general pool.) The slow path has to grab a full page and initialize it, but that isn't really much slower, as I can prefetch the needed cache lines, so the cost is on the order of another cache-line fill. (Thinking about computational complexity in terms of O(cache line fills) rather than O(operations) is much more appropriate on today's architectures.) In any case, something I didn't expect, and that completely surprised me in a good way, was that adding the garbage collector made a large number of benchmarks actually run faster compared to a simple incrementing pointer that never reclaimed anything.
Apparently it was the indirection short-circuiting that the GC performed that did it. Saving that cache-line fill for the indirection node was a huge win that made up for the GC cost. (cachegrind from the valgrind suite is a great tool for figuring this sort of thing out.) There was probably also some benefit from packing objects of the same type tightly against each other: cons cells are exactly 2 words and packed right next to each other filling a page, so I can fit more on the same cache line, and cells of the same type allocated at around the same time probably have some locality of reference.

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/