Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

19 May 2011

      Hi Simon,

On Wed, May 18, 2011 at 7:32 PM, Simon Meier  wrote:
...
In fact, one of my current goals with this work is to polish it such
that it can be integrated into the 'bytestring' library.
We should definitely add a builder monoid in the bytestring package.

Since Write mentions IO, I thought I should point out that we need to
separate any code that mentions IO from the the code that doesn't
(i.e. the pure Builder API). The use of IO is an implementation detail
in bytestring. We should follow the existing bytestring pattern and
put any code that mentions IO in e.g.
Data.ByteString.Lazy.Builder.Internal. This allows the few people who
need to access the internals to do so while making it clear that these
are in fact internals. Long term we'd like to switch bytestring over
from ForeignPtr to ByteArray#, if possible. There are currently some
technical obstacles to such a switch, but factoring out the IO code at
least makes it somewhat easier if we ever get around to switching.

Avoiding IO in the main API means that the main builder type must not
mention IO (or things related to IO, such as Storable).
...
The core principle used to tackle (1) is avoiding intermediate data
structures.  The core abstraction used is the one of a Write (see [1]
for the corresponding library.)
 data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
A value `Write bound io :: Write a` denotes an encoding scheme for
values of type `a` that uses at most `bound` bytes space. Given a
values `x :: a` and a pointer `po` to the next free byte `io x po`
encodes `x` to memory starting from `po` and returns the pointer to
the next free byte after the encoding of `x`.
In most cases Writes are used as an abstract datatype. They serve as
an interface between implementors of the low-level bit-twiddling
required to efficiently implement encodings like UTF-8 or Base16 and
the providers of efficient traversal functions through streams of
Haskell values. Hence, typical users of Writes are functions like
 fromWrite          :: Write a -> a -> Builder
 fromWriteList      :: Write a -> [a] -> Builder
 fromWriteUnfoldr   :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
 mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
We want to allow users to efficiently create new builders, for their
own data type. This is crucial as the bytestring package cannot
provide efficient builders for every possible type, as it would have
to depend on most of Hackage (i.e. on all packages that define types
that we want efficient builders for) to do so. Allowing the user to
get hold of the underlying buffer in a controlled way makes the
builder extensible. This is good.

Write achieves this separation, but it has some costs which I'm not
entirely comfortable with.

First, it leads to lots of API duplication. For every type (e.g. Word)
we want to be able serialize we have two morally identical functions

    writeWordhost :: Word -> Write
    fromWordhost :: Word -> Builder

in the API, where the latter simply calls the former and does some
additional "wrapping".

See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B...
for examples.

Simon, is the reason for this duplication this comment on top of
Blaze.ByteString.Builder.Word?

    "Note that for serializing a three tuple (x,y,z) of bytes (or other
    word values) you should use the expression

        fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z

    instead of

        fromWord8 x `mappend` fromWord8 y `mappend` fromWord z

    The first expression will result in a single atomic write of three
    bytes, while the second expression will check for each byte, if
    there is free space left in the output buffer. Coalescing these
    checks can improve performance quite a bit, as long as you use it
    sensibly."

Coalescing of buffer space checks can be achieved without separating
writes into Write and Builder. I've done so in the binary package [1]
using rewrite rules. The rewrite rules fire reliable so that any
"syntactic" series of puts i.e.

f = do
    putWord8 1
    putWord8 2
    putWord8 3

result in one bounds check, followed by three pokes into the buffer.
To do so all that is needed is to define all builders in terms of

    writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder

and create a rewrite rule for append/writeAtMost. writeAtMost is
essentially the same as your Write [2], except it never leads to any
constructors getting allocated.

At the moment, the addition of Write means that

    import Blaze.ByteString.Builder

    f :: [Word8] -> Builder
    f xs = fromWriteList writeWord8 xs

is faster than the Data.Binary equivalent

    import Data.Binary.Builder

    g :: [Word8] -> Builder
    g [] = mempty
    g (x:xs) = singleton x `mappend` g xs

Fortunately this was due to a bug in GHC [3]. After this bug has been
fixed I expect Data.Binary to perform on par with
Blaze.ByteString.Builder, at least for code that involves loops
(blaze-builder might still do a better job with fragmentation).

Cheers,
Johan

1. https://github.com/kolmodin/binary/
2. I stole your idea of taking the maximum number of bytes to write
instead of the actual number of bytes to write. This allows many more
bounds checks to be merged (e.g. when serializing values that use
variable length encodings, such as UTF-8).
3. http://hackage.haskell.org/trac/ghc/ticket/4978

Re: [Haskell-cafe] blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

Johan Tibell