
Hi Simon,
On Wed, May 18, 2011 at 7:32 PM, Simon Meier
In fact, one of my current goals with this work is to polish it such that it can be integrated into the 'bytestring' library.
We should definitely add a builder monoid in the bytestring package. Since Write mentions IO, I thought I should point out that we need to separate any code that mentions IO from the the code that doesn't (i.e. the pure Builder API). The use of IO is an implementation detail in bytestring. We should follow the existing bytestring pattern and put any code that mentions IO in e.g. Data.ByteString.Lazy.Builder.Internal. This allows the few people who need to access the internals to do so while making it clear that these are in fact internals. Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch, but factoring out the IO code at least makes it somewhat easier if we ever get around to switching. Avoiding IO in the main API means that the main builder type must not mention IO (or things related to IO, such as Storable).
The core principle used to tackle (1) is avoiding intermediate data structures. The core abstraction used is the one of a Write (see [1] for the corresponding library.)
data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
A value `Write bound io :: Write a` denotes an encoding scheme for values of type `a` that uses at most `bound` bytes space. Given a values `x :: a` and a pointer `po` to the next free byte `io x po` encodes `x` to memory starting from `po` and returns the pointer to the next free byte after the encoding of `x`.
In most cases Writes are used as an abstract datatype. They serve as an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through streams of Haskell values. Hence, typical users of Writes are functions like
fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
We want to allow users to efficiently create new builders, for their own data type. This is crucial as the bytestring package cannot provide efficient builders for every possible type, as it would have to depend on most of Hackage (i.e. on all packages that define types that we want efficient builders for) to do so. Allowing the user to get hold of the underlying buffer in a controlled way makes the builder extensible. This is good. Write achieves this separation, but it has some costs which I'm not entirely comfortable with. First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions writeWordhost :: Word -> Write fromWordhost :: Word -> Builder in the API, where the latter simply calls the former and does some additional "wrapping". See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B... for examples. Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word? "Note that for serializing a three tuple (x,y,z) of bytes (or other word values) you should use the expression fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z instead of fromWord8 x `mappend` fromWord8 y `mappend` fromWord z The first expression will result in a single atomic write of three bytes, while the second expression will check for each byte, if there is free space left in the output buffer. Coalescing these checks can improve performance quite a bit, as long as you use it sensibly." Coalescing of buffer space checks can be achieved without separating writes into Write and Builder. I've done so in the binary package [1] using rewrite rules. The rewrite rules fire reliable so that any "syntactic" series of puts i.e. f = do putWord8 1 putWord8 2 putWord8 3 result in one bounds check, followed by three pokes into the buffer. To do so all that is needed is to define all builders in terms of writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder and create a rewrite rule for append/writeAtMost. writeAtMost is essentially the same as your Write [2], except it never leads to any constructors getting allocated. At the moment, the addition of Write means that import Blaze.ByteString.Builder f :: [Word8] -> Builder f xs = fromWriteList writeWord8 xs is faster than the Data.Binary equivalent import Data.Binary.Builder g :: [Word8] -> Builder g [] = mempty g (x:xs) = singleton x `mappend` g xs Fortunately this was due to a bug in GHC [3]. After this bug has been fixed I expect Data.Binary to perform on par with Blaze.ByteString.Builder, at least for code that involves loops (blaze-builder might still do a better job with fragmentation). Cheers, Johan 1. https://github.com/kolmodin/binary/ 2. I stole your idea of taking the maximum number of bytes to write instead of the actual number of bytes to write. This allows many more bounds checks to be merged (e.g. when serializing values that use variable length encodings, such as UTF-8). 3. http://hackage.haskell.org/trac/ghc/ticket/4978