
Hi Johan,
thanks for the extensive and motivating feedback.
2011/5/19 Johan Tibell
On Wed, May 18, 2011 at 7:32 PM, Simon Meier
wrote: In fact, one of my current goals with this work is to polish it such that it can be integrated into the 'bytestring' library.
We should definitely add a builder monoid in the bytestring package.
Since Write mentions IO, I thought I should point out that we need to separate any code that mentions IO from the the code that doesn't (i.e. the pure Builder API). The use of IO is an implementation detail in bytestring. We should follow the existing bytestring pattern and put any code that mentions IO in e.g. Data.ByteString.Lazy.Builder.Internal. This allows the few people who need to access the internals to do so while making it clear that these are in fact internals. Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch, but factoring out the IO code at least makes it somewhat easier if we ever get around to switching.
Avoiding IO in the main API means that the main builder type must not mention IO (or things related to IO, such as Storable).
I completely agree with you. The system-io-write library [1] and the bytestring fork [2] I'm working on provide separate interfaces for standard and expert users. The naming of the system-io-write library is tentative and can be adapted once it's place is clear.
The core principle used to tackle (1) is avoiding intermediate data structures. The core abstraction used is the one of a Write (see [1] for the corresponding library.)
data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
A value `Write bound io :: Write a` denotes an encoding scheme for values of type `a` that uses at most `bound` bytes space. Given a values `x :: a` and a pointer `po` to the next free byte `io x po` encodes `x` to memory starting from `po` and returns the pointer to the next free byte after the encoding of `x`.
In most cases Writes are used as an abstract datatype. They serve as an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through streams of Haskell values. Hence, typical users of Writes are functions like
fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
We want to allow users to efficiently create new builders, for their own data type. This is crucial as the bytestring package cannot provide efficient builders for every possible type, as it would have to depend on most of Hackage (i.e. on all packages that define types that we want efficient builders for) to do so. Allowing the user to get hold of the underlying buffer in a controlled way makes the builder extensible. This is good.
Write achieves this separation, but it has some costs which I'm not entirely comfortable with.
First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions
writeWordhost :: Word -> Write fromWordhost :: Word -> Builder
in the API, where the latter simply calls the former and does some additional "wrapping".
Yes, I agree with this duplication. I'll explain below what we gain from it. Note that I factored out the whole Write stuff into its own library (system-io-write) for the bytestring integration. Therefore, an end-user of bytestring will only see the Builder versions except he's doing more low-level stuff to gain some extra performance.
See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B... for examples.
Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word?
"Note that for serializing a three tuple (x,y,z) of bytes (or other word values) you should use the expression
fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z
instead of
fromWord8 x `mappend` fromWord8 y `mappend` fromWord z
The first expression will result in a single atomic write of three bytes, while the second expression will check for each byte, if there is free space left in the output buffer. Coalescing these checks can improve performance quite a bit, as long as you use it sensibly."
That's one of the reasons, but not the main one. The core reason is that Write's provide an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through (streams of) Haskell values. For simple traversals like fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder there might be the option that GHC is clever enough and can find the efficient loop. However, for more complicated functions like mapWriteByteString :: Write Word8 -> S.ByteString -> Builder That certainly isn't the case. I'm using quite a few tricks there [3] to enable a tight inner loop with few live variables.
Coalescing of buffer space checks can be achieved without separating writes into Write and Builder. I've done so in the binary package [1] using rewrite rules. The rewrite rules fire reliable so that any "syntactic" series of puts i.e.
f = do putWord8 1 putWord8 2 putWord8 3
result in one bounds check, followed by three pokes into the buffer. To do so all that is needed is to define all builders in terms of
writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder
and create a rewrite rule for append/writeAtMost. writeAtMost is essentially the same as your Write [2], except it never leads to any constructors getting allocated.
Hmm, as far as I see it `writeAtMost` corresponds in blaze-builder to implementing every Builder with `fromWrite <someWrite>`. I don't think that is a good idea because (1) there are many more use-cases for Writes than just sequencing and (2) I don't think GHC is clever enough to optimize all of these use cases well enough. For a hard use-case, have a look at the implementation of `mapWriteByteString` [3]. There, I'm using the static bound on the size of the written data to upper bound the number of input bytes I can process without checking the output bound. In my opinion, Writes and Builders have different use-cases and different semantics. Providing a type modeling Writes makes therefore sense to me. Moreover, note that Writes are built as a compile time abstraction. All their definitions are intended to be completely inlined and care is taken that the inliner also does so. Therefore, they incur no runtime cost.
At the moment, the addition of Write means that
import Blaze.ByteString.Builder
f :: [Word8] -> Builder f xs = fromWriteList writeWord8 xs
is faster than the Data.Binary equivalent
import Data.Binary.Builder
g :: [Word8] -> Builder g [] = mempty g (x:xs) = singleton x `mappend` g xs
Fortunately this was due to a bug in GHC [3]. After this bug has been fixed I expect Data.Binary to perform on par with Blaze.ByteString.Builder, at least for code that involves loops (blaze-builder might still do a better job with fragmentation).
OK. That sounds interesting. Adding rewriting rules that coalesce adjacent calls of `fromWrite` might be worth investigating, as they'd improve the user experience. However, as explained above Writes have uses cases beyond this optimization. Their main use case being an abstraction of bounded-size encodings.
1. https://github.com/kolmodin/binary/ 2. I stole your idea of taking the maximum number of bytes to write instead of the actual number of bytes to write. This allows many more bounds checks to be merged (e.g. when serializing values that use variable length encodings, such as UTF-8). 3. http://hackage.haskell.org/trac/ghc/ticket/4978
best regards, Simon [1] https://github.com/meiersi/system-io-write [2] https://github.com/meiersi/bytestring [3] https://github.com/meiersi/bytestring/blob/master/Data/ByteString/Builder/Wr...