blaze-builder and FlexibleInstances in code that aims to become part of the Haskell platform

Hello Haskell-Cafe, my main question is whether requiring FlexibleInstances is a problem for code that aims to become part of the Haskell platform. The following explanation gives the context for this question. As some of you may know the blaze-builder library is now used in quite a few places. That's nice, but it doesn't mean that blaze-builder is a finished solution to the problem of providing an API for high-performance buffered output (creation of chunked representations) of sequences of bytes. In fact, one of my current goals with this work is to polish it such that it can be integrated into the 'bytestring' library. This has the benefit that functions that create lazy bytestrings (e.g., pack, map, unfoldr, filter) can be implemented such that they create well-sized chunks even if the argument bytestring is hugely fragmented. Moreover, this integration also establishes a single builder type as the output representation. Therefore, other creators of bytestrings (e.g., 'text', 'base16-bytestring', 'zlib') can provide results of type Builder, which enables O(1) appends of their results and the preservation of well-sizedness of the created chunks. As part of this goal, I'm currently working on the first of the following three points that are paramount to achieving great encoding performance: 1. Ensure that individual Haskell values are encoded with minimal overhead. 2. Ensure that concatenation of sequences of bytes is efficient. 3. Ensure that the average chunk size is large. The core principle used to tackle (1) is avoiding intermediate data structures. The core abstraction used is the one of a Write (see [1] for the corresponding library.) data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8)) A value `Write bound io :: Write a` denotes an encoding scheme for values of type `a` that uses at most `bound` bytes space. Given a values `x :: a` and a pointer `po` to the next free byte `io x po` encodes `x` to memory starting from `po` and returns the pointer to the next free byte after the encoding of `x`. In most cases Writes are used as an abstract datatype. They serve as an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through streams of Haskell values. Hence, typical users of Writes are functions like fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder mapWriteByteString :: Write Word8 -> S.ByteString -> Builder They consume the given datastructure efficiently and wrap the Writes in the bounds checking code to detect when a buffer is full and request a new one. There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions. writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32 I would like to avoid naming all these encodings individually. Especially, as the situation becomes worse for more elaborate encodings like hexadecimal encodings. There, we encounter encodings like the utf8-encoding of the hexadecimal-encoding with lower-case letters of an Int32. writeInt32HexLowerUtf8 :: Write Int32 I really don't like that. Therefore, I'm thinking about the following solution based on type-classes. We introduce a single typeclass class Writable a where write :: Write a and use a bunch of newtypes to denote our encodings. newtype Ascii7 a = Ascii7 { unAscii7 :: a } newtype Utf8 a = Utf8 { unUtf8 :: a } newtype HexUpper a = HexUpper { unHexUpper :: a } newtype HexLower a = HexLower { unHexLower :: a } ... Assuming FlexibleInstnaces, we can write encodings like the above hex-encoding as instances instance Write (Utf8 (HexLower Int32)) where write = ... This composes rather nicely and allows the implementations to exploit special properties of the involved data. For example, if we also had a HTML escaping marker newtype Html a = Html { unHtml :: a } Then, the instance instance Write (Utf8 (HTML (HexLower Int32))) where write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i)) exploits that no HTML escaping is required for a hex-number. Assuming FlexibleContexts, the user can also build abbreviations for builders using fixed encodings. utf8 :: Writable (Utf8 a) => a -> Builder utf8 = fromWrite write . Utf8 Note that, on the Builder level, a probably better way would be to have an analogous 'ToBuilder' typeclass to abstract the various encodings. Part of these instances then reuse the corresponding instances from Writable. I think this type-class based interface to select the correct efficient implementation of an encoding is rather nice. However, I don't know if 'FlexibleInstances' and 'FlexibleContexts' are fine to use in code that aims to become part of the Haskell platform. Moreover, I might well overlook some drawbacks of this design. I'm looking forward to your feedback. best regards, Simon [1] https://github.com/meiersi/system-io-write

On Wed, May 18, 2011 at 12:32 PM, Simon Meier
Hello Haskell-Cafe,
There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions.
writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32
I would like to avoid naming all these encodings individually. Especially, as the situation becomes worse for more elaborate encodings like hexadecimal encodings. There, we encounter encodings like the utf8-encoding of the hexadecimal-encoding with lower-case letters of an Int32.
writeInt32HexLowerUtf8 :: Write Int32
I really don't like that. Therefore, I'm thinking about the following solution based on type-classes. We introduce a single typeclass
class Writable a where write :: Write a
and use a bunch of newtypes to denote our encodings.
newtype Ascii7 a = Ascii7 { unAscii7 :: a } newtype Utf8 a = Utf8 { unUtf8 :: a } newtype HexUpper a = HexUpper { unHexUpper :: a } newtype HexLower a = HexLower { unHexLower :: a } ...
Assuming FlexibleInstnaces, we can write encodings like the above hex-encoding as instances
instance Write (Utf8 (HexLower Int32)) where write = ...
This composes rather nicely and allows the implementations to exploit special properties of the involved data. For example, if we also had a HTML escaping marker
newtype Html a = Html { unHtml :: a }
Then, the instance
instance Write (Utf8 (HTML (HexLower Int32))) where write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))
If I were authoring the above code, I don't see why that code is any easier to write or easier to read than:
urf8HtmlHexLower i = utf8HexLower i
And if I were using the encoding functions, I would much prefer to see:
urf8HtmlHexLower magicNumber
In my code, instead of:
write $ Utf8 $ HTML $ HexLower magicNumber
In addition, this would be difficult for me as a developer using the proposed library, because I would have no way to know which combinations of newtypes are valid from reading the haddocks. Maybe I'm missing something fundamental, but this approach seems more cumbersome to me as a library author (more boilerplate) and as the user of the library (less clarity in the docs and in the resultant code). Antoine

Hi Antoine, thanks for your feedback.
2011/5/18 Antoine Latter
On Wed, May 18, 2011 at 12:32 PM, Simon Meier
wrote: Hello Haskell-Cafe,
There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions.
writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32
I would like to avoid naming all these encodings individually. Especially, as the situation becomes worse for more elaborate encodings like hexadecimal encodings. There, we encounter encodings like the utf8-encoding of the hexadecimal-encoding with lower-case letters of an Int32.
writeInt32HexLowerUtf8 :: Write Int32
I really don't like that. Therefore, I'm thinking about the following solution based on type-classes. We introduce a single typeclass
class Writable a where write :: Write a
and use a bunch of newtypes to denote our encodings.
newtype Ascii7 a = Ascii7 { unAscii7 :: a } newtype Utf8 a = Utf8 { unUtf8 :: a } newtype HexUpper a = HexUpper { unHexUpper :: a } newtype HexLower a = HexLower { unHexLower :: a } ...
Assuming FlexibleInstnaces, we can write encodings like the above hex-encoding as instances
instance Write (Utf8 (HexLower Int32)) where write = ...
This composes rather nicely and allows the implementations to exploit special properties of the involved data. For example, if we also had a HTML escaping marker
newtype Html a = Html { unHtml :: a }
Then, the instance
instance Write (Utf8 (HTML (HexLower Int32))) where write (Utf8 (HTML (HexLower i))) = write (Utf8 (HexLower i))
If I were authoring the above code, I don't see why that code is any easier to write or easier to read than:
urf8HtmlHexLower i = utf8HexLower i
And if I were using the encoding functions, I would much prefer to see:
urf8HtmlHexLower magicNumber
In my code, instead of:
write $ Utf8 $ HTML $ HexLower magicNumber
In addition, this would be difficult for me as a developer using the proposed library, because I would have no way to know which combinations of newtypes are valid from reading the haddocks.
Maybe I'm missing something fundamental, but this approach seems more cumbersome to me as a library author (more boilerplate) and as the user of the library (less clarity in the docs and in the resultant code).
Hmm, that's a valid point you raise here. Especially, the documentation issue bothers me. The core problem that drove me towards this solution is the abundance of different IntX and WordX types. Each of them requiring a separate Write for big-endian, little-endian, host-endian, lower-case-hex, and uper-case-hex encodings; i.e., currently, there are int8BE :: Write Int8 int16BE :: Write Int16 int32BE :: Write Int32 ... hexLowerInt8 :: Write Int8 ... and so on. As you can see (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B...) this approach clutters the public API quite a bit. Hence, I'm thinking of using a separate type-class for each encoding; i.e., class BigEndian a where bigEndian :: Write a This collapses the big-endian encodings of all 10 bounded-size (signed and unsigned) integer types under a single name with a well-defined semantics. Moreover, it's standard Haskell 98. For the hex-encodings, I'm thinking about providing type-classes class HexLower a where hexLower :: Write a class HexLowerNoLead a where hexLowerNoLead :: Write a ... for ASCII encoding and each of the standard Unicode encodings in a separate module. The user can then select the right ones using qualified imports. In most cases, he won't even need qualification, as mixing different character encodings is seldomly used. What do you think about such an interface? Is there another catch hidden, I'm not seeing? BTW, note that Writes are a pure compile time abstraction and are thought to be completely inlined. In typical, uses cases there's no efficiency overhead stemming from these typeclasses. best regards, Simon

On Thu, May 19, 2011 at 3:06 PM, Simon Meier
The core problem that drove me towards this solution is the abundance of different IntX and WordX types. Each of them requiring a separate Write for big-endian, little-endian, host-endian, lower-case-hex, and uper-case-hex encodings; i.e., currently, there are
int8BE :: Write Int8 int16BE :: Write Int16 int32BE :: Write Int32 ... hexLowerInt8 :: Write Int8 ...
and so on. As you can see (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B...) this approach clutters the public API quite a bit. Hence, I'm thinking of using a separate type-class for each encoding; i.e.,
If Johan's work on Data.Binary and rewrite rules works out, then it would cut the exposed API in half, which helps. We could then use the module and package system to further keep the API clean, with builders which output a specific encoding could live in separate modules. This could also keep the names of the functions short, as well. That would require coming up with logical divisions for the functions you're creating, and I don't understand the big picture enough to help with that.
class BigEndian a where bigEndian :: Write a
This collapses the big-endian encodings of all 10 bounded-size (signed and unsigned) integer types under a single name with a well-defined semantics. Moreover, it's standard Haskell 98. For the hex-encodings, I'm thinking about providing type-classes
class HexLower a where hexLower :: Write a
class HexLowerNoLead a where hexLowerNoLead :: Write a
...
for ASCII encoding and each of the standard Unicode encodings in a separate module. The user can then select the right ones using qualified imports. In most cases, he won't even need qualification, as mixing different character encodings is seldomly used.
I think we may be at cross-purposes here, and might not even be discussing the same thing - I would imagine that any sort of 'Builder' type included in the bytestring package would only provide the core combinators for packing data into low-level binary formats, so discussions about text encoding issues, converting to hexidecimal and Html escaping are going above my head. This seems like what the 'text' package was written for - to separate out the construction of textual data from choosing its encoding. Are there use-cases where the 'text' package is too slow for this sort of approach? Take care, Antoine
What do you think about such an interface? Is there another catch hidden, I'm not seeing? BTW, note that Writes are a pure compile time abstraction and are thought to be completely inlined. In typical, uses cases there's no efficiency overhead stemming from these typeclasses.
best regards, Simon

On 5/19/11 5:51 PM, Antoine Latter wrote:
On Thu, May 19, 2011 at 3:06 PM, Simon Meier
wrote: The core problem that drove me towards this solution is the abundance of different IntX and WordX types. Each of them requiring a separate Write for big-endian, little-endian, host-endian, lower-case-hex, and uper-case-hex encodings; i.e., currently, there are
int8BE :: Write Int8 int16BE :: Write Int16 int32BE :: Write Int32 ... hexLowerInt8 :: Write Int8 ...
and so on. As you can see (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B...) this approach clutters the public API quite a bit. Hence, I'm thinking of using a separate type-class for each encoding; i.e.,
It seems to me that a better way of handling this would be to explicitly define an ADT (or type-level equivalent) for naming the different format options. That is, something like this: data Endianness = BE | LE | HE data Radix = Binary | Octal | Decimal | Hexadecimal ... data WIFormat = WIFormat { endianness :: {-# UNPACK #-} !Endianness , radix :: {-# UNPACK #-} !Radix ...} class WriteWI a where writeWI :: WIFormat -> Write a If you're sure that you can get rid of the typeclass overhead, then you should be able to get rid of the case analysis on the ADT as well (by making sure to always use writeWI fully saturated). But this way, you only need to deal with one class and it's obvious how to extend it (as opposed to your newtype solution where it's not clear whether the order of newtype wrapping matters, etc). Of course, I'm not advocating that specific ADT for encoding format types. For example, it's only in decimal format where there's any difference between Word* and Int* types, since the signedness never shows up explicitly in binary, oct, or hex representations. There's also the issues you've mentioned about whether hex is upper case or lower case, whether there's a leading sigil like 0 or 0o for oct, or 0x, \x, U+,... for hex. And so on. So you'll need to figure out what all the formats are you want to offer, but it should be straightforward to come up with an ADT like the one above, and then you can just case match on it to choose the specific format. As for the class, if you run into too much type ambiguity and want to avoid the need for type signatures, then you can add an unused argument of type @a@ as is common in other core libraries needing to be H98 compliant. -- Live well, ~wren

2011/5/19 Antoine Latter
On Thu, May 19, 2011 at 3:06 PM, Simon Meier
wrote: The core problem that drove me towards this solution is the abundance of different IntX and WordX types. Each of them requiring a separate Write for big-endian, little-endian, host-endian, lower-case-hex, and uper-case-hex encodings; i.e., currently, there are
int8BE :: Write Int8 int16BE :: Write Int16 int32BE :: Write Int32 ... hexLowerInt8 :: Write Int8 ...
and so on. As you can see (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B...) this approach clutters the public API quite a bit. Hence, I'm thinking of using a separate type-class for each encoding; i.e.,
If Johan's work on Data.Binary and rewrite rules works out, then it would cut the exposed API in half, which helps.
We could then use the module and package system to further keep the API clean, with builders which output a specific encoding could live in separate modules. This could also keep the names of the functions short, as well.
That would require coming up with logical divisions for the functions you're creating, and I don't understand the big picture enough to help with that.
class BigEndian a where bigEndian :: Write a
This collapses the big-endian encodings of all 10 bounded-size (signed and unsigned) integer types under a single name with a well-defined semantics. Moreover, it's standard Haskell 98. For the hex-encodings, I'm thinking about providing type-classes
class HexLower a where hexLower :: Write a
class HexLowerNoLead a where hexLowerNoLead :: Write a
...
for ASCII encoding and each of the standard Unicode encodings in a separate module. The user can then select the right ones using qualified imports. In most cases, he won't even need qualification, as mixing different character encodings is seldomly used.
I think we may be at cross-purposes here, and might not even be discussing the same thing - I would imagine that any sort of 'Builder' type included in the bytestring package would only provide the core combinators for packing data into low-level binary formats, so discussions about text encoding issues, converting to hexidecimal and Html escaping are going above my head.
This seems like what the 'text' package was written for - to separate out the construction of textual data from choosing its encoding.
Are there use-cases where the 'text' package is too slow for this sort of approach?
Take care, Antoine
What do you think about such an interface? Is there another catch hidden, I'm not seeing? BTW, note that Writes are a pure compile time abstraction and are thought to be completely inlined. In typical, uses cases there's no efficiency overhead stemming from these typeclasses.
best regards, Simon
Yes, for example using the current 'text' package is sup-optimal for dyamically generating UTF-8 encoded HTML pages. The job is simple: the data which is originally held in standard Haskell types (e.g., String) needs to be HTML escaped and UTF-8 encoded and sprinkled with tags in between. For blaze-html using blaze-builder the cost for a tag is a memcpy of the corresponding tag and the cost for a single character is one call to the nested case statement determining if the char needs to be escaped (one memcpy of its escaped version) or what bytes need to be written for UTF-8 encoding the char. This solution works with a single output buffer. For a solution using the text library the cost of creating the underlying UTF-16 array is similar to the cost for blaze-builder. However, you now also need to UTF-8 encode the UTF-16 array. This costs you more than double, as now you also have to inspect every character of every tag. For ~50% of your data you suddenly have to spend a lot more effort! I agree that the text library is a good choice for representing Unicode data of an application. However, for high-performance applications it pays off to think of its output in binary form and exploit the offered shortcuts. That's where blaze-builder and the like come in. thanks for your input, Simon

On Wed, May 18, 2011 at 12:32 PM, Simon Meier
I think this type-class based interface to select the correct efficient implementation of an encoding is rather nice. However, I don't know if 'FlexibleInstances' and 'FlexibleContexts' are fine to use in code that aims to become part of the Haskell platform. Moreover, I might well overlook some drawbacks of this design.
I forgot to answer your main question :-) I don't have a problem with these extensions being in the Haskell Platform, as the platform currently only targets GHC, but the bytestring package itself might have a higher standard of portability. Have you heard from the 'bytestring' maintainers? Antoine

On Wed, 18 May 2011 23:21:27 +0200, Antoine Latter
I don't have a problem with these extensions being in the Haskell Platform, as the platform currently only targets GHC, but the bytestring package itself might have a higher standard of portability.
So you want the packages that use extensions thrown out, as soon as another Haskell compiler becomes popular? As one of the targets of the Haskell Platform is stability, it is not advisable to target just one compiler. Regards, Henk-Jan van Tuyl -- http://Van.Tuyl.eu/ http://members.chello.nl/hjgtuyl/tourdemonad.html --

On May 19, 2011 4:57 AM, "Henk-Jan van Tuyl"
On Wed, 18 May 2011 23:21:27 +0200, Antoine Latter
wrote:
I don't have a problem with these extensions being in the Haskell Platform, as the platform currently only targets GHC, but the bytestring package itself might have a higher standard of portability.
So you want the packages that use extensions thrown out, as soon as
another Haskell compiler becomes popular? As one of the targets of the Haskell Platform is stability, it is not advisable to target just one compiler.
Portability and adherence to standards is a goal worth striving for, but the platform policy stated on the wiki is that packages in the platform should build on all compilier targets. I think we need to be pragmatic about what we include - for example functional dependencies are still controversial, but that doesn't mean that the 'mtl' package should be tossed out of the platform. But perhaps it does mean that, for example, the 'containers' package should be subject to a higher level of scutiny for its public API. The 'bytestring' package might also be such a package where we prioritize portability. I'm not active in the maintaince of the platform; perhaps I'm mis-stating the goals and policies. Antoine

Hi Simon,
On Wed, May 18, 2011 at 7:32 PM, Simon Meier
In fact, one of my current goals with this work is to polish it such that it can be integrated into the 'bytestring' library.
We should definitely add a builder monoid in the bytestring package. Since Write mentions IO, I thought I should point out that we need to separate any code that mentions IO from the the code that doesn't (i.e. the pure Builder API). The use of IO is an implementation detail in bytestring. We should follow the existing bytestring pattern and put any code that mentions IO in e.g. Data.ByteString.Lazy.Builder.Internal. This allows the few people who need to access the internals to do so while making it clear that these are in fact internals. Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch, but factoring out the IO code at least makes it somewhat easier if we ever get around to switching. Avoiding IO in the main API means that the main builder type must not mention IO (or things related to IO, such as Storable).
The core principle used to tackle (1) is avoiding intermediate data structures. The core abstraction used is the one of a Write (see [1] for the corresponding library.)
data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
A value `Write bound io :: Write a` denotes an encoding scheme for values of type `a` that uses at most `bound` bytes space. Given a values `x :: a` and a pointer `po` to the next free byte `io x po` encodes `x` to memory starting from `po` and returns the pointer to the next free byte after the encoding of `x`.
In most cases Writes are used as an abstract datatype. They serve as an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through streams of Haskell values. Hence, typical users of Writes are functions like
fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
We want to allow users to efficiently create new builders, for their own data type. This is crucial as the bytestring package cannot provide efficient builders for every possible type, as it would have to depend on most of Hackage (i.e. on all packages that define types that we want efficient builders for) to do so. Allowing the user to get hold of the underlying buffer in a controlled way makes the builder extensible. This is good. Write achieves this separation, but it has some costs which I'm not entirely comfortable with. First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions writeWordhost :: Word -> Write fromWordhost :: Word -> Builder in the API, where the latter simply calls the former and does some additional "wrapping". See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B... for examples. Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word? "Note that for serializing a three tuple (x,y,z) of bytes (or other word values) you should use the expression fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z instead of fromWord8 x `mappend` fromWord8 y `mappend` fromWord z The first expression will result in a single atomic write of three bytes, while the second expression will check for each byte, if there is free space left in the output buffer. Coalescing these checks can improve performance quite a bit, as long as you use it sensibly." Coalescing of buffer space checks can be achieved without separating writes into Write and Builder. I've done so in the binary package [1] using rewrite rules. The rewrite rules fire reliable so that any "syntactic" series of puts i.e. f = do putWord8 1 putWord8 2 putWord8 3 result in one bounds check, followed by three pokes into the buffer. To do so all that is needed is to define all builders in terms of writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder and create a rewrite rule for append/writeAtMost. writeAtMost is essentially the same as your Write [2], except it never leads to any constructors getting allocated. At the moment, the addition of Write means that import Blaze.ByteString.Builder f :: [Word8] -> Builder f xs = fromWriteList writeWord8 xs is faster than the Data.Binary equivalent import Data.Binary.Builder g :: [Word8] -> Builder g [] = mempty g (x:xs) = singleton x `mappend` g xs Fortunately this was due to a bug in GHC [3]. After this bug has been fixed I expect Data.Binary to perform on par with Blaze.ByteString.Builder, at least for code that involves loops (blaze-builder might still do a better job with fragmentation). Cheers, Johan 1. https://github.com/kolmodin/binary/ 2. I stole your idea of taking the maximum number of bytes to write instead of the actual number of bytes to write. This allows many more bounds checks to be merged (e.g. when serializing values that use variable length encodings, such as UTF-8). 3. http://hackage.haskell.org/trac/ghc/ticket/4978

Hi Johan,
thanks for the extensive and motivating feedback.
2011/5/19 Johan Tibell
On Wed, May 18, 2011 at 7:32 PM, Simon Meier
wrote: In fact, one of my current goals with this work is to polish it such that it can be integrated into the 'bytestring' library.
We should definitely add a builder monoid in the bytestring package.
Since Write mentions IO, I thought I should point out that we need to separate any code that mentions IO from the the code that doesn't (i.e. the pure Builder API). The use of IO is an implementation detail in bytestring. We should follow the existing bytestring pattern and put any code that mentions IO in e.g. Data.ByteString.Lazy.Builder.Internal. This allows the few people who need to access the internals to do so while making it clear that these are in fact internals. Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch, but factoring out the IO code at least makes it somewhat easier if we ever get around to switching.
Avoiding IO in the main API means that the main builder type must not mention IO (or things related to IO, such as Storable).
I completely agree with you. The system-io-write library [1] and the bytestring fork [2] I'm working on provide separate interfaces for standard and expert users. The naming of the system-io-write library is tentative and can be adapted once it's place is clear.
The core principle used to tackle (1) is avoiding intermediate data structures. The core abstraction used is the one of a Write (see [1] for the corresponding library.)
data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
A value `Write bound io :: Write a` denotes an encoding scheme for values of type `a` that uses at most `bound` bytes space. Given a values `x :: a` and a pointer `po` to the next free byte `io x po` encodes `x` to memory starting from `po` and returns the pointer to the next free byte after the encoding of `x`.
In most cases Writes are used as an abstract datatype. They serve as an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through streams of Haskell values. Hence, typical users of Writes are functions like
fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
We want to allow users to efficiently create new builders, for their own data type. This is crucial as the bytestring package cannot provide efficient builders for every possible type, as it would have to depend on most of Hackage (i.e. on all packages that define types that we want efficient builders for) to do so. Allowing the user to get hold of the underlying buffer in a controlled way makes the builder extensible. This is good.
Write achieves this separation, but it has some costs which I'm not entirely comfortable with.
First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions
writeWordhost :: Word -> Write fromWordhost :: Word -> Builder
in the API, where the latter simply calls the former and does some additional "wrapping".
Yes, I agree with this duplication. I'll explain below what we gain from it. Note that I factored out the whole Write stuff into its own library (system-io-write) for the bytestring integration. Therefore, an end-user of bytestring will only see the Builder versions except he's doing more low-level stuff to gain some extra performance.
See http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B... for examples.
Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word?
"Note that for serializing a three tuple (x,y,z) of bytes (or other word values) you should use the expression
fromWrite $ writeWord8 x `mappend` writeWord8 y `mappend` writeWord z
instead of
fromWord8 x `mappend` fromWord8 y `mappend` fromWord z
The first expression will result in a single atomic write of three bytes, while the second expression will check for each byte, if there is free space left in the output buffer. Coalescing these checks can improve performance quite a bit, as long as you use it sensibly."
That's one of the reasons, but not the main one. The core reason is that Write's provide an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through (streams of) Haskell values. For simple traversals like fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder there might be the option that GHC is clever enough and can find the efficient loop. However, for more complicated functions like mapWriteByteString :: Write Word8 -> S.ByteString -> Builder That certainly isn't the case. I'm using quite a few tricks there [3] to enable a tight inner loop with few live variables.
Coalescing of buffer space checks can be achieved without separating writes into Write and Builder. I've done so in the binary package [1] using rewrite rules. The rewrite rules fire reliable so that any "syntactic" series of puts i.e.
f = do putWord8 1 putWord8 2 putWord8 3
result in one bounds check, followed by three pokes into the buffer. To do so all that is needed is to define all builders in terms of
writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> Builder
and create a rewrite rule for append/writeAtMost. writeAtMost is essentially the same as your Write [2], except it never leads to any constructors getting allocated.
Hmm, as far as I see it `writeAtMost` corresponds in blaze-builder to implementing every Builder with `fromWrite <someWrite>`. I don't think that is a good idea because (1) there are many more use-cases for Writes than just sequencing and (2) I don't think GHC is clever enough to optimize all of these use cases well enough. For a hard use-case, have a look at the implementation of `mapWriteByteString` [3]. There, I'm using the static bound on the size of the written data to upper bound the number of input bytes I can process without checking the output bound. In my opinion, Writes and Builders have different use-cases and different semantics. Providing a type modeling Writes makes therefore sense to me. Moreover, note that Writes are built as a compile time abstraction. All their definitions are intended to be completely inlined and care is taken that the inliner also does so. Therefore, they incur no runtime cost.
At the moment, the addition of Write means that
import Blaze.ByteString.Builder
f :: [Word8] -> Builder f xs = fromWriteList writeWord8 xs
is faster than the Data.Binary equivalent
import Data.Binary.Builder
g :: [Word8] -> Builder g [] = mempty g (x:xs) = singleton x `mappend` g xs
Fortunately this was due to a bug in GHC [3]. After this bug has been fixed I expect Data.Binary to perform on par with Blaze.ByteString.Builder, at least for code that involves loops (blaze-builder might still do a better job with fragmentation).
OK. That sounds interesting. Adding rewriting rules that coalesce adjacent calls of `fromWrite` might be worth investigating, as they'd improve the user experience. However, as explained above Writes have uses cases beyond this optimization. Their main use case being an abstraction of bounded-size encodings.
1. https://github.com/kolmodin/binary/ 2. I stole your idea of taking the maximum number of bytes to write instead of the actual number of bytes to write. This allows many more bounds checks to be merged (e.g. when serializing values that use variable length encodings, such as UTF-8). 3. http://hackage.haskell.org/trac/ghc/ticket/4978
best regards, Simon [1] https://github.com/meiersi/system-io-write [2] https://github.com/meiersi/bytestring [3] https://github.com/meiersi/bytestring/blob/master/Data/ByteString/Builder/Wr...

Hi Simon,
On Thu, May 19, 2011 at 10:46 PM, Simon Meier
Write achieves this separation, but it has some costs which I'm not entirely comfortable with.
First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions
writeWordhost :: Word -> Write fromWordhost :: Word -> Builder
in the API, where the latter simply calls the former and does some additional "wrapping".
Yes, I agree with this duplication. I'll explain below what we gain from it. Note that I factored out the whole Write stuff into its own library (system-io-write) for the bytestring integration. Therefore, an end-user of bytestring will only see the Builder versions except he's doing more low-level stuff to gain some extra performance.
There are (at least) two cases where I think the simple Builder API must perform well for it to be usable on its own: simple loops and sequential writes. To be specific, if the following two cases don't compile into near optimal code, there's a compiler bug we should fix. First, a simple loop: f :: [Word8] -> Builder f [] = mempty f (x:xs) = singleton x `mappend` xs This code is already quite low level, there should be enough information here for the compiler to emit a simple loop with one buffer bounds check per iteration. Second, a bunch of sequential writes: g :: Word8 -> Word8 -> Word8 -> Word8 -> Builder g a b c d = singleton `mappend` (b `mappend` (c `mappend` d)) This ought to compile to a single bounds check followed by for memory writes. The user shouldn't have to get more low-level than this in these simple examples. Today this is currently only true for the second example, which we can solve using rewrite rules. The first example doesn't work due to the GHC compiler bug I mentioned.
Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word?
"<snip>"
That's one of the reasons, but not the main one. The core reason is that Write's provide an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through (streams of) Haskell values. For simple traversals like
fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
there might be the option that GHC is clever enough and can find the efficient loop. However, for more complicated functions like
mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
That certainly isn't the case. I'm using quite a few tricks there [3] to enable a tight inner loop with few live variables.
Right. So this argues for having an escape hatch, and I agree we should have one. Write at writeAtMost are both such escape hatches and I believe them to equal in expressiveness. This shouldn't come as a surprise as Write is writeAtMost with one argument reified into into a constructor field: writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> IO () data Write = {-# UNPACK #-} !Int (Ptr Word8 -> IO (Ptr Word8)) (That the second argument of writeAtMost is an Int instead of a Ptr Word8 as in Write is an unimportant difference.) There are some operational differences. * The argument to Write can be inspected at runtime, while the argument to writeAtMost can only be inspected at compile time (by a rewrite rule). * Write might exist at runtime, if it's allocation site cannot be seen by its use site, which hard to guarantee in general (it requires serious staring at Core). This is not the case for writeAtMost, unless it's partially applied. * The second field of Write is lazy. I'm not sure what, if any, implications this might have for how GHC optimizes the code.
In my opinion, Writes and Builders have different use-cases and different semantics. Providing a type modeling Writes makes therefore sense to me. Moreover, note that Writes are built as a compile time abstraction. All their definitions are intended to be completely inlined and care is taken that the inliner also does so. Therefore, they incur no runtime cost.
This is up to the user of the Write abstraction to ensure, as any function that takes a Write as an argument must have the correct INLINE incantations applied to make this happen. Cheers, Johan

2011/5/20 Johan Tibell
Hi Simon,
On Thu, May 19, 2011 at 10:46 PM, Simon Meier
wrote: Write achieves this separation, but it has some costs which I'm not entirely comfortable with.
First, it leads to lots of API duplication. For every type (e.g. Word) we want to be able serialize we have two morally identical functions
writeWordhost :: Word -> Write fromWordhost :: Word -> Builder
in the API, where the latter simply calls the former and does some additional "wrapping".
Yes, I agree with this duplication. I'll explain below what we gain from it. Note that I factored out the whole Write stuff into its own library (system-io-write) for the bytestring integration. Therefore, an end-user of bytestring will only see the Builder versions except he's doing more low-level stuff to gain some extra performance.
There are (at least) two cases where I think the simple Builder API must perform well for it to be usable on its own: simple loops and sequential writes. To be specific, if the following two cases don't compile into near optimal code, there's a compiler bug we should fix. First, a simple loop:
f :: [Word8] -> Builder f [] = mempty f (x:xs) = singleton x `mappend` xs
This code is already quite low level, there should be enough information here for the compiler to emit a simple loop with one buffer bounds check per iteration. Second, a bunch of sequential writes:
g :: Word8 -> Word8 -> Word8 -> Word8 -> Builder g a b c d = singleton `mappend` (b `mappend` (c `mappend` d))
This ought to compile to a single bounds check followed by for memory writes.
The user shouldn't have to get more low-level than this in these simple examples. Today this is currently only true for the second example, which we can solve using rewrite rules. The first example doesn't work due to the GHC compiler bug I mentioned.
I agree with you that simple uses of the Builder API should be optimized well. I think we can also guide the user indirectly by making writing the efficient code even easier than writing the possibly dangerous one. I'm thinking about providing type-classes for standard encodings. For example, class Utf8 a where utf8 :: a -> Builder instance Utf8 String where utf8 = fromWriteList writeCharUtf8 -- further instances: Char, Text This yields another way of navigating around difficult optimization territory.
Simon, is the reason for this duplication this comment on top of Blaze.ByteString.Builder.Word?
"<snip>"
That's one of the reasons, but not the main one. The core reason is that Write's provide an interface between implementors of the low-level bit-twiddling required to efficiently implement encodings like UTF-8 or Base16 and the providers of efficient traversal functions through (streams of) Haskell values. For simple traversals like
fromWrite :: Write a -> a -> Builder fromWriteList :: Write a -> [a] -> Builder fromWriteUnfoldr :: Write b -> (a -> Maybe (b, a)) -> a -> Builder
there might be the option that GHC is clever enough and can find the efficient loop. However, for more complicated functions like
mapWriteByteString :: Write Word8 -> S.ByteString -> Builder
That certainly isn't the case. I'm using quite a few tricks there [3] to enable a tight inner loop with few live variables.
Right. So this argues for having an escape hatch, and I agree we should have one. Write at writeAtMost are both such escape hatches and I believe them to equal in expressiveness. This shouldn't come as a surprise as Write is writeAtMost with one argument reified into into a constructor field:
writeAtMost :: Int -> (Ptr Word8 -> IO Int) -> IO () data Write = {-# UNPACK #-} !Int (Ptr Word8 -> IO (Ptr Word8))
(That the second argument of writeAtMost is an Int instead of a Ptr Word8 as in Write is an unimportant difference.)
There, seems to be a historical artefact here. The new Write abstraction in system-io-write is different from the one used in blaze-builder. It's type is data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8)) This definition ensures that the bound on the number of bytes written is independent of the value being encoded. That's crucial for the implementation of `mapWriteByteString`. It also benefits the other Write combinators, as the bound can always be computed in a data-independent fashion. Inlining, is therefore really sufficient to arrive at a constant bound during compile time. I don't see how this Write type can be emulated using `writeAtMost`, do you?
There are some operational differences.
* The argument to Write can be inspected at runtime, while the argument to writeAtMost can only be inspected at compile time (by a rewrite rule).
* Write might exist at runtime, if it's allocation site cannot be seen by its use site, which hard to guarantee in general (it requires serious staring at Core). This is not the case for writeAtMost, unless it's partially applied.
Hmm, all my Writes are top-level function definitions annotated with {-# INLINE #-}. Moreover, all combinators for Writes are also inlined and all their calls are saturated. Therefore, I thought GHC is capable of optimizing away the pattern matches on the Write constructor. I'm happy to remove Writes, if there's a superior way of sharing the low-level encoding code that they abstract. However, I did peek at Core from time to time and found that the Write constructors were optimized away. I currently see Writes as an expert domain to be used by authors of libraries like bytestring, text, aeson, blaze-html, etc. With appropriate documentation and benchmarks I expect them to be able to make good choices w.r.t. inlining and partial application.
* The second field of Write is lazy. I'm not sure what, if any, implications this might have for how GHC optimizes the code.
In my opinion, Writes and Builders have different use-cases and different semantics. Providing a type modeling Writes makes therefore sense to me. Moreover, note that Writes are built as a compile time abstraction. All their definitions are intended to be completely inlined and care is taken that the inliner also does so. Therefore, they incur no runtime cost.
This is up to the user of the Write abstraction to ensure, as any function that takes a Write as an argument must have the correct INLINE incantations applied to make this happen.
I agree, as said above. best regards, Simon

On Fri, May 20, 2011 at 11:12 PM, Simon Meier
There, seems to be a historical artefact here. The new Write abstraction in system-io-write is different from the one used in blaze-builder. It's type is
data Write a = Write Int (a -> Ptr Word8 -> IO (Ptr Word8))
This definition ensures that the bound on the number of bytes written is independent of the value being encoded. That's crucial for the implementation of `mapWriteByteString`. It also benefits the other Write combinators, as the bound can always be computed in a data-independent fashion. Inlining, is therefore really sufficient to arrive at a constant bound during compile time.
I don't see why this makes a difference, you could still do myWrite x = Write (length x) (\ _ p -> pokePokePoke p x)
I don't see how this Write type can be emulated using `writeAtMost`, do you?
There's no difference, as I showed above. Both can result in data dependent lengths. It's up to the programmer to make sure the length is independent of the value being written, when so desired.
Hmm, all my Writes are top-level function definitions annotated with {-# INLINE #-}. Moreover, all combinators for Writes are also inlined and all their calls are saturated. Therefore, I thought GHC is capable of optimizing away the pattern matches on the Write constructor.
You also need to make all top-level functions non-recursive but from what I remember you did so. The case for Writes is the same as for higher-order arguments, the call site must meet the definition site. So if you have something like: myWrite :: Write Word8 writeList :: Write a -> [a] -> ... f xs = writeList myWrite xs we need to make sure both myWrite and writeList are inlined into f. The case is similar for writeAtMost. The question is what happens if the user ever fails to get everything to inline optimally. In the writeAtMost case just have an indirect function call instead of a direct one. In the Write case we also have extra allocation and indirection. We've had such problems in e.g. attoparsec. While things should inline properly in big programs they rarely do. Same problem exists for fusion where fusion constructors end up in the final program although they should have been eliminated.
I'm happy to remove Writes, if there's a superior way of sharing the low-level encoding code that they abstract. However, I did peek at Core from time to time and found that the Write constructors were optimized away. I currently see Writes as an expert domain to be used by authors of libraries like bytestring, text, aeson, blaze-html, etc. With appropriate documentation and benchmarks I expect them to be able to make good choices w.r.t. inlining and partial application.
I agree. Writes (and writeAtMost) would be the domain of experts. If we expects write to be reused a lot it might make sense to have a separate Write type. Note that I'd be reluctant to see dependencies that involve I/O underneath bytestring as it's designed as a pure data structure library (and is likely to have things involving I/O on top of it). Cheers, Johan

On 19 May 2011 10:53, Johan Tibell
Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch
BTW I'm working with Roman Leshchinskiy to create the vector-bytestring package which provides: type ByteString = Data.Vector.Storable.Vector Word8 and exports the same API as the bytestring package (no support for lazy bytestrings yet) A storable vector still uses a ForeignPtr but maybe this will make the switch to unboxed Vectors (which use ByteArray#) easier. Expect some code to be up somewhere next week. Bas

2011/5/20 Bas van Dijk
On 19 May 2011 10:53, Johan Tibell
wrote: Long term we'd like to switch bytestring over from ForeignPtr to ByteArray#, if possible. There are currently some technical obstacles to such a switch
BTW I'm working with Roman Leshchinskiy to create the vector-bytestring package which provides: type ByteString = Data.Vector.Storable.Vector Word8 and exports the same API as the bytestring package (no support for lazy bytestrings yet)
A storable vector still uses a ForeignPtr but maybe this will make the switch to unboxed Vectors (which use ByteArray#) easier.
Expect some code to be up somewhere next week.
Bas
OK, that sounds interesting. I'm looking forward to your code. One trick, I'm using in the blaze-builder implementation is that the current buffer is denoted by a region (pf :: Ptr Word8, pe :: Ptr Word8) of the next free byte `pf` and the first byte after the buffer `pe`. This only works for pinned ByteArrays/ForeignPtrs. Do you know what the cost of such an array/foreignptr is? Moreover, after creation I could "unpin" the array. Do you perhaps know if thats possible in principle? best regards, Simon

Simon Meier schrieb:
There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions.
writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32
I would like to avoid naming all these encodings individually.
Maybe this one helps: http://hackage.haskell.org/packages/archive/storable-endian/0.2.4/doc/html/D... ?

2011/5/20 Henning Thielemann
Simon Meier schrieb:
There are many providers of Writes. Each bounded-length-encoding of a standard Haskell value is likely to have a corresponding Write. For example, encoding an Int32 as a big-endian, little-endian, and host-endian byte-sequence is currently achieved with the following three functions.
writeInt32BE :: Write Int32 writeInt32LE :: Write Int32 writeInt32HE :: Write Int32
I would like to avoid naming all these encodings individually.
Maybe this one helps: http://hackage.haskell.org/packages/archive/storable-endian/0.2.4/doc/html/D... ?
Thanks. I didn't know of that package. It for sure provides some valuable input. Moreover, it pointed me to Antoine Latter's byteorder package (http://hackage.haskell.org/package/byteorder). I might use it, although I would rather have the ByteOrder determined at compile time. Shouldn't GHC provide some means to determine this, as it's compiling for a fixed architecture anyway? What experience have you made Antoine when you implemented this package? best regards, Simon
participants (7)
-
Antoine Latter
-
Bas van Dijk
-
Henk-Jan van Tuyl
-
Henning Thielemann
-
Johan Tibell
-
Simon Meier
-
wren ng thornton