
2011/5/19 Antoine Latter
On Thu, May 19, 2011 at 3:06 PM, Simon Meier
wrote: The core problem that drove me towards this solution is the abundance of different IntX and WordX types, each of which requires a separate Write for the big-endian, little-endian, host-endian, lower-case-hex, and upper-case-hex encodings; i.e., currently, there are
    int8BE       :: Write Int8
    int16BE      :: Write Int16
    int32BE      :: Write Int32
    ...
    hexLowerInt8 :: Write Int8
    ...
and so on. As you can see (http://hackage.haskell.org/packages/archive/blaze-builder/0.3.0.1/doc/html/B...) this approach clutters the public API quite a bit. Hence, I'm thinking of using a separate type-class for each encoding; i.e.,
If Johan's work on Data.Binary and rewrite rules works out, then it would cut the exposed API in half, which helps.
We could then use the module and package system to keep the API cleaner still: builders that output a specific encoding could live in separate modules. This would also keep the function names short.
That would require coming up with logical divisions for the functions you're creating, and I don't understand the big picture enough to help with that.
    class BigEndian a where
        bigEndian :: Write a
This collapses the big-endian encodings of all 10 bounded-size (signed and unsigned) integer types under a single name with a well-defined semantics. Moreover, it's standard Haskell 98. For the hex-encodings, I'm thinking about providing type-classes
    class HexLower a where
        hexLower :: Write a
    class HexLowerNoLead a where
        hexLowerNoLead :: Write a
...
for ASCII encoding and each of the standard Unicode encodings in a separate module. The user can then select the right ones using qualified imports. In most cases, he won't even need qualification, as mixing different character encodings is seldom needed.
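To make the proposal above concrete, here is a minimal sketch of how one class per encoding could collapse the per-type names. It rests on assumptions: the `Write` below is a hypothetical stand-in (a pure function to a list of bytes) and not blaze-builder's actual buffer-writing type, and `beBytes` and `hexBytes` are helper names invented for illustration.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (intToDigit, ord)
import Data.Int (Int16)
import Data.Word (Word16, Word32, Word8)

-- Hypothetical stand-in for a Write: maps a value to the bytes it encodes to.
-- (Assumption: the real abstraction writes into a buffer instead of a list.)
newtype Write a = Write { runWrite :: a -> [Word8] }

-- One class per encoding, as proposed in the thread.
class BigEndian a where
  bigEndian :: Write a

-- Illustrative helper: the n low-order bytes of x, most significant first.
-- Going through Integer gives two's-complement bytes for negative values.
beBytes :: Integral a => Int -> a -> [Word8]
beBytes n x =
  [ fromIntegral (toInteger x `shiftR` (8 * i)) | i <- [n - 1, n - 2 .. 0] ]

instance BigEndian Word8  where bigEndian = Write (beBytes 1)
instance BigEndian Word16 where bigEndian = Write (beBytes 2)
instance BigEndian Int16  where bigEndian = Write (beBytes 2)
instance BigEndian Word32 where bigEndian = Write (beBytes 4)

class HexLower a where
  hexLower :: Write a

-- Illustrative helper: fixed-width lower-case hex, two ASCII digits per byte.
hexBytes :: [Word8] -> [Word8]
hexBytes = concatMap (\b -> [digit (b `shiftR` 4), digit (b .&. 0x0F)])
  where
    digit :: Word8 -> Word8
    digit = fromIntegral . ord . intToDigit . fromIntegral

instance HexLower Word8  where hexLower = Write (hexBytes . runWrite bigEndian)
instance HexLower Word16 where hexLower = Write (hexBytes . runWrite bigEndian)
```

With these instances, `runWrite bigEndian (0x1234 :: Word16)` and `runWrite bigEndian (-2 :: Int16)` go through the same name, and the argument type alone selects the width.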
I think we may be at cross-purposes here, and might not even be discussing the same thing - I would imagine that any sort of 'Builder' type included in the bytestring package would only provide the core combinators for packing data into low-level binary formats, so discussions about text encoding issues, converting to hexadecimal and HTML escaping are going above my head.
This seems like what the 'text' package was written for - to separate out the construction of textual data from choosing its encoding.
Are there use-cases where the 'text' package is too slow for this sort of approach?
Take care, Antoine
What do you think about such an interface? Is there another hidden catch that I'm not seeing? BTW, note that Writes are a pure compile-time abstraction and are meant to be completely inlined. In typical use cases there's no efficiency overhead stemming from these type-classes.
best regards, Simon
Yes, for example, using the current 'text' package is suboptimal for dynamically generating UTF-8 encoded HTML pages. The job is simple: the data, which is originally held in standard Haskell types (e.g., String), needs to be HTML escaped, UTF-8 encoded, and sprinkled with tags in between.
For blaze-html using blaze-builder, the cost for a tag is a memcpy of the corresponding tag, and the cost for a single character is one call to the nested case statement that determines whether the char needs to be escaped (one memcpy of its escaped version) or which bytes need to be written to UTF-8 encode the char. This solution works with a single output buffer.
For a solution using the text library, the cost of creating the underlying UTF-16 array is similar to the cost for blaze-builder. However, you now also need to UTF-8 encode the UTF-16 array. This costs you more than double, as you now also have to inspect every character of every tag. For ~50% of your data you suddenly have to spend a lot more effort!
I agree that the text library is a good choice for representing the Unicode data of an application. However, for high-performance applications it pays off to think of the output in binary form and exploit the offered shortcuts. That's where blaze-builder and the like come in.
thanks for your input, Simon
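The single-pass idea described above (escape and UTF-8 encode each character in one nested case, copy tags verbatim) can be sketched roughly as follows. This is an illustration under assumptions, not blaze-html's code: it produces a list of bytes instead of filling an output buffer, and `tag`, `escapeUtf8Char`, `utf8`, and `renderText` are hypothetical names.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Tags are ASCII and known to be safe, so they are copied without inspection
-- (a list copy here stands in for the memcpy mentioned in the thread).
tag :: String -> [Word8]
tag = map (fromIntegral . ord)

-- One decision per character: emit its escape sequence or its UTF-8 bytes.
escapeUtf8Char :: Char -> [Word8]
escapeUtf8Char c = case c of
  '<' -> tag "&lt;"
  '>' -> tag "&gt;"
  '&' -> tag "&amp;"
  '"' -> tag "&quot;"
  _   -> utf8 (ord c)

-- Plain UTF-8 encoding of a code point (surrogates are not rejected here).
utf8 :: Int -> [Word8]
utf8 n
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    hi, cont :: Int -> Word8
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Escaped, UTF-8 encoded text in a single traversal of the input.
renderText :: String -> [Word8]
renderText = concatMap escapeUtf8Char
```

A page fragment would then be assembled as `tag "<p>" ++ renderText userInput ++ tag "</p>"`, so only the user-supplied text is inspected character by character.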