Haskell (Byte)Strings - wrong to separate content from encoding?


On Fri, 2010-03-19 at 18:45 +0100, Mads Lindstrøm wrote:
> Hi
>
> More and more libraries use ByteStrings these days. And it is great that we can get fast string handling in Haskell, but is ByteString the right level of abstraction for most uses?
>
> It seems to me that libraries, like the Happstack server, should use a string type which contains both the content (what ByteString contains today) and the encoding. After all, the data in a ByteString has no meaning if we do not know its encoding.
>
> An example will illustrate my point. If your web app, implemented with Happstack, receives a request, it looks like http://happstack.com/docs/0.4/happstack-server/Happstack-Server-HTTP-Types.h... :
>
>     data Request = Request
>         { ...
>         , rqHeaders :: Headers
>         , ...
>         , rqBody    :: RqBody
>         , ...
>         }
>
>     newtype RqBody = Body ByteString
>
> To actually read the body, you need to find the content-type header and use some encoding-conversion package to know what the ByteString actually means. Furthermore, some other library may need to consume the ByteString. Now you need to know which encoding the consumer expects...
>
> But all this seems avoidable if Happstack returned a string type which included both content and encoding.
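Concretely, the kind of string type Mads is describing might look something like this. This is a minimal sketch, not an actual Happstack proposal: EncodedString, Encoding, and toText are made-up names, and the decoding functions come from the text package.

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Hypothetical: the encodings the server actually distinguishes.
    data Encoding = UTF8 | UTF16LE | Latin1

    -- Hypothetical: content paired with the encoding it arrived in.
    data EncodedString = EncodedString Encoding B.ByteString

    -- One total decoding function, instead of header inspection and
    -- guesswork at every call site.
    toText :: EncodedString -> T.Text
    toText (EncodedString UTF8    bs) = TE.decodeUtf8    bs
    toText (EncodedString UTF16LE bs) = TE.decodeUtf16LE bs
    toText (EncodedString Latin1  bs) = TE.decodeLatin1  bs

With something along these lines in the Request type, the Content-Type lookup would happen once, inside the server, rather than in every consumer.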
I guess the problem is that the body does not necessarily have to be text. It can just as well be a GIF, an MP3, etc. So you would need to have something like:

    data RqBody = Text MIME String | Binary MIME ByteString
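A consumer of such a body type could then branch on the constructor instead of re-reading the Content-Type header. A sketch, with MIME left as a plain String stand-in since Maciej leaves it abstract:

    import qualified Data.ByteString as B

    -- Stand-in for whatever media-type representation is chosen.
    type MIME = String

    data RqBody = Text MIME String | Binary MIME B.ByteString

    -- The text/binary decision is made once, by the constructor, not by
    -- every function that touches the body.
    summarize :: RqBody -> String
    summarize (Text   mime s)  = mime ++ ": " ++ show (length s)    ++ " characters"
    summarize (Binary mime bs) = mime ++ ": " ++ show (B.length bs) ++ " bytes"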
> I could make a similar story about reading text files.
>
> If some data structure contains a lot of small strings, having both encoding and content for each string is wasteful. Thus, I am not suggesting that ByteString should be scrapped, just that ordinarily programmers should not have to think about string encodings.
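One way to get that without paying for a tag on every string is to decode once at the program's boundary into a single fixed internal representation, which is roughly the trade-off the text package makes. A minimal sketch, assuming UTF-8 on the wire:

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    -- Decode at the edge; afterwards every T.Text shares one internal
    -- representation, so a structure full of small strings carries no
    -- per-string encoding field. (decodeUtf8 errors on invalid input;
    -- decodeUtf8' returns an Either instead.)
    fromWire :: B.ByteString -> T.Text
    fromWire = TE.decodeUtf8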
In network programming you have to think about encoding: there are (or were) too many sites encoded in IBM codepages (not much of a problem for English-speaking users). Worse, I have read some HTML tutorials which suggested that adding a meta content-type tag automatically changes the page to an ISO encoding ;)
> An alternative to having a String type which contains both content and encoding would be standardizing on some encoding like UTF-8. I realize that we have the utf8-string package on Hackage, but people (at least Happstack and Network.HTTP) seem to prefer ByteString. I wonder why.
>
> Greetings,
>
> Mads Lindstrøm
Hopefully most of these problems are gone as the world moves to UTF-8. But still:

- Other Unicode encodings are used (for example, ones with a fixed length per character)
- Other data types are used (such as binary)

In many cases you cannot depend on the MIME type always being correct. In some cases you don't need character recoding anyway (you store the data directly in a database, or you want to compress it). Additionally, you may want to compute a checksum of a string. However, recoding UTF-16 -> ... -> UTF-16 may change the contents (the byte-order mark at the beginning) and therefore the checksum.

Regards
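Maciej's last point can be made concrete with a small sketch. roundTripUtf16 is a made-up helper standing in for any BOM-aware recoding step; the point is only that decode-then-reencode is not the identity on bytes, so byte-level checksums change:

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    -- Strip a little-endian byte-order mark if present, decode, and
    -- re-encode in canonical UTF-16LE (without a BOM).
    roundTripUtf16 :: B.ByteString -> B.ByteString
    roundTripUtf16 bs =
      let body | B.take 2 bs == B.pack [0xFF, 0xFE] = B.drop 2 bs
               | otherwise                          = bs
      in  TE.encodeUtf16LE (TE.decodeUtf16LE body)

    main :: IO ()
    main = do
      let original = B.pack [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00]  -- BOM ++ "hi"
      -- Prints False: the BOM is gone, and a checksum over the bytes
      -- would change with it.
      print (roundTripUtf16 original == original)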