
2012/3/11 Jeremy Shaw
Also, URIs are not defined in terms of octets.. but in terms of characters. If you write a URI down on a piece of paper -- what octets are you using? None.. it's some scribbles on a paper. It is the characters that are important, not the bit representation.
Well, to quote one example from RFC 3986:

  2.1. Percent-Encoding

  A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.

The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance with the RFC to describe URIs in terms of Unicode code points.
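A minimal sketch of that point in Haskell, assuming only the base and bytestring packages. The name pctDecode is hypothetical, not a function from any URI library; it decodes the percent-escapes of a single URI component into the octets they denote, and the result is not required to be valid text in any encoding.

```haskell
import qualified Data.ByteString as B
import Data.Char (digitToInt, isHexDigit, ord)
import Data.Word (Word8)

-- Hypothetical helper, for illustration only: turn one URI component
-- into the octets it denotes. Returns Nothing on a malformed escape
-- or on a character outside ASCII.
pctDecode :: String -> Maybe B.ByteString
pctDecode = fmap B.pack . go
  where
    go :: String -> Maybe [Word8]
    go [] = Just []
    -- "%HL" stands for the single octet with hex value HL.
    go ('%' : h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> go rest
    -- Any other use of '%' is a malformed escape.
    go ('%' : _) = Nothing
    -- A literal ASCII character stands for its own octet.
    go (c : rest)
      | ord c < 128 = (fromIntegral (ord c) :) <$> go rest
      | otherwise = Nothing
```

With this sketch, "caf%C3%A9" decodes to octets that happen to be UTF-8, while "%FF%00" decodes to two octets that are not valid UTF-8 at all; both are syntactically acceptable percent-encoded data.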
If you render a URI in a utf-8 encoded document versus a utf-16 encoded document.. the octets will be different, but the meaning will be the same. Because it is the characters that are important. For a URI, Text would be a more compact representation than String, but ByteString is a bit dodgy since it is not well defined what those bytes represent (though if you use a newtype wrapper around ByteString to declare that it is ASCII, then that would be fine).
This is all well and good for what a URI is parsed from and what it is serialized to; but once parsed, the major components of a URI are all octets, pure and simple. Like the "host" part of the authority:

  host = IP-literal / IPv4address / reg-name
  ...
  reg-name = *( unreserved / pct-encoded / sub-delims )

The reg-name production is enough to show that, once the host portion is parsed, it could contain any bytes whatever. ByteString is the only correct representation for a parsed host and userinfo, as well as a parsed path, query or fragment.

--
Jason Dusek
pgp /// solidsnack 1FD4C6C1 FED18A2B