Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

11 Mar 2012

2012/3/11 Jeremy Shaw:
...
> Also, URIs are not defined in terms of octets.. but in terms
> of characters.  If you write a URI down on a piece of paper --
> what octets are you using?  None.. it's some scribbles on a
> paper. It is the characters that are important, not the bit
> representation.

Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

   A percent-encoding mechanism is used to represent a data octet in a
   component when that octet's corresponding character is outside the
   allowed set or is being used as a delimiter of, or within, the
   component.

The syntax of URIs is a mechanism for describing data octets,
not Unicode code points. Describing URIs in terms of Unicode
code points is at variance with the RFC.
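
To make the point concrete, here is a minimal sketch of percent-decoding
a URI component into octets. The name pctDecode is made up for this
example and is not part of Network.URI; it assumes the input is the ASCII
text of a single component and uses only base and bytestring.

```haskell
import qualified Data.ByteString as B
import Data.Char (digitToInt, isHexDigit, ord)
import Data.Word (Word8)

-- | Hypothetical helper (not from Network.URI): decode the percent-escapes
-- in one URI component into raw octets. Returns Nothing on a malformed
-- escape or on a non-ASCII character.
pctDecode :: String -> Maybe B.ByteString
pctDecode = fmap B.pack . go
  where
    go :: String -> Maybe [Word8]
    go [] = Just []
    -- "%HH" stands for exactly one octet, whatever its value.
    go ('%' : h : l : rest)
      | isHexDigit h && isHexDigit l =
          (fromIntegral (16 * digitToInt h + digitToInt l) :) <$> go rest
    -- A '%' that is not followed by two hex digits is malformed.
    go ('%' : _) = Nothing
    -- Any other ASCII character stands for its own octet.
    go (c : rest)
      | ord c < 128 = (fromIntegral (ord c) :) <$> go rest
      | otherwise   = Nothing
```

With this, B.unpack <$> pctDecode "%FF%00" gives Just [255,0]: the
result is a sequence of octets, and nothing in the syntax says they
have to be text in any particular encoding.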
...
> If you render a URI in a utf-8 encoded document versus a
> utf-16 encoded document.. the octets will be different, but
> the meaning will be the same. Because it is the characters
> that are important. For a URI Text would be a more compact
> representation than String.. but ByteString is a bit dodgy
> since it is not well defined what those bytes represent.
> (though if you use a newtype wrapper around ByteString to
> declare that it is Ascii, then that would be fine).

This is all well and good for what a URI is parsed from and
what it is serialized to; but once parsed, the major
components of a URI are all octets, pure and simple. Take the
"host" part of the authority:

  host        = IP-literal / IPv4address / reg-name
  ...
  reg-name    = *( unreserved / pct-encoded / sub-delims )

Because pct-encoded can stand for any octet, the reg-name
production is enough to show that, once the host portion is
parsed, it could contain any bytes whatsoever. ByteString is the
only correct representation for a parsed host and userinfo, as
well as a parsed path, query or fragment.
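
As a rough sketch of what that could look like, here are types for a
parsed authority whose components hold octets. The names Host, Authority
and exampleAuthority are invented for illustration and are not the types
Network.URI exports; the field layout is only an assumption.

```haskell
import qualified Data.ByteString as B

-- Hypothetical types for illustration; not taken from Network.URI.
data Host
  = IPLiteral B.ByteString  -- contents of a bracketed IP-literal
  | IPv4 B.ByteString       -- dotted-quad text, kept as ASCII octets
  | RegName B.ByteString    -- octets left after percent-decoding
  deriving (Eq, Show)

data Authority = Authority
  { userinfo :: Maybe B.ByteString
  , host     :: Host
  , port     :: Maybe Int
  } deriving (Eq, Show)

-- The reg-name "%FF%FE.example" matches the grammar, yet its decoded
-- form begins with the octets 0xFF 0xFE, which are not valid UTF-8.
exampleAuthority :: Authority
exampleAuthority =
  Authority
    { userinfo = Nothing
    , host     = RegName (B.pack [0xFF, 0xFE, 46, 101, 120, 97, 109, 112, 108, 101])
    , port     = Nothing
    }
```

A String or Text field would have to pick an encoding for those two
leading octets; a ByteString field just stores them.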

--
Jason Dusek
pgp  ///  solidsnack  1FD4C6C1 FED18A2B