Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

12 Mar 2012

2012/3/12 Jeremy Shaw:
...
...
The syntax of URIs is a mechanism for describing data octets,
not Unicode code points. It is at variance with that syntax to
describe URIs in terms of Unicode code points.
Not sure what you mean by this. As the RFC says, a URI is defined entirely
by the identity of the characters that are used. There is definitely no
single, correct byte sequence for representing a URI. If I give you a
sequence of bytes and tell you it is a URI, the only way to decode it is to
first know what encoding the byte sequence represents: ASCII, UTF-16, etc.
Once you have decoded the byte sequence into a sequence of characters, only
then can you parse the URI.
Hmm. Well, I have been reading the spec the other way around:
first you parse the URI to get the bytes, then you use encoding
information to interpret the bytes. I think this curious passage
from Section 2.5 is interesting to consider here:

   For most systems, an unreserved character appearing within a URI
   component is interpreted as representing the data octet corresponding
   to that character's encoding in US-ASCII.  Consumers of URIs assume
   that the letter "X" corresponds to the octet "01011000", and even
   when that assumption is incorrect, there is no harm in making it.  A
   system that internally provides identifiers in the form of a
   different character encoding, such as EBCDIC, will generally perform
   character translation of textual identifiers to UTF-8 [STD63] (or
   some other superset of the US-ASCII character encoding) at an
   internal interface, thereby providing more meaningful identifiers
   than those resulting from simply percent-encoding the original
   octets.

I am really not sure how to interpret this. I have been reading
'%' in productions as '0b00100101' and I have written my parser
this way; but that is probably backwards thinking.
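
As a small check of the octet values mentioned above, here is a
minimal sketch using only base. The name asciiOctetBits is
hypothetical, and the function assumes its argument lies in the
US-ASCII range.

```haskell
import Data.Bits (testBit)
import Data.Char (ord)

-- Render the US-ASCII octet of a character as eight binary digits,
-- most significant bit first.
-- Assumption: the character is in the US-ASCII range (code point < 128).
asciiOctetBits :: Char -> String
asciiOctetBits c =
  [ if testBit (ord c) i then '1' else '0' | i <- [7, 6 .. 0] ]
```

In GHCi, asciiOctetBits 'X' gives "01011000" (the octet quoted from
Section 2.5) and asciiOctetBits '%' gives "00100101".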
...
...let's say we have the path segments ["foo", "bar/baz"] and we wish to use
them in the path info of a URI. Because / is a special character it must be
percent encoded as %2F. So, the path info for the url would be:
 foo/bar%2Fbaz
If we had the path segments, ["foo","bar","baz"], however that would be
encoded as:
 foo/bar/baz
Now let's look at decoding the path. If we simply decode the percent encoded
characters and give the user a ByteString then both urls will decode to:
 pack "foo/bar/baz"
Which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"] represent
different paths. The percent encoding there is required to distinguish
between the two unique paths.
I read the section on paths differently: a path is a sequence of
bytes, wherein slash runs are not permitted, among other rules.
However, re-reading the section, a big to-do is made about
hierarchical data and path normalization; it really seems your
interpretation is the correct one. I tried it out in cURL, for
example:

  http://www.ietf.org/rfc%2Frfc3986.txt     # 404 Not Found
  http://www.ietf.org/rfc/rfc3986.txt       # 200 OK
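
The difference between the two paths can be sketched in Haskell
along these lines. This is only an illustration, not the code of
any released parser: all the names are hypothetical, and it
simplifies by mapping each decoded octet straight to the Char with
the same code point, which is only meaningful for US-ASCII data.

```haskell
import Data.Char (chr, digitToInt, isHexDigit)

-- Split a raw (still percent-encoded) path on literal '/' characters.
splitOnSlash :: String -> [String]
splitOnSlash s = case break (== '/') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnSlash rest

-- Decode %XX escapes; malformed escapes are passed through unchanged.
-- Simplification: each octet becomes the Char with that code point.
percentDecode :: String -> String
percentDecode ('%' : a : b : rest)
  | isHexDigit a && isHexDigit b =
      chr (16 * digitToInt a + digitToInt b) : percentDecode rest
percentDecode (c : rest) = c : percentDecode rest
percentDecode []         = []

-- Split first, decode second: an encoded slash stays inside its segment.
pathSegments :: String -> [String]
pathSegments = map percentDecode . splitOnSlash

-- Decode first, split second: the two distinct paths collapse into one.
naiveSegments :: String -> [String]
naiveSegments = splitOnSlash . percentDecode
```

Here pathSegments "foo/bar%2Fbaz" yields ["foo","bar/baz"], while
naiveSegments "foo/bar%2Fbaz" yields ["foo","bar","baz"], the same
result as for "foo/bar/baz".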

My recently released URL parser/pretty-printer is
actually wrong in its handling of paths and, when corrected,
will only amount to a parser of URLs that are encoded in
US-ASCII and supersets thereof.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B