
2012/3/12 Jeremy Shaw
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. Describing URIs in terms of Unicode code points is at variance with that.
Not sure what you mean by this. As the RFC says, a URI is defined entirely by the identity of the characters that are used. There is definitely no single, correct byte sequence for representing a URI. If I give you a sequence of bytes and tell you it is a URI, the only way to decode it is to first know what encoding the byte sequence represents: ASCII, UTF-16, etc. Only once you have decoded the byte sequence into a sequence of characters can you parse the URI.
Hmm. Well, I have been reading the spec the other way around: first you parse the URI to get the bytes, then you use encoding information to interpret the bytes. I think this curious passage from Section 2.5 is interesting to consider here:

  For most systems, an unreserved character appearing within a URI component is interpreted as representing the data octet corresponding to that character's encoding in US-ASCII. Consumers of URIs assume that the letter "X" corresponds to the octet "01011000", and even when that assumption is incorrect, there is no harm in making it. A system that internally provides identifiers in the form of a different character encoding, such as EBCDIC, will generally perform character translation of textual identifiers to UTF-8 [STD63] (or some other superset of the US-ASCII character encoding) at an internal interface, thereby providing more meaningful identifiers than those resulting from simply percent-encoding the original octets.

I am really not sure how to interpret this. I have been reading '%' in productions as '0b00100101' and I have written my parser this way; but that is probably backwards thinking.
...let's say we have the path segments ["foo", "bar/baz"] and we wish to use them in the path info of a URI. Because / is a special character it must be percent encoded as %2F. So, the path info for the url would be:
foo/bar%2Fbaz
If we had the path segments, ["foo","bar","baz"], however that would be encoded as:
foo/bar/baz
Now let's look at decoding the path. If we simply decode the percent encoded characters and give the user a ByteString then both urls will decode to:
pack "foo/bar/baz"
Which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"] represent different paths. The percent encoding there is required to distinguish between the two unique paths.
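To make the distinction concrete, here is a minimal Haskell sketch of the point above. It is not code from either poster's library: the names encodePath, decodePath, unescape and splitOnSlash are hypothetical, and it assumes path segments contain only ASCII characters (anything else would first have to be turned into octets, for example via UTF-8). It escapes every character outside RFC 3986's unreserved set, which is more conservative than strictly required.

```haskell
import Data.Char (chr, digitToInt, intToDigit, isAlphaNum, isAscii, isHexDigit, ord, toUpper)
import Data.List (intercalate)

-- RFC 3986 "unreserved" characters: ASCII letters, digits, and "-._~".
isUnreserved :: Char -> Bool
isUnreserved c = (isAscii c && isAlphaNum c) || c `elem` "-._~"

-- Percent-encode one character. ASSUMPTION: the character is ASCII,
-- so its code point fits in a single octet.
escapeChar :: Char -> String
escapeChar c
  | isUnreserved c = [c]
  | otherwise      = ['%', hexDigit (n `div` 16), hexDigit (n `mod` 16)]
  where
    n        = ord c
    hexDigit = toUpper . intToDigit

-- Encode each segment on its own, then join with literal slashes.
encodePath :: [String] -> String
encodePath = intercalate "/" . map (concatMap escapeChar)

-- Replace every well-formed %XX triplet with the character it names;
-- malformed escapes are passed through unchanged.
unescape :: String -> String
unescape ('%' : a : b : rest)
  | isHexDigit a && isHexDigit b =
      chr (16 * digitToInt a + digitToInt b) : unescape rest
unescape (c : rest) = c : unescape rest
unescape []         = []

-- Split on literal slashes only; an encoded %2F is not a separator.
splitOnSlash :: String -> [String]
splitOnSlash s = case break (== '/') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnSlash rest

-- Split first, decode second, so the segment boundaries survive.
decodePath :: String -> [String]
decodePath = map unescape . splitOnSlash
```

Loaded in GHCi, decodePath gives different segment lists for "foo/bar%2Fbaz" and "foo/bar/baz", whereas applying unescape to the whole path collapses both to the same string, which is the ambiguity described above.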
I read the section on paths differently: a path is a sequence of bytes, wherein slash runs are not permitted, among other rules. However, re-reading the section, a big to-do is made about hierarchical data and path normalization; it really seems your interpretation is the correct one. I tried it out in cURL, for example:

  http://www.ietf.org/rfc%2Frfc3986.txt   # 404 Not Found
  http://www.ietf.org/rfc/rfc3986.txt     # 200 OK

My recently released URL parser/pretty-printer is actually wrong in its handling of paths and, when corrected, will only amount to a parser of URLs that are encoded in US-ASCII and supersets thereof.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B