
2012/3/12 Jeremy Shaw
On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek wrote:
Well, to quote one example from RFC 3986:
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
Right. This describes how to convert an octet into a sequence of characters, since the only thing that can appear in a URI is sequences of characters.
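To make the direction of that conversion concrete, here is a minimal sketch in Haskell of turning octets into URI characters. The names `isUnreserved`, `percentEncodeOctet` and `percentEncode` are made up for illustration and are not taken from any package in this thread. It also assumes the simplest policy, escaping every octet that is not an unreserved character (RFC 3986, section 2.3); a real encoder would decide per component which reserved characters may stay literal.

```haskell
import Data.Char (chr, intToDigit, toUpper)
import Data.Word (Word8)

-- Hypothetical helper: the "unreserved" set from RFC 3986, section 2.3.
isUnreserved :: Char -> Bool
isUnreserved c = c `elem` (['A'..'Z'] ++ ['a'..'z'] ++ ['0'..'9'] ++ "-._~")

-- Represent one octet as URI characters. An octet that corresponds to an
-- unreserved ASCII character stands for itself; any other octet becomes
-- the three characters '%', high hex digit, low hex digit.
percentEncodeOctet :: Word8 -> String
percentEncodeOctet w
  | n < 128 && isUnreserved c = [c]
  | otherwise                 = ['%', hex (n `div` 16), hex (n `mod` 16)]
  where
    n   = fromIntegral w :: Int
    c   = chr n
    hex = toUpper . intToDigit

-- The input is octets; the output is a sequence of characters.
percentEncode :: [Word8] -> String
percentEncode = concatMap percentEncodeOctet
```

For example, `percentEncode [0x61, 0x20, 0xC3, 0xA9]` evaluates to `"a%20%C3%A9"`. Note the types: octets go in, characters come out.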
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance with the RFC to describe URIs in terms of Unicode code points.
Not sure what you mean by this. As the RFC says, a URI is defined entirely by the identity of the characters that are used. There is definitely no single, correct byte sequence for representing a URI. If I give you a sequence of bytes and tell you it is a URI, the only way to decode it is to first know which encoding the byte sequence uses: ASCII, UTF-16, etc. Once you have decoded the byte sequence into a sequence of characters, only then can you parse the URI.
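A small sketch of that point, using only `base`. All four function names are hypothetical, and the UTF-16 code is deliberately limited to big-endian, no byte-order mark, and no surrogate pairs, which is enough for the ASCII-range characters a URI is made of. The same character sequence yields two different byte sequences, and each one can only be turned back into characters if you know which encoding produced it.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- One byte per character; fails on anything outside ASCII.
encodeAscii :: String -> Maybe [Word8]
encodeAscii = traverse toByte
  where
    toByte c
      | ord c < 128 = Just (fromIntegral (ord c))
      | otherwise   = Nothing

decodeAscii :: [Word8] -> Maybe String
decodeAscii = traverse fromByte
  where
    fromByte w
      | w < 128   = Just (chr (fromIntegral w))
      | otherwise = Nothing

-- Two bytes per character, big-endian. Assumption: characters that would
-- need surrogate pairs are rejected rather than encoded.
encodeUtf16BE :: String -> Maybe [Word8]
encodeUtf16BE = fmap concat . traverse toUnit
  where
    toUnit c
      | n < 0xD800 || (n > 0xDFFF && n <= 0xFFFF) =
          Just [fromIntegral (n `div` 256), fromIntegral (n `mod` 256)]
      | otherwise = Nothing
      where n = ord c

decodeUtf16BE :: [Word8] -> Maybe String
decodeUtf16BE [] = Just ""
decodeUtf16BE [_] = Nothing  -- odd number of bytes
decodeUtf16BE (hi : lo : rest)
  | n >= 0xD800 && n <= 0xDFFF = Nothing  -- surrogates not handled here
  | otherwise                  = (chr n :) <$> decodeUtf16BE rest
  where n = fromIntegral hi * 256 + fromIntegral lo
```

With these definitions, `encodeAscii "a/"` is `Just [97,47]` while `encodeUtf16BE "a/"` is `Just [0,97,0,47]`: one URI, two byte sequences. Feeding the UTF-16 bytes to `decodeAscii` does not fail, but it returns a four-character string containing NULs rather than the original `"a/"`, so a URI parser has to be handed characters that were decoded with the right encoding.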
Mr. Shaw,
Thanks for taking the time to explain all this. It has really helped me understand many parts of the URI spec much better. I have deprecated my module in the latest release
http://hackage.haskell.org/package/URLb-0.0.1
because a URL parser that works on bytes instead of characters now strikes me as a confused idea.
--
Jason Dusek
pgp /// solidsnack 1FD4C6C1 FED18A2B