
Hi,

I only just noticed this discussion. Essentially, I think you have arrived at the right conclusion regarding URIs.

For more background, the IRI document makes interesting reading in this context: http://tools.ietf.org/html/rfc3987, especially sections 2 and 2.1. An IRI is defined in terms of Unicode characters, which themselves may be described/referenced in terms of their code points, but the character encoding is not prescribed.

In practice, I think systems are increasingly using UTF-8 for transmitting IRIs and URIs, and using either UTF-8 or UTF-16 for internal storage. There is still a legacy of ISO-8859-1 being defined as the default charset for HTML (cf. http://www.w3.org/International/O-HTTP-charset for further discussion).

#g
--

On 14/03/2012 06:43, Jason Dusek wrote:
2012/3/12 Jeremy Shaw:
On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek wrote:
Well, to quote one example from RFC 3986:
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
Right. This describes how to convert an octet into a sequence of characters, since the only thing that can appear in a URI is sequences of characters.
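To make the octet-to-characters step concrete, here is a minimal Python sketch. It is not taken from the RFC or from anyone's library in this thread; the function names `percent_encode_octet` and `encode_octets` are made up for illustration, and the sketch assumes the simplest policy of escaping every octet that is not an unreserved character.

```python
# Unreserved characters per RFC 3986 section 2.3.
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode_octet(octet: int) -> str:
    """Return the three-character triplet ("%" + two hex digits) for one octet."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("an octet must be in the range 0..255")
    # RFC 3986 section 2.1 recommends uppercase hex digits for producers.
    return "%{:02X}".format(octet)


def encode_octets(data: bytes) -> str:
    """Represent a sequence of octets as a sequence of URI characters.

    Assumption for this sketch: every octet that does not correspond to
    an unreserved character is escaped, including delimiters such as "/".
    """
    return "".join(
        chr(b) if chr(b) in UNRESERVED else percent_encode_octet(b)
        for b in data
    )


print(encode_octets(b"a b/\xff"))
```

Note that the result is a string of characters, not bytes: how those characters are later stored or transmitted is a separate question.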
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
Not sure what you mean by this. As the RFC says, a URI is defined entirely by the identity of the characters that are used. There is definitely no single, correct byte sequence for representing a URI. If I give you a sequence of bytes and tell you it is a URI, the only way to decode it is to first know what encoding the byte sequence represents: ASCII, UTF-16, etc. Once you have decoded the byte sequence into a sequence of characters, only then can you parse the URI.
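A small Python sketch of that point, assuming a made-up URI (`http://example.org/a%20b`) and a hypothetical helper `parse_uri_bytes`: the same sequence of characters yields different byte sequences under different encodings, and parsing only makes sense after decoding with the right one.

```python
from urllib.parse import urlsplit

# One URI, defined purely as a sequence of characters (hypothetical example).
uri = "http://example.org/a%20b"

# Two different byte sequences representing the same characters.
as_ascii = uri.encode("ascii")
as_utf16 = uri.encode("utf-16-le")


def parse_uri_bytes(raw: bytes, encoding: str):
    """Decode bytes to characters first, then parse the characters as a URI."""
    text = raw.decode(encoding)
    return urlsplit(text)


from_ascii = parse_uri_bytes(as_ascii, "ascii")
from_utf16 = parse_uri_bytes(as_utf16, "utf-16-le")

print(as_ascii == as_utf16)      # the byte sequences differ
print(from_ascii == from_utf16)  # the parsed URIs are identical
print(from_ascii.netloc, from_ascii.path)
```

Without knowing the encoding, the UTF-16 bytes would not even be recognisable as starting with the `http` scheme, which is why a parser over raw bytes has to assume an encoding somewhere.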
Mr. Shaw,
Thanks for taking the time to explain all this. It has really helped me understand many parts of the URI spec much better. I have deprecated my module in the latest release
http://hackage.haskell.org/package/URLb-0.0.1
because a URL parser working on bytes instead of characters stands out to me now as a confused idea.
-- Jason Dusek pgp /// solidsnack 1FD4C6C1 FED18A2B
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe