
2012/3/11 Brandon Allbery
On Sun, Mar 11, 2012 at 14:33, Jason Dusek wrote:
    The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
You might want to take a glance at RFC 3492, though.
RFC 3492 covers Punycode, an approach to internationalized domain names. The relationship of RFC 3986 to the restrictions on the syntax of host names, as given by the DNS, is not simple. On the one hand, we have:

    This specification does not mandate a particular registered name lookup technology and therefore does not restrict the syntax of reg-name beyond what is necessary for interoperability.

The production for reg-name is very liberal about allowable octets:

    reg-name = *( unreserved / pct-encoded / sub-delims )

However, we also have:

    The reg-name syntax allows percent-encoded octets in order to represent non-ASCII registered names in a uniform way that is independent of the underlying name resolution technology. Non-ASCII characters must first be encoded according to UTF-8...

The argument for representing reg-names as Text is pretty strong since the only representable data under these rules is, indeed, Unicode code points.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B
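
P.S. A rough sketch of what the reg-name rule quoted from RFC 3986 implies in practice, written in Python only for brevity. The helper name reg_name_to_text is hypothetical, and the sketch assumes its input already matches the reg-name production, so it performs no syntax validation. Percent-decoding gives octets; because non-ASCII names must be UTF-8 encoded first, a strict UTF-8 decode either recovers Unicode code points or rejects the name.

```python
from urllib.parse import unquote_to_bytes


def reg_name_to_text(reg_name: str) -> str:
    """Decode a percent-encoded reg-name to Unicode text.

    Hypothetical helper: assumes the input already matches the reg-name
    production, so no syntax validation is done here.
    """
    # Percent-decoding yields raw octets, which is all the URI syntax describes.
    octets = unquote_to_bytes(reg_name)
    # Non-ASCII registered names are UTF-8 encoded before percent-encoding,
    # so a strict decode either recovers code points or raises an error.
    return octets.decode("utf-8", errors="strict")
```

For instance, reg_name_to_text("b%C3%BCcher.example") yields the text "bücher.example", while an input such as "%FF.example" raises UnicodeDecodeError because 0xFF is never valid in UTF-8.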