Re: [Haskell-cafe] Why so many strings in Network.URI, System.Posix and similar libraries?

2012/3/11 Thedward Blevins
On Sun, Mar 11, 2012 at 13:33, Jason Dusek wrote:
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. Describing URIs in terms of Unicode code points is at variance with that.
This claim is at odds with the RFC you quoted:
2. Characters
The URI syntax provides a method of encoding data, presumably for the sake of identifying a resource, as a sequence of characters. The URI characters are, in turn, frequently encoded as octets for transport or presentation. This specification does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters.
(Emphasis is mine)
The RFC is specifically agnostic about serialization. I generally agree that there are a lot of places where ByteString should be used, but I'm not convinced this is one of them.
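To make that encoding-agnosticism concrete, here is a small illustrative sketch (not from the thread; percentEncode is a hand-rolled helper, not a library function): the same character, U+00E9, percent-encodes to "%C3%A9" if characters are mapped to octets as UTF-8, but to "%E9" if mapped as Latin-1.

    -- Illustrative only; percentEncode is hand-rolled for this example.
    module PercentEncodingDemo where

    import qualified Data.ByteString as B
    import Data.Word (Word8)
    import Text.Printf (printf)

    -- Percent-encode every octet that is not an unreserved URI character
    -- (ALPHA / DIGIT / "-" / "." / "_" / "~", RFC 3986 Section 2.3).
    percentEncode :: B.ByteString -> String
    percentEncode = concatMap encodeOctet . B.unpack
      where
        encodeOctet :: Word8 -> String
        encodeOctet w
          | isUnreserved w = [toEnum (fromIntegral w)]
          | otherwise      = printf "%%%02X" (fromIntegral w :: Int)
        isUnreserved w =
          let c = toEnum (fromIntegral w) :: Char
          in  (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
           || (c >= '0' && c <= '9') || c `elem` "-._~"

    main :: IO ()
    main = do
      putStrLn (percentEncode (B.pack [0xC3, 0xA9]))  -- U+00E9 as UTF-8   -> "%C3%A9"
      putStrLn (percentEncode (B.pack [0xE9]))        -- U+00E9 as Latin-1 -> "%E9"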
Hi Thedward,

I am CC'ing the list, since you raise a good point that, I think, reflects on the discussion broadly.

It is true that the intent of the spec is to allow encoding of characters and not of bytes: I misread its intent, attending only to the productions. But due to the way URIs interact with character encoding, a general URI parser is constrained to work with ByteStrings just the same. The RFC "...does not mandate any particular character encoding for mapping between URI characters and the octets used to store or transmit those characters...", and in Section 1.2.1 it allows that the encoding may depend on the scheme:

In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced.

It seems possible for any octet, 0x00..0xFF, to show up in a URI, and it is only after parsing the scheme that we can say whether the octet belongs there or not. Thus a general URI parser can only go as far as splitting into components and percent-decoding before handing off to scheme-specific validation rules (but that's a big help already!). I've implemented a parser under these principles that handles URLs specifically:

http://hackage.haskell.org/package/URLb

Although the intent of the spec is to represent characters, I contend it does not succeed in doing so. Is it wise to assume more semantics than are actually there? The Internet and UNIX are full of broken junk; but faithful representation would seem to be better than idealization for those occasions where we must deal with them. I'm not sure the assumption of "textiness" really helps much in practice, since the Universal Character Set contains control codes and bidi characters -- data that isn't really text.

--
Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B
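A minimal sketch of that strategy, with illustrative names (splitScheme, percentDecode) rather than the actual URLb API: pull off the scheme, percent-decode the remainder into raw octets, and hand those octets to scheme-specific code.

    -- A sketch only: splitScheme and percentDecode are illustrative names,
    -- not the URLb API.
    module UriOctets where

    import qualified Data.ByteString as B
    import qualified Data.ByteString.Char8 as B8
    import Data.Char (digitToInt, isHexDigit)
    import Data.Word (Word8)

    -- Split "scheme:rest" at the first ':'; the scheme is the only component
    -- we can interpret without scheme-specific rules.
    splitScheme :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
    splitScheme uri =
      case B8.break (== ':') uri of
        (scheme, rest) | not (B.null rest) -> Just (scheme, B.drop 1 rest)
        _                                  -> Nothing

    -- Percent-decode into raw octets.  Any value 0x00..0xFF may result, so
    -- the output is a ByteString, not text in any particular encoding.
    percentDecode :: B.ByteString -> B.ByteString
    percentDecode = B.pack . go . B8.unpack
      where
        go :: String -> [Word8]
        go ('%' : hi : lo : rest)
          | isHexDigit hi && isHexDigit lo =
              fromIntegral (digitToInt hi * 16 + digitToInt lo) : go rest
        go (c : rest) = fromIntegral (fromEnum c) : go rest
        go []         = []

    -- The decoded octets here include 0x00 and 0xFF, which no scheme is
    -- obliged to treat as text.
    example :: Maybe (B.ByteString, B.ByteString)
    example = fmap (fmap percentDecode) (splitScheme (B8.pack "x-raw:%00%FF"))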

On Sun, Mar 11, 2012 at 23:05, Jason Dusek wrote:
Although the intent of the spec is to represent characters, I contend it does not succeed in doing so. Is it wise to assume more semantics than are actually there?
It is not; one of the reasons many experts protested the acceptance of this RFC is its incomplete specification (and as a result there are a lot of implementations currently which *do* assume more semantics, not always compatibly with each other). Punycode is "out there" now, but it's a mess and a minefield.

--
brandon s allbery    allbery.b@gmail.com
wandering unix systems administrator (available)    (412) 475-9364 vm/sms
participants (2)
- Brandon Allbery
- Jason Dusek