Why so many strings in Network.URI, System.Posix and similar libraries?

The content of URIs is defined in terms of octets in the RFC, and all Posix interfaces are byte streams and C strings, not character strings. Yet in Haskell, we find these objects exposed with String interfaces:
  :info Network.URI.URI
  data URI = URI {uriScheme :: String,
                  uriAuthority :: Maybe URIAuth,
                  uriPath :: String,
                  uriQuery :: String,
                  uriFragment :: String}
          -- Defined in Network.URI

  :info System.Posix.Env.getEnvironment
  System.Posix.Env.getEnvironment :: IO [(String, String)]
          -- Defined in System.Posix.Env

But there is no law that environment variables must be made of characters:

  :; export x=$'\xFF' ; echo -n $x | xxd -p
  ff
  :; locale
  LANG="en_US.UTF-8"

That the relationship between bytes and characters can be confusing, both in working with UNIX and in dealing with web protocols, is undeniable -- but it seems unwise to limit the options available to Haskell programmers in dealing with these systems.

-- Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B
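A minimal Haskell sketch of the same lookup through the String interface; what comes back for the 0xFF byte depends on the locale encoding GHC applies, which is exactly the ambiguity in question:

  import System.Posix.Env (getEnvironment)

  main :: IO ()
  main = do
    env <- getEnvironment
    case lookup "x" env of
      Nothing  -> putStrLn "x is not set"
      Just val -> print (map fromEnum val)  -- inspect the code points handed back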

It is mostly because those libraries are far older than Text and
ByteString, so String was the only choice at the time. Modernizing them is
good.. but would also break a lot of code. And in many core libraries, the
functions are required to have String types in order to be Haskell 98
compliant.
So, modernization is good. But also requires significant effort, and
someone willing to make that effort.
Also, URIs are not defined in terms of octets.. but in terms of characters.
If you write a URI down on a piece of paper -- what octets are you using?
None.. it's some scribbles on a paper. It is the characters that are
important, not the bit representation. If you render a URI in a utf-8
encoded document versus a utf-16 encoded document.. the octets will be
different, but the meaning will be the same. Because it is the characters
that are important. For a URI Text would be a more compact representation
than String.. but ByteString is a bit dodgy since it is not well defined
what those bytes represent. (though if you use a newtype wrapper around
ByteString to declare that it is Ascii, then that would be fine).
- jeremy
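A minimal sketch of the Ascii wrapper idea mentioned above (hypothetical names, not an existing library API):

  import           Data.ByteString (ByteString)
  import qualified Data.ByteString as BS

  -- A ByteString that is known to contain only 7-bit octets.
  newtype Ascii = Ascii ByteString
    deriving (Eq, Ord, Show)

  -- Smart constructor: accept the bytes only if every octet is below 0x80.
  mkAscii :: ByteString -> Maybe Ascii
  mkAscii bs
    | BS.all (< 0x80) bs = Just (Ascii bs)
    | otherwise          = Nothing

  fromAscii :: Ascii -> ByteString
  fromAscii (Ascii bs) = bs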

2012/3/11 Jeremy Shaw
Also, URIs are not defined in terms of octets.. but in terms of characters. If you write a URI down on a piece of paper -- what octets are you using? None.. it's some scribbles on a paper. It is the characters that are important, not the bit representation.
Well, to quote one example from RFC 3986:

  2.1.  Percent-Encoding

    A percent-encoding mechanism is used to represent a data octet in a
    component when that octet's corresponding character is outside the
    allowed set or is being used as a delimiter of, or within, the
    component.

The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
If you render a URI in a utf-8 encoded document versus a utf-16 encoded document.. the octets will be different, but the meaning will be the same. Because it is the characters that are important. For a URI Text would be a more compact representation than String.. but ByteString is a bit dodgy since it is not well defined what those bytes represent. (though if you use a newtype wrapper around ByteString to declare that it is Ascii, then that would be fine).
This is all fine, well and good for what a URI is parsed from and what it is serialized to; but once parsed, the major components of a URI are all octets, pure and simple. Like the "host" part of the authority:

  host        = IP-literal / IPv4address / reg-name
  ...
  reg-name    = *( unreserved / pct-encoded / sub-delims )

The reg-name production is enough to show that, once the host portion is parsed, it could contain any bytes whatever. ByteString is the only correct representation for a parsed host and userinfo, as well as a parsed path, query or fragment.

-- Jason Dusek
pgp /// solidsnack 1FD4C6C1 FED18A2B

On Sun, Mar 11, 2012 at 14:33, Jason Dusek
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
You might want to take a glance at RFC 3492, though.

--
brandon s allbery                                  allbery.b@gmail.com
wandering unix systems administrator (available)   (412) 475-9364 vm/sms

2012/3/11 Brandon Allbery
On Sun, Mar 11, 2012 at 14:33, Jason Dusek
wrote: The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
You might want to take a glance at RFC 3492, though.
RFC 3492 covers Punycode, an approach to internationalized domain names. The relationship of RFC 3986 to the restrictions on the syntax of host names, as given by the DNS, is not simple. On the one hand, we have:

  This specification does not mandate a particular registered name
  lookup technology and therefore does not restrict the syntax of
  reg-name beyond what is necessary for interoperability.

The production for reg-name is very liberal about allowable octets:

  reg-name    = *( unreserved / pct-encoded / sub-delims )

However, we also have:

  The reg-name syntax allows percent-encoded octets in order to
  represent non-ASCII registered names in a uniform way that is
  independent of the underlying name resolution technology. Non-ASCII
  characters must first be encoded according to UTF-8...

The argument for representing reg-names as Text is pretty strong since the only representable data under these rules is, indeed, Unicode code points.

-- Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B
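A rough sketch of that reading of reg-name (hypothetical helper names, not Network.URI's API): percent-decode the reg-name's octets, then UTF-8-decode them to Text:

  import qualified Data.ByteString as BS
  import           Data.Char (digitToInt, isHexDigit)
  import qualified Data.Text as T
  import           Data.Text.Encoding (decodeUtf8')
  import           Data.Word (Word8)

  -- Turn a reg-name such as "b%C3%BCcher" into its raw octets.
  percentDecode :: String -> Maybe [Word8]
  percentDecode []                 = Just []
  percentDecode ('%' : h : l : cs)
    | isHexDigit h && isHexDigit l =
        (fromIntegral (digitToInt h * 16 + digitToInt l) :) <$> percentDecode cs
  percentDecode ('%' : _)          = Nothing
  percentDecode (c : cs)           = (fromIntegral (fromEnum c) :) <$> percentDecode cs

  -- Per RFC 3986 the decoded octets of a reg-name are UTF-8, so Text is a
  -- faithful representation whenever decoding succeeds.
  regNameToText :: String -> Maybe T.Text
  regNameToText s = do
    octets <- percentDecode s
    either (const Nothing) Just (decodeUtf8' (BS.pack octets))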


Argh. Email fail.
Hopefully this time I have managed to reply-all to the list *and* keep the
unicode properly intact.
Sorry about any duplicates you may have received.
On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek
2012/3/11 Jeremy Shaw
: Also, URIs are not defined in terms of octets.. but in terms of characters. If you write a URI down on a piece of paper -- what octets are you using? None.. it's some scribbles on a paper. It is the characters that are important, not the bit representation.
To quote RFC 1738:

  URLs are sequences of characters, i.e., letters, digits, and special
  characters. A URL may be represented in a variety of ways: e.g., ink
  on paper, or a sequence of octets in a coded character set. The
  interpretation of a URL depends only on the identity of the
  characters used.

Well, to quote one example from RFC 3986:
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
Right. This describes how to convert an octet into a sequence of characters, since the only thing that can appear in a URI is sequences of characters.
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
Not sure what you mean by this. As the RFC says, a URI is defined entirely by the identity of the characters that are used. There is definitely no single, correct byte sequence for representing a URI. If I give you a sequence of bytes and tell you it is a URI, the only way to decode it is to first know what encoding the byte sequence represents.. ascii, utf-16, etc. Once you have decoded the byte sequence into a sequence of characters, only then can you parse the URI.
If you render a URI in a utf-8 encoded document versus a utf-16 encoded document.. the octets will be different, but the meaning will be the same. Because it is the characters that are important. For a URI Text would be a more compact representation than String.. but ByteString is a bit dodgy since it is not well defined what those bytes represent. (though if you use a newtype wrapper around ByteString to declare that it is Ascii, then that would be fine).

For example, let's say that we have a unicode string and we want to use it in the URI path.
This is all fine, well and good for what a URI is parsed from and what it is serialized to; but once parsed, the major components of a URI are all octets, pure and simple.
Not quite. We can not, for example, change uriPath to be a ByteString and decode any percent encoded characters for the user, because that would change the meaning of the path and break applications.

For example, let's say we have the path segments ["foo", "bar/baz"] and we wish to use them in the path info of a URI. Because / is a special character it must be percent encoded as %2F. So, the path info for the url would be:

  foo/bar%2Fbaz

If we had the path segments ["foo","bar","baz"], however, that would be encoded as:

  foo/bar/baz

Now let's look at decoding the path. If we simply decode the percent encoded characters and give the user a ByteString then both urls will decode to:

  pack "foo/bar/baz"

Which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"] represent different paths. The percent encoding there is required to distinguish between the two unique paths.

Let's look at another example. Let's say we want to encode the path segments:

  ["I❤λ"]

How do we do that? Well.. the RFCs do not mandate a specific way. While a URL is a sequence of characters -- the set of allowed characters is pretty restricted. So, we must use some application specific way to transform that string into something that is allowed in a uri path. We could do it by converting all characters to their unicode character numbers like:

  "u73u2764u03BB"

Since the string now only contains acceptable characters, we can easily convert it to a valid uri path. Later when someone requests that url, our application can convert it back to a unicode character sequence. Of course, no one actually uses that method. The commonly used (and I believe, officially endorsed, but not required) method is a bit more complicated.

1. first we take the string "I❤λ" and utf-8 encode it to get an octet sequence:

  49 e2 9d a4 ce bb

2. next we percent encode the bytes to get *back* to a character sequence (such as a String, Text, or Ascii):

  "I%E2%9D%A4%CE%BB"

So, that is the character sequence that would appear in the URI. *But* we do not yet have octets that we can transmit over the internet. We only have a sequence of characters. We must now convert those characters into octets. For example, let's say we put the url as an 'href' in an <a> tag in a web page that is UTF-16 encoded.

3. Now we must convert the character sequence to a (big endian) utf-16 octet sequence:

  00 49 00 25 00 45 00 32 00 25 00 39 00 44 00 25
  00 41 00 34 00 25 00 43 00 45 00 25 00 42 00 42

So those are the octets that actually get embedded in the utf-16 encoded .html document and transmitted over the net.

4. the browser then decodes the utf-16 web page and gets back the sequence of characters:

  "I%E2%9D%A4%CE%BB"

Note that here the browser has a sequence of characters -- we know nothing about how those bytes are represented internally by the browser. If the browser was written in Haskell it might be String or Text. Now let's say the browser wants to request the URL. It *must* encode the url as ASCII (as per the spec).

5. So, the browser encodes the string as the octet sequence:

  49 25 45 32 25 39 44 25 41 34 25 43 45 25 42 42

6. The server can now decode that sequence of octets back into a sequence of characters:

  "I%E2%9D%A4%CE%BB"

Now, the low-level Network.URI library can not really do much more than that, because it does not know what those octets are really supposed to mean (see the / example above).

7. the application specific code, however, knows that it should now first split the path on any / characters to get:

  ["I%E2%9D%A4%CE%BB"]

8. next it should percent decode each path segment to get a ByteString sequence:

  49 e2 9d a4 ce bb

9. And now it can utf-8 decode that octet sequence to get a unicode character sequence:

  I❤λ

So... the basic gist is that if you have unicode characters embedded in an html document, they will generally be encoded *three* different times. (First the unicode characters are converted to a utf-8 byte sequence, then the byte sequence is percent encoded, and then the percent encoded character sequence is encoded as another byte sequence). But, applications can choose to use other methods as well.

In terms of applicability to the URI type.. uriPath :: ByteString definitely does not work. It is possible that uriPath :: [ByteString] might work... assuming / is the only special character we need to worry about in the uriPath. But, doing all the breaking on '/' and the percent decoding may not be required for many applications. So, choosing to always do the extra work raises some concerns.

Also, even with uriPath :: [ByteString], we are losing some information. The browser is free to percent encode characters -- even if it is not required. For example the browser could request:

  "hello"

Or it could request:

  "%68%65%6c%6c%6f"

In this case the *meaning* is the same. So, doing the decoding is less problematic. But I wonder if there might still be cases where we want to distinguish between those two requests?

hope this helps.
- jeremy
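A compact sketch of steps 1-3 and 5 above in Haskell, using the text and bytestring packages; the helper names and the fixed choice of UTF-8 and big-endian UTF-16 are just the assumptions of the walkthrough, not Network.URI's API:

  {-# LANGUAGE OverloadedStrings #-}
  module Main where

  import           Data.Bits (shiftR, (.&.))
  import qualified Data.ByteString as BS
  import qualified Data.ByteString.Char8 as BC
  import           Data.Char (chr, intToDigit, isAlphaNum, toUpper)
  import           Data.List (intercalate)
  import qualified Data.Text as T
  import           Data.Text.Encoding (encodeUtf8, encodeUtf16BE)
  import           Data.Word (Word8)

  -- RFC 3986 "unreserved" characters never need escaping.
  isUnreserved :: Char -> Bool
  isUnreserved c = isAlphaNum c || c `elem` ("-._~" :: String)

  -- Percent-encode a single octet as "%XY".
  escapeOctet :: Word8 -> String
  escapeOctet w =
    '%' : map (toUpper . intToDigit . fromIntegral) [w `shiftR` 4, w .&. 0x0F]

  -- Steps 1 and 2: utf-8 encode a path segment, then percent-encode every
  -- octet that is not an unreserved ASCII character ('/' becomes "%2F").
  encodeSegment :: T.Text -> String
  encodeSegment = concatMap escape . BS.unpack . encodeUtf8
    where
      escape w
        | w < 0x80 && isUnreserved (chr (fromIntegral w)) = [chr (fromIntegral w)]
        | otherwise                                       = escapeOctet w

  encodePath :: [T.Text] -> String
  encodePath = intercalate "/" . map encodeSegment

  main :: IO ()
  main = do
    putStrLn (encodePath ["foo", "bar/baz"])    -- foo/bar%2Fbaz
    putStrLn (encodePath ["foo", "bar", "baz"]) -- foo/bar/baz
    let uriPath = encodePath ["I❤λ"]            -- "I%E2%9D%A4%CE%BB"
    putStrLn uriPath
    -- Step 3: the octets of those characters inside a utf-16 (big endian)
    -- document, printed here as decimal Word8 values.
    print (BS.unpack (encodeUtf16BE (T.pack uriPath)))
    -- Step 5: the same characters encoded as ASCII octets for the request.
    print (BS.unpack (BC.pack uriPath))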

2012/3/12 Jeremy Shaw
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
Not sure what you mean by this. As the RFC says, a URI is defined entirely by the identity of the characters that are used. There is definitely no single, correct byte sequence for representing a URI. If I give you a sequence of bytes and tell you it is a URI, the only way to decode it is to first know what encoding the byte sequence represents.. ascii, utf-16, etc. Once you have decoded the byte sequence into a sequence of characters, only then can you parse the URI.
Hmm. Well, I have been reading the spec the other way around: first you parse the URI to get the bytes, then you use encoding information to interpret the bytes. I think this curious passage from Section 2.5 is interesting to consider here:

  For most systems, an unreserved character appearing within a URI
  component is interpreted as representing the data octet corresponding
  to that character's encoding in US-ASCII. Consumers of URIs assume
  that the letter "X" corresponds to the octet "01011000", and even
  when that assumption is incorrect, there is no harm in making it. A
  system that internally provides identifiers in the form of a
  different character encoding, such as EBCDIC, will generally perform
  character translation of textual identifiers to UTF-8 [STD63] (or
  some other superset of the US-ASCII character encoding) at an
  internal interface, thereby providing more meaningful identifiers
  than those resulting from simply percent-encoding the original
  octets.

I am really not sure how to interpret this. I have been reading '%' in productions as '0b00100101' and I have written my parser this way; but that is probably backwards thinking.
...let's say we have the path segments ["foo", "bar/baz"] and we wish to use them in the path info of a URI. Because / is a special character it must be percent encoded as %2F. So, the path info for the url would be:
foo/bar%2Fbaz
If we had the path segments, ["foo","bar","baz"], however that would be encoded as:
foo/bar/baz
Now let's look at decoding the path. If we simply decode the percent encoded characters and give the user a ByteString then both urls will decode to:
pack "foo/bar/baz"
Which is incorrect. ["foo", "bar/baz"] and ["foo","bar","baz"] represent different paths. The percent encoding there is required to distinguish between the two unique paths.
I read the section on paths differently: a path is a sequence of bytes, wherein slash runs are not permitted, among other rules. However, re-reading the section, a big todo is made about hierarchical data and path normalization; it really seems your interpretation is the correct one. I tried it out in cURL, for example:

  http://www.ietf.org/rfc%2Frfc3986.txt   # 404 Not Found
  http://www.ietf.org/rfc/rfc3986.txt     # 200 OK

My recently released URL parser/pretty-printer is actually wrong in its handling of paths and, when corrected, will only amount to a parser of URLs that are encoded in US-ASCII and supersets thereof.

-- Jason Dusek
pgp // solidsnack // C1EBC57DC55144F35460C8DF1FD4C6C1FED18A2B

2012/3/12 Jeremy Shaw
On Sun, Mar 11, 2012 at 1:33 PM, Jason Dusek
wrote: Well, to quote one example from RFC 3986:
2.1. Percent-Encoding
A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component.
Right. This describes how to convert an octet into a sequence of characters, since the only thing that can appear in a URI is sequences of characters.
The syntax of URIs is a mechanism for describing data octets, not Unicode code points. It is at variance to describe URIs in terms of Unicode code points.
Not sure what you mean by this. As the RFC says, a URI is defined entirely by the identity of the characters that are used. There is definitely no single, correct byte sequence for representing a URI. If I give you a sequence of bytes and tell you it is a URI, the only way to decode it is to first know what encoding the byte sequence represents.. ascii, utf-16, etc. Once you have decoded the byte sequence into a sequence of characters, only then can you parse the URI.
Mr. Shaw,

Thanks for taking the time to explain all this. It's really helped me to understand a lot of parts of the URI spec a lot better. I have deprecated my module in the latest release

  http://hackage.haskell.org/package/URLb-0.0.1

because a URL parser working on bytes instead of characters stands out to me now as a confused idea.

-- Jason Dusek
pgp /// solidsnack 1FD4C6C1 FED18A2B

Hi,

I only just noticed this discussion. Essentially, I think you have arrived at the right conclusion regarding URIs.

For more background, the IRI document makes interesting reading in this context: http://tools.ietf.org/html/rfc3987; esp. sections 2, 2.1. The IRI is defined in terms of Unicode characters, which themselves may be described/referenced in terms of their code points, but the character encoding is not prescribed.

In practice, I think systems are increasingly using UTF-8 for transmitting IRIs and URIs, and using either UTF-8 or UTF-16 for internal storage. There is still a legacy of ISO-8859-1 being defined as the default charset for HTML (cf. http://www.w3.org/International/O-HTTP-charset for further discussion).

#g

Jason Dusek wrote:
:info System.Posix.Env.getEnvironment
System.Posix.Env.getEnvironment :: IO [(String, String)]
        -- Defined in System.Posix.Env
But there is no law that environment variables must be made of characters:
The recent ghc release provides:

  System.Posix.Env.ByteString.getEnvironment :: IO [(ByteString, ByteString)]

--
see shy jo
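A minimal usage sketch of that ByteString-based interface (from the unix package), which hands the environment octets back untouched:

  import qualified Data.ByteString as BS
  import qualified System.Posix.Env.ByteString as Env

  main :: IO ()
  main = do
    env <- Env.getEnvironment          -- IO [(ByteString, ByteString)]
    -- print each variable's name and its raw octets, no decoding applied
    mapM_ (\(k, v) -> print (k, BS.unpack v)) env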
participants (5)
- Brandon Allbery
- Graham Klyne
- Jason Dusek
- Jeremy Shaw
- Joey Hess