
Ok, here is what I have found out so far. First, I tested three html generation libraries to see if they do any escaping on the arguments passed to href (Text.Html, Text.XHtml, and HSP):
{-# OPTIONS -F -pgmFtrhsx #-}
module Main where

import System.IO
import qualified Text.Html as H
import qualified Text.XHtml as X
import HSP
import HSP.Identity
import HSP.HTML

main :: IO ()
main =
    do hSetEncoding stdout utf8
       let nihongo = "日本語"
       putStrLn nihongo
       putStrLn $ H.renderHtml $ H.anchor H.! [H.href nihongo] H.<< (H.toHtml "nihongo")
       putStrLn $ X.renderHtml $ X.anchor X.! [X.href nihongo] X.<< (X.toHtml "nihongo")
       putStrLn $ renderAsHTML $ evalIdentity $ <a href=nihongo>nihongo</a>
The output produced was:
*Main Text.Html System.IO> main
日本語
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 FINAL//EN">
<!--Rendered using the Haskell Html Library v0.2-->

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
So, none of them attempted to convert the String into a valid URL. The XHtml library did make an attempt to encode the string, but that encoding does not really make it a valid URL. (And the other two utf-8 encoded the string, because they utf-8 encoded the whole document -- which is the correct thing to do).
The behavior of these libraries seems correct -- if they attempted to do more url encoding, I think that would just make things worse.
Next there is the question of what you are supposed to do with non-ASCII characters in a URI. This is described in section 2.1 of RFC 2396:
http://www.ietf.org/rfc/rfc2396.txt
The relationship between URI and characters has been a source of confusion for characters that are not part of US-ASCII. To describe the relationship, it is useful to distinguish between a "character" (as a distinguishable semantic entity) and an "octet" (an 8-bit byte). There are two mappings, one from URI characters to octets, and a second from octets to original characters:
URI character sequence -> octet sequence -> original character sequence
A URI is represented as a sequence of characters, not as a sequence of octets. That is because URI might be "transported" by means that are not through a computer network, e.g., printed on paper, read over the radio, etc.
So a URI is a character sequence (drawn from a restricted set of characters that are found in ASCII). A URI does not have a 'binary representation', because it could be transmitted via non-binary forms, such as a business card, etc. It is the characters that matter. A URI that has been utf-8 encoded and a URI that has been utf-16 encoded are still the same URI, because the characters represented by those encodings are the same.
So, there is actually another little piece missing in that sequence when data is transmitted via the computer. Namely, extracting the URI from the raw octets.
raw octets for uri -> URI character sequence -> octet sequence -> original character sequence
For example, let's pretend a web page was sent as:

Content-type: text/html; charset=utf-32
The utf-32 octets representing the URI must first be decoded to characters (aka the URI character sequence). That seems outside the scope of URLT -- that stage of decoding should be done before URLT gets the data, because it requires looking at the HTTP headers, the meta http-equiv tag, etc. Next we can convert the URI character sequence into a new sequence of octets representing 8-bit encoded data. That is done by converting normal ascii characters to their 8-bit ascii equivalents, and by converting % encoded values to their equivalent 8-bit values. So the character 'a' in the URI would be converted to 0x61, and the sequence %35 would be converted to 0x35. Finally, that octet sequence is decoded to the original character sequence.
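Concretely, here is a quick GHCi sketch of those last two steps, using unEscapeString from Network.URI and decodeString from the utf8-string package (and assuming, for now, that the octets in the middle are utf-8 -- more on that below):

ghci> unEscapeString "%E6%97%A5%E6%9C%AC%E8%AA%9E"
"\230\151\165\230\156\172\232\170\158"
ghci> putStrLn (decodeString (unEscapeString "%E6%97%A5%E6%9C%AC%E8%AA%9E"))
日本語

The first step turns the URI character sequence into octets (represented here as a String of 8-bit Chars); the second decodes those octets back to the original characters.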
There are a few things that make this tricky.
1. The encoding of the octet sequence in the middle is not specified in the URI. So when you are converting back to the original character sequence, you don't know if the octet sequence represents ascii, utf-8, or something else.
2. normalization and reserved characters
Every character *can* be percent encoded, though you are only supposed to percent encode a limited set. URL normalization dictates that the following three URIs are equivalent:
http://example.com:80/~smith/home.html
http://EXAMPLE.com/%7Esmith/home.html
http://EXAMPLE.com:/%7esmith/home.html
The %7E and ~ are equal, because ~ is *not* a reserved character. But
/foo/bar/baz/
/foo%2Fbar/baz/
are *not* equal because / is a reserved character.
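In terms of Network.URI, a quick GHCi sketch of the difference:

ghci> unEscapeString "%7E"
"~"
ghci> unEscapeString "/foo%2Fbar/baz/"
"/foo/bar/baz/"

Decoding %7E is always safe, but decoding %2F before the path has been split on '/' silently changes the structure of the path.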
RFC3986 has this to say about when to encode and decode:
2.4. When to Encode or Decode
Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.
When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters. The only exception is for percent-encoded octets corresponding to characters in the unreserved set, which can be decoded at any time. For example, the octet corresponding to the tilde ("~") character is often encoded as "%7E" by older URI processing implementations; the "%7E" can be replaced by "~" without changing its interpretation.
Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
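A tiny GHCi illustration of why decoding the same string twice goes wrong:

ghci> unEscapeString "%2541"
"%41"
ghci> unEscapeString (unEscapeString "%2541")
"A"

The data here is the four characters %41; decoding a second time misinterprets them as a percent-encoding of 'A'.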
It also has this to say about encoding Unicode data:
When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding [STD63]; then only those octets that do not correspond to characters in the unreserved set should be percent-encoded. For example, the character A would be represented as "A", the character LATIN CAPITAL LETTER A WITH GRAVE would be represented as "%C3%80", and the character KATAKANA LETTER A would be represented as "%E3%82%A2".
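If I am reading that right, we can reproduce the RFC's examples with escapeURIString and isUnreserved (from Network.URI) plus encodeString (from utf8-string):

ghci> escapeURIString isUnreserved (encodeString "A")
"A"
ghci> escapeURIString isUnreserved (encodeString "À")
"%C3%80"
ghci> escapeURIString isUnreserved (encodeString "ア")
"%E3%82%A2"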
I can't find an official stamp of approval, but I believe the http scheme now specifies that the octets in the middle step are utf-8 encoded.
So, here is a starting example of what I think should happen for encoding, and then decoding.
1. We start with a list of path components, e.g. ["foo/bar", "baz"]
2. We then convert each component to a String containing the utf-8 encoded octets (a String, not a ByteString)
3. We percent encode everything that is not an unreserved character
4. We add the delimiters
We now have a proper URI. Note that we have a String and that the URI is made up of the characters in that String. The final step happens when the URI is actually used:
5. The URI is inserted into an HTML document (etc). The document is then encoded according to whatever encoding the document is supposed to have (could be anything), converting the URI into some encoding.
So a URI is actually encoded twice. We use a similar process to decode the URI. Here is some code that does what I described:
import Codec.Binary.UTF8.String (encodeString, decodeString)
import Network.URI
import System.FilePath.Posix (joinPath, splitDirectories)

encodeUrl :: [String] -> String
encodeUrl paths =
    let step1 = map encodeString paths                   -- utf-8 encode the characters in each path component (we have not added any delimiters yet)
        step2 = map (escapeURIString isUnreserved) step1 -- percent encode the characters
        step3 = joinPath step2                           -- add in the delimiters
    in step3

decodeUrl :: String -> [String]
decodeUrl str =
    let step1 = splitDirectories str      -- split the path on the delimiters
        step2 = map unEscapeString step1  -- decode any percent encoded characters
        step3 = map decodeString step2    -- decode the utf-8 octets
    in step3

f = encodeString "日本語"

test =
    let p = ["foo/bar", "日本語"]
        e = encodeUrl p
        d = decodeUrl e
    in (d == p, p, e, d)
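If I have worked through the encoding by hand correctly, evaluating test should give (writing it readably; GHCi's show will actually print the non-ascii characters in escaped form):

(True, ["foo/bar","日本語"], "foo%2Fbar/%E6%97%A5%E6%9C%AC%E8%AA%9E", ["foo/bar","日本語"])

Note that the '/' embedded in "foo/bar" comes out as %2F, so it round-trips without being confused with the path delimiter.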
The problem with using [String] is that it assumes the only delimiter we care about is '/'. But we might also want to deal with the other delimiters, such as ':', '#', and '?'. (For example, if we want to use the urlt system to generate the query string as well as the path...) But [String] does not give us a way to do that. Instead, it seems like we would need a type that allows us to specify the path, the query string, the fragment, etc. -- namely, a real URI type? Perhaps there is something on hackage we can leverage.
I think that having each individual set of toUrl / fromUrl functions deal with the encoding / decoding is not a good way to go. It makes it too easy to get it wrong. Having it all done correctly in one place makes life easier for people adding new instances or methods of generating instances.
I think that urlt dealing with ByteString or [ByteString] is never the right thing. The only time that the URI is a 'byte string' is when it is encoded in an html document, or encoded in the http headers. But at the URLT level, we don't know what encoding that is. Instead we want the bytestring decoded, and we want to receive a 'URI character sequence.' Or we want to give a 'URI character sequence' to the html library, and let it worry about the encoding of the document.
At present, I think I am still ok with the fromURL and toURL functions producing and consuming String values. But, what we need is an intermediate URL type like:
data URL = URL { paths :: [String], queryString :: String, frag :: String }
and functions encodeURL :: URL -> String and decodeURL :: String -> URL that do the encoding and decoding properly.
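As a very rough sketch (not a final design), encodeURL might reuse the encodeUrl helper from above; the names and the flat-String query are just placeholders for now:

encodeURL :: URL -> String
encodeURL (URL ps q f) =
       encodeUrl ps
    ++ (if null q then "" else '?' : escapePart q)
    ++ (if null f then "" else '#' : escapePart f)
  where
    -- utf-8 encode, then percent encode everything that is not unreserved.
    -- note: this would also escape '=' and '&' inside the query string,
    -- which suggests the query string really wants a structured type too.
    escapePart = escapeURIString isUnreserved . encodeString

decodeURL would do the reverse: split off the fragment and the query string first, and only then percent decode and utf-8 decode each part.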
The AsURL class would look like:
class AsURL u where
    toURLC   :: u -> URL
    fromURLC :: URL -> Failing u
instance AsURL URL where
    toURLC   = id
    fromURLC = Success
And then toURL / fromURL would be like:
toURL :: (AsURL u) => u -> String
toURL = encodeURL . toURLC

fromURL :: (AsURL u) => String -> Failing u
fromURL = fromURLC . decodeURL
The Strings in the URL type would not require any special encoding/decoding. The encoding / decoding would be handled by the encodeURL / decodeURL functions.
In other words, when the user creates a URL type by hand, they do not have to know anything about url encoding rules, it just happens like magic. That should make it much easier to write AsURL instances by hand.
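For instance, a hand-written instance for a made-up sitemap type might look like this (SiteMap is hypothetical, and I am assuming Failing is the usual Success / Failure type from applicative-extras):

-- a hypothetical sitemap type, just for illustration
data SiteMap = Home | Article String

instance AsURL SiteMap where
    toURLC Home        = URL []             "" ""
    toURLC (Article t) = URL ["article", t] "" ""  -- t may freely contain '/', '?', 日本語, ...

    fromURLC (URL []             _ _) = Success Home
    fromURLC (URL ["article", t] _ _) = Success (Article t)
    fromURLC _                        = Failure ["not a valid SiteMap URL"]

Notice that the instance never mentions percent encoding or utf-8 at all.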
Does this make sense to you?
The key now is seeing if someone has already created a suitable URL type that we can use...