Re: Adding Network.URI.escape

4 Jan 2010

On Fri, Dec 25, 2009 at 4:17 PM, Graham Klyne <GK-lists@ninebynine.org> wrote:
...
Gwern Branwen wrote:
...
Network.URI.escapeURIString is pretty much always used to make a
String a URL or a part of a URL.
The existing definition
http://www.haskell.org/ghc/docs/6.10.4/html/libraries/network/Network-URI.ht...
forces one to do extra work by having to specify a `Char -> Bool`.
More than a few packages & libraries simply define an 'escape'
function `escapeURIString isAllowedInURI` (either inline or as a named
function). This sort of repetition is unfortunate.
Hmmm... I think that's not strictly correct - it should be 'escapeURIString
isUnescapedInURI'.  The form used above would leave literal '%' characters
unescaped.
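
To make the difference concrete, here is a minimal sketch, assuming the Network.URI module (shipped today in the network-uri package) is available. The names escapeLoose and escapeStrict are hypothetical, chosen only for this illustration.

```haskell
import Network.URI (escapeURIString, isAllowedInURI, isUnescapedInURI)

-- The commonly copied definition: '%' counts as allowed,
-- so a literal percent sign passes through untouched.
escapeLoose :: String -> String
escapeLoose = escapeURIString isAllowedInURI

-- The corrected form: '%' is not in the unescaped set,
-- so a literal percent sign is encoded as %25.
escapeStrict :: String -> String
escapeStrict = escapeURIString isUnescapedInURI
```

With these, escapeLoose "100% sure" should give "100%%20sure", which is not a valid percent-encoding, while escapeStrict "100% sure" should give "100%25%20sure".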
That's unfortunate! But it also takes care of a long-niggling worry -
I had come across an old #haskell log where someone said that that
definition is wrong, but they didn't explain how. I guess I ought to
go around to every user of that definition, like Gitit, and correct
them...
...
...
The name 'escape' is commonly used to express exactly that
functionality: http://holumbus.fh-wedel.de/hayoo/hayoo.html#0:escape
What would people say to adding such a function?
The reason that 'escapeURIString' always takes the Char -> Bool function
is that the rules for escaping can vary between URI schemes, and between
components within a single URI.  For example, a literal '/' or '?' appearing
within a path segment in an http: URI would need to be escaped, but that's
not covered by the common case of 'escapeURIString isUnescapedInURI'.
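
As a rough illustration of the path-segment case, the sketch below passes a stricter predicate. It assumes Network.URI from the network-uri package; escapeSegment is a hypothetical name, and using isUnreserved is one conservative choice rather than the only correct one, since it also encodes characters such as '!' or '$' that RFC 3986 permits inside a segment.

```haskell
import Network.URI (escapeURIString, isUnescapedInURI, isUnreserved)

-- Whole-URI escaping: reserved delimiters such as '/' and '?'
-- are left alone, because they may be structural.
escapeWhole :: String -> String
escapeWhole = escapeURIString isUnescapedInURI

-- Hypothetical helper for a single path segment: only RFC 3986
-- unreserved characters (letters, digits, '-', '.', '_', '~')
-- survive, so '/' and '?' are percent-encoded.
escapeSegment :: String -> String
escapeSegment = escapeURIString isUnreserved
```

Here escapeWhole "a/b?c" should return the input unchanged, whereas escapeSegment "a/b?c" should return "a%2Fb%3Fc".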
The 'isAllowedInURI' function, IIRC, is a kind of least-common-denominator
function that causes non-URI characters to be escaped so that the resulting
string is at least syntactically valid according to RFC 3986.  But in some
cases (i.e. for some schemes) this may not be enough - see RFC 3986, section
2.1 ("A percent-encoding mechanism is used to represent a data octet in a
component when that octet's corresponding character is outside the allowed
set or is being used as a delimiter of, or within, the component" --
http://www.apps.ietf.org/rfc/rfc3986.html#sec-2.1 ); see also section 2.4.
So, while one could define an additional function as you suggest, I'm not
sure it is necessarily wise, because having the explicit function to
designate characters to be escaped does at least draw attention to exactly
which characters would be escaped in the context of use.  But OTOH, if
implementations tend to use 'escapeURIString isAllowedInURI' as you say,
maybe this just creates an opportunity for additional errors.
URI escaping is, to some extent, a necessarily messy and error-prone
business - it's really hard to define a generic escaping mechanism that
neatly covers all eventualities, because of the multiple stages of
interpretation that can take place when actually using a URI.
#g
Thanks for the information; I start to see what you mean by the
difficulty. But as you say, while an 'escape' may be dangerous, it's
not like people are being safe now without an 'escape'.

Is it possible to identify the most common escaping scenarios and come
up with the correct shortcuts?

For example, perhaps we could define an 'escapeURL = escapeURIString
isUnescapedInURI' which is suitable for garden-variety tasks like
`"http://gitit.net"++escapeURL pagename`, and then another for the
octets you mention ('escapeOctet'?).
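
A minimal sketch of what that first shortcut might look like in use, assuming Network.URI from the network-uri package. pageURL is a hypothetical helper, and the "/" separator after the host is an assumption added for the example.

```haskell
import Network.URI (escapeURIString, isUnescapedInURI)

-- The proposed shortcut, exactly as suggested above.
escapeURL :: String -> String
escapeURL = escapeURIString isUnescapedInURI

-- Hypothetical helper: build a page URL from a page name.
-- The "/" separator is an assumption, not part of the proposal.
pageURL :: String -> String
pageURL pagename = "http://gitit.net/" ++ escapeURL pagename
```

pageURL "Front Page" should yield "http://gitit.net/Front%20Page" and pageURL "50% off" should yield "http://gitit.net/50%25%20off". A name such as "C#" would come out as "http://gitit.net/C#", with the '#' left to be read as a fragment delimiter, so this shortcut would probably not be enough on its own for names containing reserved characters.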

-- 
gwern