Re: behaviour change in getDirectoryContents in GHC 7.2?

9 Nov 2011

On 09/11/2011 10:39, Max Bolingbroke wrote:
...
On 8 November 2011 11:43, Simon Marlow wrote:
...
Don't you mean 1 is what we have?
Yes, sorry!
...
Failing to roundtrip in some cases, and doing so silently, seems highly
suboptimal to me.  I'm sorry I didn't pick up on this at the time (Unicode
is a swamp :).
I *can* change the implementation back to using lone surrogates. This
gives us guaranteed roundtripping but it means that the user might see
lone-surrogate Char values in Strings from the filesystem/command
line. IIRC this does break some software -- e.g. Brian's "text"
library explicitly checks for such characters and fails if it detects
them.
So whatever happens we are going to end up making some group of users unhappy!
   * No PEP383: Haskellers using non-ASCII get upset when their command
line argument [String]s aren't in fact sequences of characters, but
sequences of bytes in some arbitrary encoding
   * PEP383(surrogates): Unicoders get upset by lone surrogates (which
can actually occur at the moment, independent of PEP383 -- e.g. as
character literals or from FFI)
   * PEP383(private chars): Unixers get upset that we can't roundtrip
byte sequences that look like the codepoint 0xEFXX encoded in the
current locale. In practice, 0xEFXX is only decodable from a UTF
encoding, so we fail to roundtrip byte sequences like the one Ian
posted.
I'm happy to implement any behaviour, I would just like to know that
whatever it is is accepted as the correct tradeoff :-)
I would be happy with the surrogate approach, I think.  Arguably, if you 
try to treat a string with lone surrogates as Unicode and it fails, then 
that is a feature: the original string wasn't Unicode.  All you can do 
with an invalid Unicode string is use it as a FilePath again, and the 
right thing will happen.
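
To illustrate the point above, here is a minimal sketch in Python, whose PEP 383 "surrogateescape" handler is the model for this discussion. It is not GHC's implementation, and the sample file name is made up. A string carrying a lone surrogate fails a strict Unicode encode, but encoding it with the same handler gives back the original bytes.

```python
# Hypothetical file name: Latin-1 bytes that are not valid UTF-8.
raw = b"caf\xe9.txt"

# PEP 383 decoding: the undecodable byte 0xE9 becomes the lone
# surrogate U+DCE9 instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")

# Treating the result as ordinary Unicode fails, which signals that
# the original bytes were not valid text in this encoding.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Using it as a file name again (same handler) restores the bytes.
roundtrip = name.encode("utf-8", "surrogateescape")
```

Here `roundtrip == raw` holds while `strict_ok` is `False`, which is the "failure is a feature" behaviour described above.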

Alternatively if we stick with the private char approach, it should be 
possible to have an escaping scheme for 0xEFxx characters in the input 
that would enable us to roundtrip correctly.  That is, escape 0xEFxx 
into a sequence 0xYYEF 0xYYxx for some suitable YY.  But perhaps that 
would be too expensive - an extra translation pass over the buffer after 
iconv (well, we do this for newline translation, so maybe it's not too bad).
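
A rough sketch of such an escaping pass, in Python for illustration only. It is a simplified variant, not the exact 0xYYEF 0xYYxx pairing: it assumes UTF-8 input and a single hypothetical escape marker U+EF00, which an undecodable byte can never map to because bytes below 0x80 always decode. The function names are invented for this example.

```python
ESCAPE = "\uef00"  # hypothetical marker: the next character is a literal


def decode_private(raw: bytes) -> str:
    """Decode UTF-8, mapping each undecodable byte b to U+EF00 + b."""
    out = []
    # surrogateescape is used only as a temporary tag for bad bytes.
    for ch in raw.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Undecodable byte: represent it as a private-use character.
            out.append(chr(0xEF00 + (cp - 0xDC00)))
        elif 0xEF00 <= cp <= 0xEFFF:
            # Genuine U+EFxx in the input: prefix it with the marker so
            # it cannot be mistaken for an escaped byte.
            out.append(ESCAPE + ch)
        else:
            out.append(ch)
    return "".join(out)


def encode_private(text: str) -> bytes:
    """Inverse of decode_private."""
    out = bytearray()
    chars = iter(text)
    for ch in chars:
        cp = ord(ch)
        if ch == ESCAPE:
            # The following character is a literal; a trailing marker
            # with nothing after it is emitted as U+EF00 itself.
            out += next(chars, ESCAPE).encode("utf-8")
        elif 0xEF80 <= cp <= 0xEFFF:
            # Unescaped private-use character: restore the raw byte.
            out.append(cp - 0xEF00)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


# One invalid byte (0xE9), one genuine U+EF41 (EE BD 81), one invalid 0xFF.
sample = b"caf\xe9 \xee\xbd\x81 \xff"
restored = encode_private(decode_private(sample))
```

With this variant `restored == sample`, including for the genuine U+EF41. The cost is the extra pass over the decoded buffer mentioned above.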
...
RE exposing a ByteString based interface to the IO library from
base/unix/whatever: AFAIK Python doesn't do this, and just tells
people to use the (x.encode(sys.getfilesystemencoding(),
"surrogateescape")) escape hatch, which is what I've been
recommending. I think this would be more satisfying to John if it were
actually guaranteed to work on arbitrary byte sequences, not just
*highly likely* to work :-)
The performance overhead of all this worries me.  withCString has taken 
a huge performance hit, and I think there are people who want to know 
that there aren't several complex encoding/decoding passes between their 
Haskell code and the POSIX API.  We ought to be able to program to POSIX 
directly, and the same goes for Win32.

Cheers,
	Simon