getting crazy with character encoding

Andrea Rossato

12 Sep 2007 12 Sep '07

2:18 p.m.

Hi, supposed that, in a Linux system, in an utf-8 locale, you create a file with non ascii characters. For instance: touch abèèè Now, I would expect that the output of a shell command such as "ls ab*" would be a string/list of 5 chars. Instead I find it to be a list of 8 chars...;-) That is to say, each non ascii character is read as 2 characters, as if the string were an ISO-8859-1 string - the string is actually treated as an ISO-8859-1 string. But when I print it, now it is displayed correctly. I don't understand what's wrong and, this is worse, I don't understand what I should be studying to understand what I'm doing wrong. After reading about character encoding, the way the linux kernel manages file names, I would expect that a file name set in an utf-8 locale should be read by locale aware application as an utf-8 string, and each character a unicode code point which can be represented by a Haskell char. What's wrong with that? Thanks for your kind attention. Andrea Here the code to test my problem. Before creating the file remember to set the LANG environmental variable. Something like: export LANG="en_US.utf8" should be fine. (Check your available locales with "locale -a") import System.Process import System.IO import Control.Monad main = do l <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*" putStrLn (show l) mapM_ putStrLn l mapM_ (putStrLn . show . length) l runProcessWithInput cmd args input = do (pin, pout, perr, ph) <- runInteractiveProcess cmd args Nothing Nothing hPutStr pin input hClose pin output <- hGetContents pout when (output==output) $ return () hClose pout hClose perr waitForProcess ph return output

Show replies by date

Stephane Bortzmeyer

12 Sep 12 Sep

2:35 p.m.

On Wed, Sep 12, 2007 at 04:18:43PM +0200, Andrea Rossato wrote a message of 60 lines which said:

...

Now, I would expect that the output of a shell command such as "ls ab*" would be a string/list of 5 chars.

I do not think this expectation is reasonable. I do not think that ls is Unicode-aware. It probably has only bytes semantic, not characters semantic.

...

I would expect that a file name set in an utf-8 locale should be read by locale aware application

"locale aware application" is too vague. An application can use the locale and still being unable to separate bytes from characters. ls may be "locale aware" but it is probably not "Unicode aware".

...

l <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*"

This is not an Haskell issue but a ls issue. use System.Directory.getDirectoryContents and we'll see.

Andrea Rossato

3 p.m.

On Wed, Sep 12, 2007 at 04:35:50PM +0200, Stephane Bortzmeyer wrote:

...

This is not an Haskell issue but a ls issue. use System.Directory.getDirectoryContents and we'll see.

I get the very same output. Thanks for you kind attention. Andrea

Brandon S. Allbery KF8NH

2:53 p.m.

On Sep 12, 2007, at 10:18 , Andrea Rossato wrote:

...

supposed that, in a Linux system, in an utf-8 locale, you create a file with non ascii characters. For instance: touch abèèè

Now, I would expect that the output of a shell command such as "ls ab*" would be a string/list of 5 chars. Instead I find it to be a list of 8 chars...;-)

That is expected. The low level filesystem storage doesn't know about character sets, so non-ASCII filenames must be encoded in e.g. UTF-8. 8 characters is therefore correct, and you must do UTF-8 decoding on input because Haskell does not do so automatically. This will also be true with getdirent() aka getDirectoryContents. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Andrea Rossato

3:19 p.m.

On Wed, Sep 12, 2007 at 10:53:29AM -0400, Brandon S. Allbery KF8NH wrote:

...

That is expected. The low level filesystem storage doesn't know about character sets, so non-ASCII filenames must be encoded in e.g. UTF-8. 8 characters is therefore correct, and you must do UTF-8 decoding on input because Haskell does not do so automatically.

Ahh, now I eventually get it! So, as far as I understand, I'm getting bytes that are automatically translated into an iso-8859-1 string, if I'm correctly reading this old post by Glynn: http://tinyurl.com/2fhl43 And so it's my job to convert it in what I need. Luckily I've just discovered (and now I'm reading) some of John Meacham's code on locale. This is going to be very helpful (unfortunately I don't see Licenses coming with HsLocale, but if I'm reading correctly there is something like this in Riot - and this was BSD3 released). Thanks for your kind attention. Andrea

John Meacham

14 Sep 14 Sep

12:47 a.m.

On Wed, Sep 12, 2007 at 05:19:22PM +0200, Andrea Rossato wrote:

...

And so it's my job to convert it in what I need. Luckily I've just discovered (and now I'm reading) some of John Meacham's code on locale. This is going to be very helpful (unfortunately I don't see Licenses coming with HsLocale, but if I'm reading correctly there is something like this in Riot - and this was BSD3 released).

it is BSD3. in general, pretty much everything I write is BSD3 except for large projects as a whole which get GPL>=2. Though I am more than happy to BSD3 any incidentally useful parts of my projects that others would find useful. John -- John Meacham - ⑆repetae.net⑆john⑈

Seth Gordon

12 Sep 12 Sep

3:16 p.m.

Andrea Rossato wrote:

...

Hi,

supposed that, in a Linux system, in an utf-8 locale, you create a file with non ascii characters. For instance: touch abèèè

Now, I would expect that the output of a shell command such as "ls ab*" would be a string/list of 5 chars. Instead I find it to be a list of 8 chars...;-)

The file name may have five *characters*, but if it's encoded as UTF-8, then it has eight *bytes*. It appears that in spite of the locale definition, hGetContents is treating each byte as a separate character without translating the multi-byte sequences *from* UTF-8, and then putStrLn sends each of those bytes to standard output without translating the non-ASCII characters *to* UTF-8. So the second line of your program's output is correct...but only by accident. Futzing around a little bit in ghci, I see that I can define a string "\1488", but if I send that string to putStrLn, I get nothing, when I should get א (the Hebrew letter aleph). I � Unicode.

Dougal Stanton

3:24 p.m.

On 12/09/2007, Seth Gordon wrote:

...

I � Unicode.

Was it intentional that the central character appears as a little '?', even though the aleph on the line above worked? Either way it would be very amusing, but for different reasons... D

Seth Gordon

3:43 p.m.

Dougal Stanton wrote:

...

On 12/09/2007, Seth Gordon wrote:

...
I � Unicode.

Was it intentional that the central character appears as a little '?', even though the aleph on the line above worked?

It was intentional. If I ♡ed Unicode, I would have said so.

Andrea Rossato

3:26 p.m.

On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:

...

It appears that in spite of the locale definition, hGetContents is treating each byte as a separate character without translating the multi-byte sequences *from* UTF-8, and then putStrLn sends each of those bytes to standard output without translating the non-ASCII characters *to* UTF-8. So the second line of your program's output is correct...but only by accident.

that's it indeed. As I said in the message I've just sent, I've read that the String/CString conversion is automatically done in ISO-8859-1, so "èèè", which are 6 bytes in utf-8, are translated into 6 iso-8859-1 characters. What puzzles me is the behavior of putStrLn. Thanks for your time. Andrea

Seth Gordon

3:40 p.m.

Andrea Rossato wrote:

...

What puzzles me is the behavior of putStrLn.

putStrLn is sending the following bytes to standard output: 97, 98, 195, 168, 195, 168, 195, 168, 10 Since the code that renders characters in your terminal emulator is expecting UTF-8[*], each (195, 168) pair of bytes is rendered as "è". The Unix utility "od" can be very helpful in figuring out problems like this. [*]At least on my computer, I get the same result *even if* I change "LANG" from "en_US.utf8" to "C".

Andrea Rossato

5:08 p.m.

On Wed, Sep 12, 2007 at 11:40:11AM -0400, Seth Gordon wrote:

...

The Unix utility "od" can be very helpful in figuring out problems like this.

Thanks for pointing me to "od", I didn't know it.

...

[*]At least on my computer, I get the same result *even if* I change "LANG" from "en_US.utf8" to "C".

As far as I understand it is the terminal emulator responsible for translating the bytes to characters. If I run it in a console I get abA"A"A" (sort of) no matter what my LANG is - 8 single 8 -bit characters. Cheers, Andrea

David Benbennick

6:30 p.m.

On 9/12/07, Andrea Rossato wrote:

...

If I run it in a console I get abA"A"A" (sort of) no matter what my LANG is - 8 single 8 -bit characters.

It's possible to set your Linux console to grok UTF8. I don't remember the details, but I'm sure you can Google for it. By the way, does anyone know The Right Way to deal with UTF-8 in Haskell? I.e., take that 8 byte UTF-8 string and convert it to a 5 character Unicode string (so it can be manipulated)?

Jules Bean

7:26 p.m.

David Benbennick wrote:

...

On 9/12/07, Andrea Rossato wrote:

...
If I run it in a console I get abA"A"A" (sort of) no matter what my LANG is - 8 single 8 -bit characters.

It's possible to set your Linux console to grok UTF8. I don't remember the details, but I'm sure you can Google for it.

By the way, does anyone know The Right Way to deal with UTF-8 in Haskell? I.e., take that 8 byte UTF-8 string and convert it to a 5 character Unicode string (so it can be manipulated)?

There is no UTF8 decode support in the standard libraries. There are some contributed libraries which can do it. Data.CompactString is one. Jules

Don Stewart

7:51 p.m.

mailing_list:

...

On Wed, Sep 12, 2007 at 11:16:25AM -0400, Seth Gordon wrote:

...
It appears that in spite of the locale definition, hGetContents is treating each byte as a separate character without translating the multi-byte sequences *from* UTF-8, and then putStrLn sends each of those bytes to standard output without translating the non-ASCII characters *to* UTF-8. So the second line of your program's output is correct...but only by accident.

that's it indeed. As I said in the message I've just sent, I've read that the String/CString conversion is automatically done in ISO-8859-1, so "èèè", which are 6 bytes in utf-8, are translated into 6 iso-8859-1 characters.

What puzzles me is the behavior of putStrLn.

Thanks for your time.

Have you tried the utf8-string conversion library? http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string-0.1 -- Don

Aaron Denney

13 Sep 13 Sep

12:23 a.m.

Hi. I believe that everything I've said has been said by another responder, but not all together in one place. On 2007-09-12, Andrea Rossato wrote:

...

supposed that, in a Linux system, in an utf-8 locale, you create a file with non ascii characters. For instance: touch abÃ¨Ã¨Ã¨

Now, I would expect that the output of a shell command such as "ls ab*" would be a string/list of 5 chars. Instead I find it to be a list of 8 chars...;-)

That is to say, each non ascii character is read as 2 characters, as if the string were an ISO-8859-1 string - the string is actually treated as an ISO-8859-1 string. But when I print it, now it is displayed correctly.

The Linux kernel doesn't really have a notion of characters, only bytes in its interfaces. (This isn't strictly true: it needs to in some cases when it's interacting with other systems, but it's 99% true.) In the UTF-8 representation of these 5 characters are 8 bytes, as indeed each non-ASCII character takes two bytes. The various C runtimes do have some notion of various character sets, and locales, and so forth, and build on top of the byte interface to represent characters. But not all programs use these. Your example of ls just takes the bytes from the kernel, and perhaps does some minimal sanitizing (munging control codes) before sending them to the tty. If the terminal understands UTF-8, everything works great. On the other hand, GHC's runtime always interprets these bytes as meaning the characters in ISO-8859-1 (this just takes the bytes to the unicode code points), and does not pay attention to locale settings such as LC_CHARSET, etc. While this has some nice properties (totally invertible, no code to maintain (as the first 256 code points of Unicode are ISO-8859-1), etc.), personally, I think this is a bug. The Haskell standard talks about characters, not bytes, and the characters read and written should correspond to the native environment notions and encodings. These are, under Unix, determined by the locale system. Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars), and there are in fact external libraries using lower level interfaces, rather than the things like putStr, getLine, etc. that do this. An external library works, of course, but it should be part of the standard so implementors know that character based routines actually are character based, not byte based.

...

After reading about character encoding, the way the linux kernel manages file names, I would expect that a file name set in an utf-8 locale should be read by locale aware application as an utf-8 string, and each character a unicode code point which can be represented by a Haskell char. What's wrong with that?

That's a reasonable assumption. The problem is that GHC doesn't support locales. But byte-sequences do round-trip, as long as you don't try to process them, so not as much breaks as one might think. I don't know what NHC and hugs do, though I assume they also provide no translations. I'm also not sure what JHC does, though I do see mentions of UTF-8, UTF-16 (for windows), and UTF-32 (for internal usage of C libraries), and I do know that John is fairly careful about locale issues. -- Aaron Denney -><-

Stefan O'Rear

12:40 a.m.

On Thu, Sep 13, 2007 at 12:23:33AM +0000, Aaron Denney wrote:

...

Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars), and there are in fact external libraries using lower level interfaces, rather than the things like putStr, getLine, etc. that do this. An external library works, of course, but it should be part of the standard so implementors know that character based routines actually are character based, not byte based. ... I don't know what NHC and hugs do, though I assume they also provide no translations. I'm also not sure what JHC does, though I do see mentions of UTF-8, UTF-16 (for windows), and UTF-32 (for internal usage of C libraries), and I do know that John is fairly careful about locale issues.

I'm pretty sure Hugs does the right thing. NHC is probably broken. In any case, we already have hGetBuf / hPutBuf in the standard base libaries for raw binary IO, so code that uses getChar for bytes really has no excuse. We can and should fix the bug. Stefan

Bulat Ziganshin

9:13 a.m.

New subject: Re[2]: Re: getting crazy with character encoding

Hello Stefan, Thursday, September 13, 2007, 4:40:17 AM, you wrote:

...

I'm pretty sure Hugs does the right thing. NHC is probably broken. In any case, we already have hGetBuf / hPutBuf in the standard base libaries for raw binary IO, so code that uses getChar for bytes really has no excuse. We can and should fix the bug.

are you ever heard about backward compatibility? :/ -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Ketil Malde

10:06 a.m.

On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote:

...

On Thu, Sep 13, 2007 at 12:23:33AM +0000, Aaron Denney wrote:

...
Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars),

...

I'm pretty sure Hugs does the right thing.

..which makes me wonder what the right thing actually is? Since IO on Unix (or at least on Linux) consists of bytes, I don't see how a Unicode-only interface is ever going to do the 'right thing' for all people. One possible solution might be to have IO functions deal with [Word8] instead of [Char]. If string and character constants were polymorphic, Char and String made aliases for byte-based types, and a new type introduced for Unicode characters, it might even be possible to fix without breaking absolutely all legacy code. But even this would probably only fix the Unix side of things. -k

Stefan O'Rear

1:49 p.m.

On Thu, Sep 13, 2007 at 12:06:15PM +0200, Ketil Malde wrote:

...

On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote:

...
On Thu, Sep 13, 2007 at 12:23:33AM +0000, Aaron Denney wrote:

...
Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars),

...
I'm pretty sure Hugs does the right thing.

..which makes me wonder what the right thing actually is?

Since IO on Unix (or at least on Linux) consists of bytes, I don't see how a Unicode-only interface is ever going to do the 'right thing' for all people.

I never said it was Unicode-only. hGetBuf / hPutBuf - Raw Word8 access getChar etc - Uses locale info Stefan

David Roundy

2:39 p.m.

On Thu, Sep 13, 2007 at 06:49:59AM -0700, Stefan O'Rear wrote:

...

On Thu, Sep 13, 2007 at 12:06:15PM +0200, Ketil Malde wrote:

...
On Wed, 2007-09-12 at 17:40 -0700, Stefan O'Rear wrote:

...
On Thu, Sep 13, 2007 at 12:23:33AM +0000, Aaron Denney wrote:

...
Unfortunately, at this point it is a well entrenched bug, and changing the behaviour will undoubtedly break programs. ... There should be another system for getting the exact bytes in and out (as Word8s, say, rather than Chars),

...
I'm pretty sure Hugs does the right thing.

..which makes me wonder what the right thing actually is?

Since IO on Unix (or at least on Linux) consists of bytes, I don't see how a Unicode-only interface is ever going to do the 'right thing' for all people.

I never said it was Unicode-only.

hGetBuf / hPutBuf - Raw Word8 access getChar etc - Uses locale info

The problem is that the type of openFile and getArgs is wrong, so there's no "right" way to get a Handle (other than stdin) to read from in the first place, unless we're willing to allow the current weird behavior of treating a [Char] as [Word8]. -- David Roundy Department of Physics Oregon State University

Aaron Denney

14 Sep 14 Sep

1:39 a.m.

On 2007-09-13, Stefan O'Rear wrote:

...

In any case, we already have hGetBuf / hPutBuf in the standard base libaries for raw binary IO, so code that uses getChar for bytes really has no excuse.

Except, of course, that hGetBuf and hPutBuf are (a) allocating the memory for the buffers is a pain (does it require the FFI?) (b) are something of a pain to use, requiring explicitly managing what's valid in these buffers (though a wrapper only need be written once) (c) while in the "standard base libraries" are not in the report or library report. i.e. there's no guarantee that a conforming Haskell implementation will have them. It'd be silly for an implementation to not support them, of course, but... The ByteString library at least fixes (a) and (b). -- Aaron Denney -><-

Stephane Bortzmeyer

13 Sep 13 Sep

9:07 a.m.

On Thu, Sep 13, 2007 at 12:23:33AM +0000, Aaron Denney wrote a message of 76 lines which said:

...

the characters read and written should correspond to the native environment notions and encodings. These are, under Unix, determined by the locale system.

Locales, while fine for things like the language of the error messages or the format to use to display the time, are *not* a good solution for things like file names and file contents. Even on a single Unix machine (without networking), there are *several* users. Using the locale to find out the charset used for a file name won't work if these users use different locales. Same thing for file contents. The charset used must be marked in the file (XML...) or in the metadata, somehow. Otherwise, there is no way to exchange files or even to change the locale (if I switch from Latin1 to UTF-8, what do my files become?)

Andrea Rossato

9:40 a.m.

On Thu, Sep 13, 2007 at 11:07:03AM +0200, Stephane Bortzmeyer wrote:

...

On Thu, Sep 13, 2007 at 12:23:33AM +0000, Aaron Denney wrote a message of 76 lines which said:

...
the characters read and written should correspond to the native environment notions and encodings. These are, under Unix, determined by the locale system.

Locales, while fine for things like the language of the error messages or the format to use to display the time, are *not* a good solution for things like file names and file contents.

Even on a single Unix machine (without networking), there are *several* users. Using the locale to find out the charset used for a file name won't work if these users use different locales.

Yes indeed. And I find it a real mess. And I don't see any way out.

...

Same thing for file contents. The charset used must be marked in the file (XML...) or in the metadata, somehow. Otherwise, there is no way to exchange files or even to change the locale (if I switch from Latin1 to UTF-8, what do my files become?)

Ok, you are perfectly right, but we live in an imperfect world and we must come up with a solution. In my case I'm developing this prompt for xmonad and a Chinese user wants directory and file names to be correctly displayed. What else can I do but using locale technologies? This is something I don't know. The code below is not perfect but it works to some extent. Nonetheless, if you have 2 users using an iso-8859-1 locale the first and utf-8 one the second, non ascii characters in file names of the first users will produce invalid character sequences for the second users. The reverse will work, though. I'm still puzzled and still find the thread title appropriate. Thanks for your kind attention. Andrea The locale aware version of the previous code (needs hsc2hs) {-# OPTIONS -fglasgow-exts #-} import Prelude hiding (catch) import System.Process import System.IO import Control.Monad import System.Directory import Foreign import Foreign.C import Data.Char import Control.Exception runProcessWithInput cmd args input = do (pin, pout, perr, ph) <- runInteractiveProcess cmd args Nothing Nothing hPutStr pin input hClose pin output <- hGetContents pout when (output==output) $ return () hClose pout hClose perr waitForProcess ph return output main = do setupLocale l <- fmap lines $ runProcessWithInput "/bin/bash" [] "ls ab*\n" l' <- mapM fromLocale l l'' <- mapM toLocale l' putStrLn (show l') mapM_ putStrLn l'' mapM_ (putStrLn . show . length) l' -- This code comes from John Meacham's HsLocale -- http://repetae.net/john/repos/HsLocale/ toLocale :: String -> IO String toLocale s = catch (stringToBytes s >>= return . map (chr . fromIntegral)) (const $ return "invalid character sequence") fromLocale :: String -> IO String fromLocale s = bytesToString (map (fromIntegral . ord) s) `catch` \_ -> return "invalid character sequence" stringToBytes :: String -> IO [Word8] stringToBytes cs = (withIConv "" "UTF-32" $ \ic -> convertRaw ic cs) bytesToString :: [Word8] -> IO String bytesToString xs = (withIConv "UTF-32" "" $ \ic -> convertRaw ic xs) >>= return . f where f ('\65279':xs) = xs -- discard byte order marker f xs = xs newtype IConv = IConv (#type intptr_t) deriving(Num,Eq,Show) foreign import ccall unsafe "iconv.h iconv_open" iconv_open :: Ptr CChar -> Ptr CChar -> IO IConv foreign import ccall unsafe "iconv.h iconv_close" iconv_close :: IConv -> IO CInt foreign import ccall unsafe "iconv.h iconv" iconv :: IConv -> Ptr (Ptr CChar) -> Ptr CSize -> Ptr (Ptr CChar) -> Ptr CSize -> IO CInt withIConv :: String -> String -> (IConv -> IO a) -> IO a withIConv to from action = bracket open close action where close ic = throwErrnoIfMinus1_ "iconv_close" (iconv_close ic) open = throwErrnoIfMinus1 "iconv_open" iopen iopen = do withCAString to $ \to -> do withCAString from $ \from -> do iconv_open to from convertRaw :: (Storable a, Storable b) => IConv -> [a] -> IO [b] convertRaw ic xs = do with (fromIntegral $ sizeOf (head xs) * length xs) $ \inptrSz -> do withArray xs $ \arr -> do with (castPtr arr) $ \inptr -> do allocaBytes (1024) $ \outptr -> do with outptr $ \outptrptr -> do with 1024 $ \outptrSz -> do let outSz = fromIntegral $ sizeOf $ unsafePerformIO (peek outptr) let go = do ret <- iconv ic inptr inptrSz (castPtr outptrptr) outptrSz err <- getErrno case (ret,err) of (-1,_) | err == e2BIG -> do oz <- peek outptrSz x <- peekArray ((1024 - fromIntegral oz) `div` outSz) (castPtr outptr) poke outptrptr outptr poke outptrSz 1024 y <- go return $ x ++ y (-1,_) -> throwErrno "iconv" (_,_) -> do oz <- peek outptrSz peekArray ((1024 - fromIntegral oz) `div` outSz) outptr go #include foreign import ccall unsafe "locale.h setlocale" setlocale :: CInt -> CString -> IO CString setupLocale :: IO () setupLocale = withCString "" $ \s -> do setlocale (#const LC_ALL) s return ()

Aaron Denney

10:10 a.m.

On Thu, Sep 13, 2007 at 11:07:03AM +0200, Stephane Bortzmeyer wrote:

...

On Thu, Sep 13, 2007 at 12:23:33AM +0000, Aaron Denney wrote a message of 76 lines which said:

...
the characters read and written should correspond to the native environment notions and encodings. These are, under Unix, determined by the locale system.

Locales, while fine for things like the language of the error messages or the format to use to display the time, are *not* a good solution for things like file names and file contents.

I never claimed it was a good system, merely that it was the system. Yes, serious applications should use byte oriented I/O and explicitly manage character sets when necessary. STDIO in general and terminal interaction in particular should use the locale selected by the user.

...

Even on a single Unix machine (without networking), there are *several* users. Using the locale to find out the charset used for a file name won't work if these users use different locales.

Same thing for file contents. The charset used must be marked in the file (XML...) or in the metadata, somehow.

For file system and network access, the justification is a bit more clouded, but the interfaces there _should not_ be character interfaces. Character interfaces are _lies_; Word8s are what actually get passed, and trying to treat them as unicode characters with any fixed mapping breaks. At best we get an extremely leaky abstraction. Filesystems are not uniform across systems, yet Haskell tries to present a uniform view that manages to capture exactly no existing system. File contents (almost) everywhere are streams of bytes (ignoring, say, old record-based OSes, palm databases, and mac resource forks etc.) Almost all file systems use a hierarchical directory system, but with significant differences. Under unixes the names are NUL-terminated bytestrings that can't contain slashes. New Macs and Windows have specific character encodings (UTF-8, and UTF-16, respectively). DOS, old Macs, and windows have multiple roots and various directory seperators and forbidden characters. Trying to specify some API that is usable for robust programs that work on any of these is hard. I'd actually have preferred that the standard didn't even try, and instead provided system-specific annexes. Then an external library that was freer to evolve could try to solve the problem of providing a uniform interface that would not defy platform expectations. -- Aaron Denney -><-

6508

Age (days ago)

6510

Last active (days ago)

List overview

Download

24 comments

14 participants

participants (14)

Aaron Denney
Andrea Rossato
Brandon S. Allbery KF8NH
Bulat Ziganshin
David Benbennick
David Roundy
Don Stewart
Dougal Stanton
John Meacham
Jules Bean
Ketil Malde
Seth Gordon
Stefan O'Rear
Stephane Bortzmeyer