
Here is a very interesting little problem.
    ghci
    GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
    Loading package ghc-prim ... linking ... done.
    Loading package integer-gmp ... linking ... done.
    Loading package base ... linking ... done.
    Prelude> :m System.Process
    Prelude System.Process> runCommand "echo привет"
    ?@825B
This is a minimal test case for a bug reported in HSH at http://github.com/jgoerzen/hsh/issues#issue/1
It is not entirely clear to me what the behavior here should be. It seems inconsistent with the default behavior of System.IO to, apparently, just strip the bits higher than 0xFF. On the other hand, when it's OS commands we're talking about, it's not entirely clear to me if the default should be to encode in UTF-8. There should almost certainly be an *option* controlling this, and perhaps a version of runProcess that accepts ByteStrings.
Thoughts?
-- John
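
The output ?@825B in the session above is consistent with each Char being cut down to its low 8 bits before it reaches the shell. The following is a minimal sketch of that model only. The name truncateChars is hypothetical, and this is an assumption about the observed behavior, not the actual code in System.Process.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)

-- Assumed model of the truncation discussed in this thread:
-- keep only the low 8 bits of each code point.
-- (Hypothetical helper, not part of System.Process.)
truncateChars :: String -> String
truncateChars = map (chr . (.&. 0xFF) . ord)
```

For example, 'п' is U+043F, whose low byte 0x3F is '?', and 'р' is U+0440, whose low byte 0x40 is '@'. Applying truncateChars to "привет" gives "?@825B", matching the session.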

On 23 April 2010 21:44:29, John Goerzen wrote:
Here is a very interesting little problem.
    ghci
    GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
    Loading package ghc-prim ... linking ... done.
    Loading package integer-gmp ... linking ... done.
    Loading package base ... linking ... done.
    Prelude> :m System.Process
    Prelude System.Process> runCommand "echo привет"
    ?@825B
This is a minimal test case for a bug reported in HSH at http://github.com/jgoerzen/hsh/issues#issue/1
It is not entirely clear to me what the behavior here should be. It seems inconsistent with the default behavior of System.IO to, apparently, just strip the bits higher than 0xFF. On the other hand, when it's OS commands we're talking about, it's not entirely clear to me if the default should be to encode in UTF-8. There should almost certainly be an *option* controlling this, and perhaps a version of runProcess that accepts ByteStrings.
It should just use the system locale for encoding, as System.IO does. FYI, I just submitted a bug to the GHC Trac: http://hackage.haskell.org/trac/ghc/ticket/4006
P.S. Haskell libraries aren't very well designed with respect to Unicode and company.

John Goerzen writes:
    ghci
    GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
    Loading package ghc-prim ... linking ... done.
    Loading package integer-gmp ... linking ... done.
    Loading package base ... linking ... done.
    Prelude> :m System.Process
    Prelude System.Process> runCommand "echo привет"
    ?@825B
Are you arguing about IO-specific stuff like this, or for all non-ASCII Strings?
--
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

Ivan Lazar Miljenovic wrote:
Are you arguing about IO-specific stuff like this, or for all non-ASCII Strings?
I'm not sure I understand the question. I consider the behavior in System.IO to be well-documented. The behavior in System.Process is not documented at all. As I said, I'm not certain what the proper answer is, but not documenting what happens probably isn't it.
Actually, the behavior of openFile when given a String with characters > 0xFF is also completely undocumented. I am not sure what it does with that. It should probably be the same as runCommand, whatever it is.
-- John

On 24 April 2010 03:50:54, John Goerzen wrote:
Actually, the behavior of openFile when given a String with characters > 0xFF is also completely undocumented. I am not sure what it does with that. It should probably be the same as runCommand, whatever it is.
Under Unix systems, file names are just arrays of bytes. There is no notion of encoding at all; it is just a matter of how that array is interpreted. There is a problem with the FilePath data type: it is actually a String, but it should be an abstract data type. There is a relevant bug [1] on the GHC Trac.
P.S. openFile truncates Chars.
[1] http://hackage.haskell.org/trac/ghc/ticket/3307

Khudyakov Alexey wrote:
Actually, the behavior of openFile when given a String with characters > 0xFF is also completely undocumented. I am not sure what it does with that. It should probably be the same as runCommand, whatever it is.
Under Unix systems, file names are just arrays of bytes. There is no notion of encoding at all; it is just a matter of how that array is interpreted.
Quite right. One must be able to pass binary strings, which contain anything except \0 and '/', to openFile. The same goes for runCommand.
I am uncomfortable, for this reason, with saying that runCommand ought to re-encode in the system locale while openFile doesn't. It is preferable to drop characters than to drop the ability to pass arbitrary binary data.
So I am not sure I agree with your stance in http://hackage.haskell.org/trac/ghc/ticket/4006
-- John

On 24 April 2010 06:14:55, you wrote:
Quite right. One must be able to pass binary strings, which contain anything except \0 and '/' to openFile. The same goes for runCommand. I am uncomfortable, for this reason, with saying that runCommand ought to re-encode in the system locale while openFile doesn't. It is preferable to drop characters than to drop the ability to pass arbitrary binary data.
But truncation makes it impossible to pass non-ASCII strings portably. They should be encoded, but there is no easy way to do so.
Actually, the problem is the use of strings. A String is a sequence of _characters_, while a program talks to the outside world using sequences of bytes. I think that the right (but impossible) way to solve this problem is to use separate data types for file paths and command line arguments. Something along these lines:
    data FilePath = ...
    stringToFilePath :: String -> Maybe FilePath
    filePathToString :: FilePath -> Maybe String
Both functions are non-total, hence the presence of the Maybes. But it would break a LOT of code and violate the language definition.
I think there are two alternatives. One is to encode/decode strings using the current locale and provide [Word8]-based variants. The main problem is that seemingly innocent actions like getting directory contents could crash the program (with an exception).
Another option is to provide functions to encode/decode strings. This is ugly, mixes strings which hold characters with strings which hold bytes, and is completely un-Haskellish, but it seems there is no good solution.
Also, truncation could have security implications. It makes it almost impossible to escape dangerous characters robustly. Consider the following code. This is more a matter of speculation than a real threat, but nevertheless:
    import System.Process (runCommand)

    evil, maskedEvil :: String
    evil = "I am an evil script; date; echo I\\'m doing whatever I want"
    -- Every Char is shifted above 0xFF; truncation to the low 8 bits undoes this.
    maskedEvil = map (toEnum . (+256) . fromEnum) evil

    -- Should escape all dangerous chars
    escape :: String -> String
    escape = id

    oops :: IO ()
    oops = do
      runCommand ("echo " ++ maskedEvil ++ "")
      return ()
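
A minimal sketch of the abstract path type idea above, assuming UTF-8 as the encoding rather than the actual system locale. The names RawPath, stringToRawPath, rawPathToString, and encodeChar are hypothetical (RawPath is used to avoid clashing with the Prelude's FilePath), and none of this is an existing library API. It only illustrates why both conversions end up partial.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Hypothetical abstract path type: the raw bytes, as Unix sees them.
newtype RawPath = RawPath [Word8] deriving (Eq, Show)

-- Encode one Char as UTF-8. NUL and surrogate code points are rejected.
encodeChar :: Char -> Maybe [Word8]
encodeChar c
  | n == 0                     = Nothing
  | n < 0x80                   = Just [fromIntegral n]
  | n < 0x800                  = Just [0xC0 .|. hi 6, cont 0]
  | n >= 0xD800 && n <= 0xDFFF = Nothing
  | n < 0x10000                = Just [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise                  = Just [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi s = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Partial: fails if any Char cannot be encoded.
stringToRawPath :: String -> Maybe RawPath
stringToRawPath = fmap (RawPath . concat) . mapM encodeChar

-- Partial: fails if the bytes are not valid UTF-8.
rawPathToString :: RawPath -> Maybe String
rawPathToString (RawPath bytes) = go bytes
  where
    go [] = Just []
    go (b : rest)
      | b < 0x80           = (chr (fromIntegral b) :) <$> go rest
      | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral (b .&. 0x1F)) 0x80 rest
      | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral (b .&. 0x0F)) 0x800 rest
      | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral (b .&. 0x07)) 0x10000 rest
      | otherwise          = Nothing
    -- Read k continuation bytes; reject truncated, overlong,
    -- out-of-range, and surrogate sequences.
    multi k acc lo rest =
      case splitAt k rest of
        (cs, rest')
          | length cs == k
          , all isCont cs
          , let n = foldl step acc cs
          , n >= lo
          , n <= 0x10FFFF
          , not (n >= 0xD800 && n <= 0xDFFF) ->
              (chr n :) <$> go rest'
        _ -> Nothing
    isCont c = c .&. 0xC0 == 0x80
    step a c = (a `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
```

With these definitions, rawPathToString (RawPath [0xFF]) is Nothing, because 0xFF is never valid in UTF-8. That is exactly the kind of file name a directory listing could return on Unix.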

Actually, the behavior of openFile when given a String with characters > 0xFF is also completely undocumented. I am not sure what it does with that. It should probably be the same as runCommand, whatever it is.
Actually, the behaviour of openFile is known to be platform-dependent. According to Simon Marlow,
Be careful with FilePaths. On Windows they are interpreted as Unicode, on Unix they are interpreted as [Word8], by taking the low 8 bits of each Char. So if you always encode FilePaths to UTF-8, that will break on Windows. Fixing FilePaths is a high priority.
The last sentence gives me some hope. http://ghcmutterings.wordpress.com/2009/09/30/heads-up-what-you-need-to-know...
But truncation makes it impossible to pass non-ASCII strings portably. They should be encoded, but there is no easy way to do so.
Actually, the problem is the use of strings. A String is a sequence of _characters_, while a program talks to the outside world using sequences of bytes. I think that the right (but impossible) way to solve this problem is to use separate data types for file paths and command line arguments.
I think that Strings should be used _only_ for characters (code points). Using the same data type for encoded/truncated data is dangerous. Most of the current problems with Unicode are due to the fact that Strings could turn out to be anything (they are not strictly typed from this point of view). Hence the runtime checks and hacks like isUTF8Encoded :: String -> Bool, encodeString :: String -> String and decodeString :: String -> String...
So I absolutely agree that truncating is wrong. Expecting encoded data in Strings is wrong too. So the only option (apart from changing the standard library and introducing a new type for FilePath) is to do all the necessary conversions inside openFile and similar functions.
I think there are two alternatives. One is to encode/decode strings using the current locale and provide [Word8]-based variants. The main problem is that seemingly innocent actions like getting directory contents could crash the program (with an exception).
Actually, any IO action is unpredictable. So trying to get directory contents can produce an error (for various reasons, e.g. permission denied). If it reports an error when there are file names not representable in the current locale (e.g. they contain invalid UTF-8 sequences in a UTF-8 locale), the problem is likely to be wrong locale settings. What's the problem with an exception? I think [Word8] variants for those who want to deal with such cases (guessing the file system encoding etc.) are enough.
Another option is to provide functions to encode/decode strings. This is ugly, mixes strings which hold characters with strings which hold bytes, and is completely un-Haskellish, but it seems there is no good solution.
This is ugly, because it's impossible to know if a String is already encoded or not. This is ugly because application code will be polluted with conditional compilation to be cross-platform (or worse, people will forget to write cross-platform code in _some_ cases).
Also, truncation could have security implications. It makes it almost impossible to escape dangerous characters robustly. Consider the following code. This is more a matter of speculation than a real threat, but nevertheless:
Nice example. It shows that escaping should be the last step.
S.
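
The point about ordering can be sketched as follows. This assumes the truncation model discussed in the thread (low 8 bits of each Char) and a hypothetical single-quote escaper for a POSIX shell. The names truncateChars, shellQuote, escapedTooEarly, and escapedLast are illustrative only. Only evil and maskedEvil come from the example earlier in the thread.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)

evil, maskedEvil :: String
evil = "I am an evil script; date; echo I\\'m doing whatever I want"
maskedEvil = map (chr . (+ 256) . ord) evil

-- Assumed model of the truncation: keep the low 8 bits of each Char.
truncateChars :: String -> String
truncateChars = map (chr . (.&. 0xFF) . ord)

-- Hypothetical escaper: wrap the argument in single quotes,
-- replacing each embedded single quote with '\''.
shellQuote :: String -> String
shellQuote s = "'" ++ concatMap quoteChar s ++ "'"
  where
    quoteChar '\'' = "'\\''"
    quoteChar c    = [c]

-- Escaping before truncation: the escaper sees no quote characters,
-- so the quote that reappears after truncation is left unescaped.
escapedTooEarly :: String
escapedTooEarly = truncateChars (shellQuote maskedEvil)

-- Escaping as the last step: the escaper sees the bytes the shell will see.
escapedLast :: String
escapedLast = shellQuote (truncateChars maskedEvil)
```

Here escapedTooEarly is just evil wrapped in single quotes, with its embedded quote intact, while escapedLast has that quote escaped.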
participants (4)
- Ivan Lazar Miljenovic
- John Goerzen
- Khudyakov Alexey
- S. Astanin