Some starters for the new list

Let's have an open discussion on internationalization issues here. Some issues to discuss:
* Unicode in standard IO functions
  Since Haskell Chars are Unicode, should the standard Haskell 98 IO functions be made to obey the locale charset? If the locale charset is iso-8859-1, things will work as before (possibly with some transliteration done, instead of just taking character values modulo 256). See also the thread on the linux-utf8 mailing list (http://mail.nl.linux.org/linux-utf8/2000-10/msg00139.html). The library report gives no semantics for things like putStr; it ought to.
* Unicode library
  We need a good Unicode library, apart from the standard IO functions, that can be used with specialized IO, in bindings and so on.
* Other locale issues
  Should there be locale-dependent show functions (using "," as the decimal separator, etc.)?
* Translation issues
  What is the best way to make programs translatable? I have some files at http://www.dtek.chalmers.se/~d95mback/gettext/ which could be commented on.
I propose that when the Haskell Wiki comes online, we use it to produce draft enhancements to the existing Haskell 98 language report, and perhaps the new Unicode libraries.
Regards,
Martin
--
Martin Norbäck          d95mback@dtek.chalmers.se
Kapplandsgatan 40       +46 (0)708 26 33 60
S-414 78 GÖTEBORG       http://www.dtek.chalmers.se/~d95mback/
SWEDEN                  OpenPGP ID: 3FA8580B

Martin Norbäck
* Translation issues What is the best way to make programs translatable? I have some files at http://www.dtek.chalmers.se/~d95mback/gettext/ which could be commented on.
I just want to repeat something somebody suggested, and which I thought was a really neat idea: Have string constants in programs be replaced by (Prelude.fromString "..") or similar, like numerical constants are handled already.
This was suggested in order to simplify the use of PackedString, but I think it might come in handy for translation issues, too. Granted, it's not a complete printf with argument permutation and all, but at least it should be extremely simple, and, hey, it'll probably suffice for *my* uses :-)
(Naturally, the idea is that Prelude.fromString can be replaced by a function that looks the string up in a translation table, instead of using the default value. Any reason this won't work?)
-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
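For readers trying this today: the idea can be approximated in GHC with the OverloadedStrings extension and the IsString class from Data.String, neither of which is mentioned in the thread. The following is only a sketch under that assumption; the Translated newtype and the frenchTable catalogue are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map as Map
import Data.String (IsString (..))

-- Hypothetical catalogue; a real program would load this from a file.
frenchTable :: Map.Map String String
frenchTable = Map.fromList
  [ ("File not found", "Fichier introuvable")
  , ("Hello, World!", "Bonjour le monde !")
  ]

-- A string type whose literals are looked up in the catalogue.
newtype Translated = Translated { render :: String }
  deriving (Eq, Show)

instance IsString Translated where
  -- Fall back to the literal itself when no translation is known.
  fromString s = Translated (Map.findWithDefault s s frenchTable)

-- Both literals below are elaborated to calls of fromString.
greeting :: Translated
greeting = "Hello, World!"

untranslated :: Translated
untranslated = "Disk full"
```

Loaded into GHCi, `render greeting` gives the catalogue entry and `render untranslated` gives the literal back unchanged. Note that only literals used at type Translated are affected, so existing code written against String is not translated by this alone.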

On Thu, 2002-08-15 at 10:28, Ketil Z Malde wrote:
I just want to repeat something somebody suggested, and which I thought was a really neat idea: Have string constants in programs be replaced by (Prelude.fromString "..") or similar, like numerical constants are handled already.
This was suggested in order to simplify the use of PackedString, but I think it might come in handy for translation issues, too. Granted, it's not a complete printf with argument permutation and all, but at least it should be extremely simple, and, hey, it'll probably suffice for *my* uses :-)
(Naturally, the idea is that Prelude.fromString can be replaced by a function that looks the string up in a translation table, instead of using the default value. Any reason this won't work?)
It would probably work, but doing it like this probably won't be a good solution for translation. Why? Because not all strings should be translated. The programmer has to mark the strings that should be translated, by using a function like gettext (or __).
Using it for PackedString should be fine; something like Template Haskell could solve this without requiring more language extensions.
Regards,
Martin
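Martin's point about marking can be illustrated with a small pure sketch. The Catalogue type, the swedish table and the explicit catalogue argument are assumptions made for this example; real gettext keeps the catalogue in global state, and only the name __ comes from the message above.

```haskell
import qualified Data.Map as Map

type Catalogue = Map.Map String String

-- Hypothetical Swedish catalogue, for illustration only.
swedish :: Catalogue
swedish = Map.fromList [("File not found", "Filen hittades inte")]

-- Marker function: only strings passed through it are candidates for
-- translation. The original string is returned when no entry exists.
__ :: Catalogue -> String -> String
__ cat s = Map.findWithDefault s s cat

-- A user-visible message is marked for translation ...
report :: Catalogue -> String -> String
report cat path = __ cat "File not found" ++ ": " ++ path

-- ... while an internal identifier is left alone and never looked up.
logTag :: String
logTag = "io-error"
```

Because the lookup happens only where the programmer wrote __, strings such as logTag stay stable across locales, which is the distinction that an implicit fromString on every literal would lose.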

I just want to repeat something somebody suggested, and which I thought was a really neat idea: Have string constants in programs be replaced by (Prelude.fromString "..") or similar, like numerical constants are handled already.
This was suggested in order to simplify the use of PackedString, but I think it might come in handy for translation issues, too.
I find it a little hard to picture this, so let's fill in some details so that we can agree that we're talking about the same thing and also to make the idea more concrete.
Using typeclasses in this way would require us to make the encoding explicit in the typesystem. So we'd define a bunch of types corresponding to characters and to strings:
    data Char = ..      -- unicode
    data Latin1 = ...   -- Latin1
    ...
and we'd define two classes and the basic operations on them.
    class Enum a => Charset a where
      fromChar :: Char -> a
    class Ord a => String a where
      fromString :: String -> a
Why did I define two classes instead of just one? The more obvious design was to have
    class Enum a => Charset a where
      fromChar :: Char -> a
      fromString :: String -> [a]
but this wouldn't let us make PackedString an instance of it. This could be fixed using multiparameter type classes but splitting the class is easier. (We might revisit this decision if we want operations to convert Charsets to Strings and the like.)
Details:
- We might want to add operations to convert back to Unicode - though that might require additional parameters to fill in details not encoded in the type?
- What should we do if the conversion fails? For example, if I try to convert the unicode yin-yang character (\u262f) to Latin1?
- We probably want additional operations for strings like map, append, etc.
- fromString should be applied to strings used in patterns.
- This requires a minor change in the report which states that a string literal is just an abbreviation for a list of characters.
Overall, this looks like it might be a viable approach. The only potential showstopper seems to be what to do when conversion fails.
(Naturally, the idea is that Prelude.fromString can be replaced by a function that looks the string up in a translation table, instead of using the default value. Any reason this won't work?)
This goes quite a bit further than what I suggest above but let's try to sketch it out.
1) You have to define a new string type:
    newtype FrenchString = FS String
2) You have to define an instance:
    instance String FrenchString where
      fromString "General Protection Fault" = FS "..."
      fromString "File not found" = FS "..."
      ...
      fromString _ = ????
Well, it seems simple enough. Once again though, we have the problem of what to do when the conversion fails. What happens in the real world? Do they print the string in English and hope for the best?
I don't feel entirely comfortable with doing things this way. I think I'd prefer to see an explicit call to a translation function like 'toFrench'. I presume that the advantage of this approach would be that you could use existing libraries without change? Unfortunately, the way I've sketched it out, the code has to be modified to use the type 'FrenchString' instead of 'String' so we don't achieve this goal.
Overall, this doesn't look like it will work.
--
Alastair Reid  alastair@reid-consulting-uk.ltd.uk
Reid Consulting (UK) Limited  http://www.reid-consulting-uk.ltd.uk/alastair/
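One possible answer to the "what if the conversion fails?" question raised above is to make failure visible in the type. The sketch below is a variant of the Charset class, not the class as proposed: it drops the Enum superclass, returns Maybe from fromChar, and adds toChar and encode, all of which are assumptions made for this illustration.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- A Latin-1 character is a single byte.
newtype Latin1 = Latin1 Word8
  deriving (Eq, Ord, Show)

-- Variant of the Charset class with failure made explicit.
class Charset a where
  fromChar :: Char -> Maybe a  -- Nothing if the character is not representable
  toChar   :: a -> Char        -- back to Unicode; total for Latin-1

instance Charset Latin1 where
  fromChar c
    | ord c <= 0xFF = Just (Latin1 (fromIntegral (ord c)))
    | otherwise     = Nothing
  toChar (Latin1 w) = chr (fromIntegral w)

-- Convert a whole string; one unrepresentable character fails the lot.
encode :: Charset a => String -> Maybe [a]
encode = mapM fromChar
```

With this, `encode "\x262f" :: Maybe [Latin1]` is Nothing, so the caller decides whether to substitute a replacement character, transliterate, or report an error. The cost is that an overloaded string literal could no longer be a plain total function, which is part of why the question is hard.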

On Thu, Aug 15, 2002 at 01:11:46PM +0100, Alastair Reid wrote:
(Naturally, the idea is that Prelude.fromString can be replaced by a function that looks the string up in a translation table, instead of using the default value. Any reason this won't work?)
This goes quite a bit further than what I suggest above but let's try to sketch it out.
1) You have to define a new string type:
2) You have to define an instance:
I think this is not really what was meant. The gettext C function takes a C string, looks it up in a translation table and returns a pointer to the translated string (or to the original if no translation is found). AFAIK GHC, for example, stores string literals as C char* and not as Haskell lists. Thus the question is how to tell the compiler that it should use
    peekCString (gettext strPtrToLiteral)
instead of
    peekCString strPtrToLiteral
for each string literal in the source code. We surely don't want to generate a Haskell list of characters, convert that back to a C string buffer, call gettext and convert the return value again to a Haskell list of Char.
Making string literals work with [Char] and PackedString is an orthogonal issue to i18n. This would involve introducing a class and instances.
When Haskell source files are read by character-set-aware versions of readFile, GHC has to change so that string literals are stored in UTF-8 and accessed by something like peekUTF8String.
Axel.

Aren't you mixing two different problems? I see these:
1) Choose a string in a language of the user's preference.
2) (De)serialize characters according to some codec.
I still feel that, "inside the Haskell universe", a Char should be just that: a character. This, to me, implies that it should be able to hold every possible character "value" - i.e. Char should represent a Unicode character (code point - is that the correct term?).
Now we must tackle the two problems above. In this mail I'll concentrate on no. 1. What's wrong with this:
    data Lang = En | Fr | De | ...
    data Msg = Hello | NoSuchFile | ...
    trans En Hello = "Hello, World!"
    trans Fr Hello = "Bonjour le monde!"
    trans De Hello = "Hallo Welt!"
    ...
    trans En _ = error "Gah, provide at least english!"
    trans _ m = trans En m
    main = do
      l <- systemLang   -- returns a Lang
      putStr (trans l Hello)
Or, if you want string lookup to avoid the extra data type(s):
    trans "en" msg = msg
    trans "fr" "Hello, World!" = "Bonjour, \231a va?"
    trans "de" "Hello, World!" = "Moin moin!"
    trans _ msg = msg
    main = do
      l <- systemLang   -- would return String in this case
      putStr (trans l "Hello, World!")
Of course you have to pass the language parameter to all non-IO functions, but would you want it otherwise?
After some pondering, I think we should base i18n on the first snippet's approach:
- There will be a data type representing a "message" which will be displayed to the user. The message is then translated to a string in a given language.
- The languages are, this appears natural to me, values of a data type again. We can have ctors for everything there are ISO codes for or so. In order to be extensible, a ctor taking a string argument appears suitable.
I suspect we can build all the convenience we need on top of this easy and clear to understand basis. For instance, if we want the string lookup way of things, one can just use String as the message type, like this:
    type Msg = String
    trans
The choice is up to the application developer.
I'd personally tend to use a real data type, because that lets me do this:
    data Msg = ... | MessagesWaiting Int | ...
    trans En (MessagesWaiting n) = "You have "++(show n)++m++" waiting."
      where m | n==1      = " message"
              | otherwise = " messages"
    trans De (MessagesWaiting n)
      | n==1      = "Sie haben eine neue Nachricht."
      | otherwise = "Es warten "++(show n)++" Nachrichten auf Sie."
I can't think of a more contrived example right now, but I'm sure many exist. Also note that (once there is some sort of locale-aware show) one will get the benefits of that coherently across all messages.
On Thu, 2002-08-15 at 14:11, Alastair Reid wrote:
I just want to repeat something somebody suggested, and which I thought was a really neat idea: Have string constants in programs be replaced by (Prelude.fromString "..") or similar, like numerical constants are handled already.
This was suggested in order to simplify the use of PackedString, but I think it might come in handy for translation issues, too.
I find it a little hard to picture this so let's fill in some details so that we can agree that we're talking about the same thing and also to make the idea more concrete.
Using typeclasses in this way would require us to make the encoding explicit in the typesystem. So we'd define a bunch of types corresponding to characters and to strings:
    data Char = ..      -- unicode
    data Latin1 = ...   -- Latin1
    ...
and we'd define two classes and the basic operations on them.
    class Enum a => Charset a where
      fromChar :: Char -> a
    class Ord a => String a where
      fromString :: String -> a
Why did I define two classes instead of just one? The more obvious design was to have
    class Enum a => Charset a where
      fromChar :: Char -> a
      fromString :: String -> [a]
but this wouldn't let us make PackedString an instance of it. This could be fixed using multiparameter type classes but splitting the class is easier. (We might revisit this decision if we want operations to convert Charsets to Strings and the like.)
Details:
- We might want to add operations to convert back to Unicode - though that might require additional parameters to fill in details not encoded in the type?
- What should we do if the conversion fails? For example, if I try to convert the unicode yin-yang character (\u262f) to Latin1?
- We probably want additional operations for strings like map, append, etc.
- fromString should be applied to strings used in patterns.
- This requires a minor change in the report which states that a string literal is just an abbreviation for a list of characters.
Overall, this looks like it might be a viable approach. The only potential showstopper seems to be what to do when conversion fails.
(Naturally, the idea is that Prelude.fromString can be replaced by a function that looks the string up in a translation table, instead of using the default value. Any reason this won't work?)
This goes quite a bit further than what I suggest above but let's try to sketch it out.
1) You have to define a new string type:
newtype FrenchString = FS String
2) You have to define an instance:
    instance String FrenchString where
      fromString "General Protection Fault" = FS "..."
      fromString "File not found" = FS "..."
      ...
      fromString _ = ????
Well, it seems simple enough. Once again though, we have the problem of what to do when the conversion fails. What happens in the real world? Do they print the string in English and hope for the best?
I don't feel entirely comfortable with doing things this way. I think I'd prefer to see an explicit call to a translation function like 'toFrench'. I presume that the advantage of this approach would be that you could use existing libraries without change? Unfortunately, the way I've sketched it out, the code has to be modified to use the type 'FrenchString' instead of 'String' so we don't achieve this goal.
Overall, this doesn't look like it will work.
-- Alastair Reid alastair@reid-consulting-uk.ltd.uk Reid Consulting (UK) Limited http://www.reid-consulting-uk.ltd.uk/alastair/ _______________________________________________ Haskell-i18n mailing list Haskell-i18n@haskell.org http://www.haskell.org/mailman/listinfo/haskell-i18n
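The typed-message approach from Sven's first snippet can be made to compile roughly as follows. This is a sketch with several assumptions: the open-ended constructor lists are cut down to a few cases, the English text for NoSuchFile is invented, the runtime error for a missing English entry is replaced by a total english function, and systemLang is a hypothetical helper that guesses the language from the LANG environment variable.

```haskell
import Data.Maybe (fromMaybe)
import System.Environment (lookupEnv)

data Lang = En | Fr | De
  deriving (Eq, Show)

data Msg = Hello | NoSuchFile
  deriving (Eq, Show)

-- English is the base language and must cover every message.
english :: Msg -> String
english Hello      = "Hello, World!"
english NoSuchFile = "No such file."

-- Per-language lookup; Nothing means "no translation available".
translate :: Lang -> Msg -> Maybe String
translate En msg   = Just (english msg)
translate Fr Hello = Just "Bonjour le monde!"
translate De Hello = Just "Hallo Welt!"
translate _  _     = Nothing

-- Fall back to English when a translation is missing.
trans :: Lang -> Msg -> String
trans lang msg = fromMaybe (english msg) (translate lang msg)

-- Hypothetical: map a locale name such as "de_DE.UTF-8" to a Lang.
langFromLocale :: Maybe String -> Lang
langFromLocale v = case fmap (take 2) v of
  Just "fr" -> Fr
  Just "de" -> De
  _         -> En

-- Hypothetical systemLang, based on the LANG environment variable.
systemLang :: IO Lang
systemLang = langFromLocale <$> lookupEnv "LANG"
```

A program would then run `systemLang >>= \l -> putStrLn (trans l Hello)`. Keeping english total means the compiler's incomplete-pattern warning can flag a message that was added without any text.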

Sorry, I accidentally hit CTRL-Return, which sent the message prematurely. Here is the rest again:
On Thu, 2002-08-15 at 20:21, Sven Moritz Hallberg wrote:
[...]
After some pondering, I think we should base i18n on the first snippet's approach:
- There will be a data type representing a "message" which will be displayed to the user. The message is then translated to a string in a given language.
- The languages are, this appears natural to me, values of a data type again. We can have ctors for everything there are ISO codes for or so. In order to be extensible, a ctor taking a string argument appears suitable.
I suspect we can build all the convenience we need on top of this easy and clear to understand basis. For instance, if we want the string lookup way of things, one can just use String as the message type, like this:
    type Msg = String
    trans De "There is no space left on the storage device." = "Der Datentraeger ist voll."
    trans En msg = msg
The choice is up to the application developer. I'd personally tend to use a real data type, because that lets me do this:
    data Msg = ... | MessagesWaiting Int | ...
    trans En (MessagesWaiting n) = "You have "++(show n)++m++" waiting."
      where m | n==1      = " message"
              | otherwise = " messages"
    trans De (MessagesWaiting n)
      | n==1      = "Sie haben eine neue Nachricht."
      | otherwise = "Es warten "++(show n)++" Nachrichten auf Sie."
I can't think of a more contrived example right now, but I'm sure many exist. Also note that (once there is some sort of locale-aware show) one will get the benefits of that coherently across all messages.
Reading translation tables from files is of course important. How about
    readTransTable :: Read msg => String -> IO (msg -> String)
which takes a filename and yields a translation function? This would work for any data type in the Read class; String already is an instance, and a custom type could simply derive Read.
Again, sorry for the accidental double post. :/
Regards,
Sven Moritz
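A possible shape for readTransTable, as a sketch only. It departs from the signature proposed above in two assumed ways: it needs an Ord constraint to build a Data.Map, and it takes a fallback function for messages missing from the file. The file format (one Haskell-syntax pair per line) and the Msg type are also invented for the example.

```haskell
import qualified Data.Map as Map
import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

-- Example message type; deriving Read is what makes the table parseable.
data Msg = Hello | DiskFull
  deriving (Eq, Ord, Show, Read)

-- Parse a table with one pair per line, for example:
--   (Hello,"Hallo Welt!")
-- Lines that do not parse as a pair are skipped.
parseTransTable :: (Read msg, Ord msg) => String -> Map.Map msg String
parseTransTable = Map.fromList . mapMaybe readMaybe . lines

-- Read a table from a file and turn it into a translation function.
-- The fallback is used for messages the table does not mention.
readTransTable :: (Read msg, Ord msg)
               => (msg -> String) -> FilePath -> IO (msg -> String)
readTransTable fallback path = do
  table <- parseTransTable <$> readFile path
  pure (\m -> Map.findWithDefault (fallback m) m table)
```

Splitting out the pure parseTransTable keeps the parsing testable without touching the file system. With `type Msg = String` the same code works unchanged, since String is already in Read and Ord, at the price of writing keys as quoted strings in the file.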
participants (5)
- Alastair Reid
- Axel Simon
- Ketil Z Malde
- Martin Norbäck
- Sven Moritz Hallberg