
A SourceForge project for the internationalisation effort is active at http://sourceforge.net/projects/haskell-i18n/

I've added my Unicode character properties code. Check it out (cvs co). Note that the build process grabs about 850K of data files from the Unicode website with wget. You'll also need m4 and ghc installed, and compiling some of the files takes forever. If you have haddock installed you can make HTML documentation with 'make doc'.

Feel free to contribute more useful functionality, such as encodings. Let me know if you want to be added as a developer. I've been making everyone an admin; I don't know if that's a good idea or not...

-- Ashley Yakeley, Seattle WA

Thu 2002-08-29 at 04:20, Ashley Yakeley wrote:
A SourceForge project for the internationalisation effort is active at http://sourceforge.net/projects/haskell-i18n/
I've added my Unicode character properties code. Check it out (cvs co).
Nice. Who will supply good UTF-8 code? I have some at http://www.dtek.chalmers.se/~d95mback/gettext/ but it is not in good shape.

Where should a UTF-8 module be put? Text.UTF8? Something like (just drafting here):

Text.UTF8.encodeChar :: Char -> [Word8]       -- (or Array?)
Text.UTF8.encodeString :: String -> [Word8]   -- (or Array?)
Text.UTF8.decodeChar :: [Word8] -> Either (Char, [Word8]) Error
Text.UTF8.decodeString :: [Word8] -> (String, [Word8], [Error])

Another thing: why not use the hierarchy even more, and put the sub-modules of Text.Unicode in Text.Unicode.UnicodeDefs and so on?
Feel free to contribute more useful functionality, such as encodings. Let me know if you want to be added as a developer. I've been making everyone an admin, I don't know if that's a good idea or not...
Trust everyone until they give you a reason not to...

Regards,

Martin

--
Martin Norbäck           d95mback@dtek.chalmers.se
Kapplandsgatan 40        +46 (0)708 26 33 60
S-414 78 GÖTEBORG        http://www.dtek.chalmers.se/~d95mback/
SWEDEN                   OpenPGP ID: 3FA8580B

On Thu, 2002-08-29 at 10:22, Martin Norbäck wrote:
Thu 2002-08-29 at 04:20, Ashley Yakeley wrote:
A SourceForge project for the internationalisation effort is active at http://sourceforge.net/projects/haskell-i18n/
I've added my Unicode character properties code. Check it out (cvs co).
Nice. Who will supply good UTF-8 code? I have some at http://www.dtek.chalmers.se/~d95mback/gettext/ but it is not in good shape.
With the ICFP contest finally over, I have just committed mine to CVS (thanks for setting it up, Ashley!). I hope it is of reasonable quality; I've not performance-tested it. I'm looking forward to all feedback.
Where should a UTF-8 module be put? Text.UTF8?
In accordance with Simon's hierarchy page, I've put it into Text.Encoding.UTF8.
something like (just drafting here):
Text.UTF8.encodeChar :: Char -> [Word8]       -- (or Array?)
Text.UTF8.encodeString :: String -> [Word8]   -- (or Array?)
Text.UTF8.decodeChar :: [Word8] -> Either (Char, [Word8]) Error
Text.UTF8.decodeString :: [Word8] -> (String, [Word8], [Error])
Pretty much! I have:

encodeOne :: Char -> [Word8]   -- encodeChar is probably prettier
encode :: String -> [Word8]    -- encodeString? I don't care.
decodeOne :: [Word8] -> (Either Error Char, Int, [Word8])
  -- 2nd component: number of bytes consumed
  -- 3rd component: rest of bytes
decode :: [Word8] -> (String, [(Error,Int)])
  -- 2nd component: list of errors and their index in the byte stream
  -- Maybe we should reverse the order of error/index
  -- so it looks like any association list?

Comments welcome.

Regards,
Sven Moritz
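[Editor's note: for concreteness, here is a minimal sketch of what the encoding half of such an interface might look like. This is an illustration only, not the code committed to CVS; the names follow Sven's draft.]

```haskell
import Data.Word (Word8)
import Data.Char (ord)
import Data.Bits (shiftR, (.&.), (.|.))

-- Encode one character to its (shortest) UTF-8 byte sequence.
encodeOne :: Char -> [Word8]
encodeOne c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. high 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. high 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. high 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- payload bits of the leading byte
    high s = fromIntegral (n `shiftR` s)
    -- a 10xxxxxx continuation byte carrying six payload bits
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

encode :: String -> [Word8]
encode = concatMap encodeOne
```

For example, encodeOne '\x20AC' (the Euro sign) yields [226,130,172].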

Sven Moritz Hallberg wrote:
decodeOne :: [Word8] -> (Either Error Char, Int, [Word8])
  -- 2nd component: number of bytes consumed
  -- 3rd component: rest of bytes
decode :: [Word8] -> (String, [(Error,Int)])
  -- 2nd component: list of errors and their index in the byte stream
  -- Maybe we should reverse the order of error/index
  -- so it looks like any association list?
Comments welcome.
IMHO:
1. Decoders should present a consistent interface; possibly by means
of a class. Modal encodings would need some kind of state.
2. "decode" should return the remaining bytes, in case the input ends
with a partial character (or contains errors; see next).
3. The basic decoder interface shouldn't attempt to recover from
errors. Rather, it should return the list of complete characters, the
list of remaining octets, and the final state. Any error recovery
should be an optional add-on.
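[Editor's note: point 1 might be sketched as a class along the following lines. The names are made up for illustration; the state parameter is what a modal encoding such as ISO 2022 would thread through between calls.]

```haskell
import Data.Word (Word8)

-- A decoder carries explicit state between calls, so modal encodings
-- (which switch character sets mid-stream) fit the same interface as
-- stateless ones.
class Decoder state where
  initState :: state
  -- Feed octets; return the complete characters decoded, the
  -- remaining (unconsumed) octets, and the final decoder state.
  decodeWith :: state -> [Word8] -> (String, [Word8], state)

-- The trivial stateless example: ISO-8859-1, one octet per character.
data Latin1 = Latin1

instance Decoder Latin1 where
  initState = Latin1
  decodeWith s octets = (map (toEnum . fromIntegral) octets, [], s)
```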
--
Glynn Clements

Glynn Clements
decodeOne :: [Word8] -> (Either Error Char, Int, [Word8])
  -- 2nd component: number of bytes consumed
  -- 3rd component: rest of bytes
Huh? Either..or? I can't make sense of the declaration, what's with the commas?
Comments welcome.
I vote for 'decodeChar' for single characters, and just 'decode' for String. FWIW.
3. The basic decoder interface shouldn't attempt to recover from errors. Rather, it should return the list of complete characters, the list of remaining octets, and the final state. Any error recovery should be an optional add-on.
Could error handling be passed as a parameter to the encoder, perhaps? E.g. if I'm not really interested in debugging the code, just extracting what's possible, I could pass an error handler that tries to skip errors and keep going, without having to pollute my higher-level code with it?

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Ketil Z. Malde wrote:
3. The basic decoder interface shouldn't attempt to recover from errors. Rather, it should return the list of complete characters, the list of remaining octets, and the final state. Any error recovery should be an optional add-on.
Could error handling be passed as a parameter to the encoder, perhaps? E.g. if I'm not really interested in debugging the code, just extracting what's possible, I could pass an error handler that tries to skip errors and keep going, without having to pollute my higher level code with it?
I would prefer not to see the base decoders cluttered with error
recovery functionality.
IMHO, a better alternative would be to provide functions to generate
fault-tolerant decoders from an existing decoder, e.g. by repeatedly
calling the underlying decoder until all octets have been consumed,
handling errors in either a predefined or user-defined manner.
Individual encodings could also provide custom fault-tolerant
decoders; in some cases, it may be desirable to have a choice of
several alternatives.
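[Editor's note: such a generator could be sketched as follows. The names and the error policy, substituting U+FFFD and resynchronising one octet later, are hypothetical choices for illustration.]

```haskell
import Data.Word (Word8)

-- A strict one-shot decoder: either an error, or a character plus the
-- number of octets it consumed (assumed to be at least 1).
type DecodeOne err = [Word8] -> Either err (Char, Int)

-- Generate a fault-tolerant decoder from a strict one by repeatedly
-- calling it until all octets are consumed, replacing each offending
-- octet with U+FFFD (the Unicode replacement character).
lenient :: DecodeOne err -> [Word8] -> String
lenient _ []     = []
lenient d octets = case d octets of
  Right (c, n) -> c : lenient d (drop n octets)
  Left _       -> '\xFFFD' : lenient d (drop 1 octets)

-- A toy strict decoder (plain ASCII) to drive the wrapper.
asciiOne :: DecodeOne ()
asciiOne (w:_) | w < 0x80 = Right (toEnum (fromIntegral w), 1)
asciiOne _                = Left ()
```

A user-defined error policy (Ketil's handler-as-parameter idea) would simply replace the hard-wired Left branch with a supplied function.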
The nature of the problem differs substantially between different
types of encoding.
ISO-8859-* is trivial; one octet corresponds to one character. The
only possible error is an undefined codepoint (e.g. 0x80-0x9F); there
are no synchronisation issues.
UTF-8 is almost as simple; character boundaries are unambiguous, even
for invalid streams. However, there exists some variation between
existing decoders. Over-long sequences (e.g. using a two byte sequence
to represent a 7-bit character) are technically invalid, but many
decoders allow this; some applications even encourage it (e.g. using
0xC0,0x80 to represent an "embedded" NUL).
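[Editor's note: the over-long check itself is cheap once the code point and the octet count of the sequence are known; a small sketch with illustrative helper names:]

```haskell
import Data.Char (ord)

-- The minimal number of octets UTF-8 needs for a code point.
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

-- A sequence is over-long if it spent more octets than necessary,
-- e.g. the two-octet form 0xC0,0x80 for NUL.
overlong :: Char -> Int -> Bool
overlong c octetsUsed = octetsUsed > utf8Length c
```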
Other encodings may be more problematic, and error recovery may
involve some guesswork. This can be helped by knowledge of the
relative likelihood of certain classes of error and/or the nature of
the text (e.g. language).
My main concern is that less knowledgeable users don't end up being
"steered" into using dubious semantics (e.g. fault tolerance,
especially when involving somewhat arbitrary heuristics) by way of
using the "default" interface.
Of particular concern are the potential security implications. The
most obvious[1] example is the use of invalid or ambiguous encoded
forms to circumvent access controls or input validation.
[1] Obvious to regular readers of BugTraq, at least; this specific
issue has been identified as a security problem in a wide range of
products.
Basically, my view is that if the user is required to explicitly
choose fault tolerance, there's more chance that they will consider
some of the issues involved than if some form of fault tolerance is
"bundled".
--
Glynn Clements

Tue 2002-09-03 at 01:06, Sven Moritz Hallberg wrote:
On Thu, 2002-08-29 at 10:22, Martin Norbäck wrote:
Thu 2002-08-29 at 04:20, Ashley Yakeley wrote:
A SourceForge project for the internationalisation effort is active at http://sourceforge.net/projects/haskell-i18n/
I've added my Unicode character properties code. Check it out (cvs co).
Nice. Who will supply good UTF-8 code? I have some at http://www.dtek.chalmers.se/~d95mback/gettext/ but it is not in good shape.
With the ICFP contest finally over, I have just committed mine to CVS (thanks for setting it up, Ashley!). I hope it is of reasonable quality; I've not performance-tested it. I'm looking forward to all feedback.
Hehe, I was about to commit my version yesterday, but decided to wait until today to check some things. Well, I'll commit anyway, UTF8norpan.hs, and we'll see what happens :) They are quite different, so you could compare them. Must work now.

Regards,

Martin

Tue 2002-09-03 at 01:06, Sven Moritz Hallberg wrote:
With the ICFP contest finally over, I have just committed mine to CVS (thanks for setting it up Ashley!). I hope it is of reasonable quality, I've not performance-tested it. I'm looking forward to all feed-back.
My QuickCheck tests found errors in three- and four-byte encodings:

*Text.Encoding.TestUTF8> encodeOne (toEnum 51177)
[236,159,140]
*Text.Encoding.TestUTF8> decodeOne [236,159,140]
(Right '\51148',3,[])
*Text.Encoding.TestUTF8> encodeOne (toEnum 845466)
[243,206,154,154]
*Text.Encoding.TestUTF8> decodeOne [243,206,154,154]
(Left (InvalidLaterByte 1),1,[206,154,154])

51177 should encode into [236,159,169] and 845466 into [243,142,154,154].

I see you use Hugs; you could consider using GHCi, it's really nice and works exactly like Hugs.

Regards,

Martin
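[Editor's note: the kind of round-trip property these tests exercise can be stated once against any encoder/decoder pair with the signatures discussed in this thread. The names below are placeholders; with the QuickCheck library one would run quickCheck on it, but it is an ordinary Bool-valued function.]

```haskell
import Data.Word (Word8)

-- Decoding an encoded character should give back exactly that
-- character, consume exactly the encoded octets, and leave nothing over.
prop_EncDec :: ([Word8] -> (Either e Char, Int, [Word8]))  -- a decodeOne
            -> (Char -> [Word8])                           -- an encodeOne
            -> Char -> Bool
prop_EncDec decOne encOne c =
  case decOne (encOne c) of
    (Right c', n, rest) -> c' == c && n == length (encOne c) && null rest
    _                   -> False
```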

On Tue, 2002-09-03 at 10:46, Martin Norbäck wrote:
My quickCheck tests found errors in three and four byte encodings:
*Text.Encoding.TestUTF8> encodeOne (toEnum 51177)
[236,159,140]
*Text.Encoding.TestUTF8> decodeOne [236,159,140]
(Right '\51148',3,[])
*Text.Encoding.TestUTF8> encodeOne (toEnum 845466)
[243,206,154,154]
*Text.Encoding.TestUTF8> decodeOne [243,206,154,154]
(Left (InvalidLaterByte 1),1,[206,154,154])
51177 should encode into [236,159,169] and 845466 into [243,142,154,154].
Ooh. I'm on it, thanks for the test.
I see you use Hugs; you could consider using GHCi, it's really nice and works exactly like Hugs.
Yes, my last coding session on it was done on my PowerBook with Linux, on which GHC unfortunately doesn't run, yet. :(

Sven Moritz

On Tue, 2002-09-03 at 10:46, Martin Norbäck wrote:
My quickCheck tests found errors in three and four byte encodings:
*Text.Encoding.TestUTF8> encodeOne (toEnum 51177)
[236,159,140]
*Text.Encoding.TestUTF8> decodeOne [236,159,140]
(Right '\51148',3,[])
*Text.Encoding.TestUTF8> encodeOne (toEnum 845466)
[243,206,154,154]
*Text.Encoding.TestUTF8> decodeOne [243,206,154,154]
(Left (InvalidLaterByte 1),1,[206,154,154])
51177 should encode into [236,159,169] and 845466 into [243,142,154,154].
Fixed. All three properties seem to be satisfied in GHCi now. Beware of prop_DecEnc, though: quickCheck erroneously claims to falsify it. I guess this is caused by a bug in GHC 5.04's handling of strings that hit the list some time ago (try "\666"==['\666']). Manually applying prop_DecEnc to the alleged counterexample has always yielded True for me.

Regards,
Sven Moritz

Thu 2002-08-29 at 04:20, Ashley Yakeley wrote:
A SourceForge project for the internationalisation effort is active at http://sourceforge.net/projects/haskell-i18n/
Searching for "haskell" at SourceForge gave me these:

hfl (Andrew Bromage) - Haskell Foundation Library
  "The future of the Haskell standard library is here."
  CVS contains edison and some Monads.

hbase (Ashley Yakeley)
  "Common Haskell code for other semantic.org projects, and other useful things."
  CVS contains a number of good-looking modules (Org.Org.Semantic.*), among other things HBase.Text.UTF8.

haskell-libs - Haskell User-Submitted Libraries
  "Haskell User-Submitted Libraries includes anything we can get our hands on and clean up for general use, as well as new libraries that would be useful to the general Haskell community. Submit your libs, or join the project to create needed libs!"
  CVS contains Imap.hs.

haskell-i18n (the latest addition :)

Perhaps we should try and merge these; is there any point in having multiple projects and repositories when we have hierarchical libraries?

Regards,

Martin

G'day all. On Thu, Aug 29, 2002 at 11:31:16AM +0200, Martin Norbäck wrote:
hfl (Andrew Bromage) Haskell Foundation Library "The future of the Haskell standard library is here." CVS contains edison and some Monads
That would be me. HFL currently contains my personally hacked-over version of Edison, featuring fundeps and "not just maybe" methods, plus one or two of the gaps filled in. The other part is the Monad Template Library with one addition (MonadNondet) and one more in development, not yet checked in.

My thinking behind hfl is similar to Boost for C++. In particular, I want to produce libraries which are suitable for future standardisation. That is, a library has to be general enough, flexible enough and generally useful enough to be in a future incarnation of the standard library.

To some extent, this is closely aligned with the philosophy behind haskell-i18n, but perhaps not with the other two projects, which look to me more like CPAN, that is, a repository of useful libraries most of which most programs would not need. The line between the two ideas is somewhat fuzzy, but intuitively, an awful lot of programs need FiniteMap or something like it, but not so many need an IMAP library, even though an IMAP library is a very good thing to have for those times you need it.
Perhaps we should try and merge these, is there any point in having multiple projects and repositories when we have hierarchical libraries?
Semi-random half-baked thoughts follow.

The major problem with Haskell libraries at the moment, and one of the reasons why I started HFL, is that there are a lot of existing libraries "out there" which have nothing in common. Naming schemes are inconsistent, error/exception handling is inconsistent, iterator support is handled in a dozen different ways and so on.

This is partly a symptom of the fact that there's not a lot of common engineering experience with Haskell "out there". Those few places which do use Haskell (or similar languages) don't cross-pollinate with each other nearly as much as, say, C++ programmers. As a result, there are as many styles as there are programmers. Plus, to be honest, most of the people who write Haskell libraries are scientists. When you have a research quota to fill, integration with existing libraries isn't high on your agenda, and rightly so.

Now I'm not opposed to merging in principle. However, if my intuition on the relative philosophies behind the four current projects is correct, then it seems to me that what we really want is two (sub-)projects: the "core" libraries, and the "add on" libraries. Naturally we'd want these (sub)projects to work as symbiotically as possible.

I'd also like, eventually, to institute some proper unit testing, code auditing and peer review, particularly on anything we're proposing for standardisation, but this can be discussed later.

Cheers,
Andrew Bromage

From: Andrew J Bromage [mailto:ajb@spamcop.net] Sent: 30 August 2002 02:33
Perhaps we should try and merge these, is there any point in having multiple projects and repositories when we have hierarchical libraries?
Semi-random half-baked thoughts follow.
The major problem with Haskell libraries at the moment, and one of the reasons why I started HFL, is that there are a lot of existing libraries "out there" which have nothing in common. Naming schemes are inconsistent, error/exception handling is inconsistent, iterator support is handled in a dozen different ways and so on.
Agreed that there are lots of inconsistent libraries out there, but why start a new project when there's already libraries@haskell.org? Surely this is the right point of focus for developing new libraries, and we also have a CVS repository for the code: fptools/libraries on cvs.haskell.org. We also have the beginnings of guidelines for naming conventions and coding style.

Perhaps it's because it appears that the barrier to getting one's code into fptools/libraries is quite high. Really, it's not that hard - the only reasons I would actively argue against something going into fptools/libraries are: if there is duplication of functionality between libraries that will only serve to confuse users, or if there is substantial disagreement about whether a particular API is the "right thing".

I see the process of standardisation as separate; at some point after the libraries have matured in fptools/libraries for some time, we will standardise some of them in a Haskell 98 addendum. This will probably be an ongoing process, with more libraries becoming standardised as they mature.
I'd also like, eventually, to institute some proper unit testing, code auditing and peer review, particularly on anything we're proposing for standardisation, but this can be discussed later.
Absolutely. Feel free to write down your ideas and we'll integrate them into the library project documentation, such as it is: fptools/libraries/docs/libraries.sgml. There's an online version here: http://www.haskell.org/~simonmar/libraries/libraries.html

Cheers,
Simon

G'day all. On Fri, Aug 30, 2002 at 10:09:25AM +0100, Simon Marlow wrote:
Agreed that there are lots of inconsistent libraries out there, but why start a new project when there's already libraries@haskell.org? Surely this is the right point of focus for developing new libraries, and we also have a CVS repository for the code: fptools/libraries on cvs.haskell.org. We also have the beginnings of guidelines for naming conventions and coding style.
The short answer is that, speaking as a user of GHC and GHCi, I'd prefer to have libraries with mature interfaces "out of the box". Integrating experimental libraries too early inevitably creates a body of legacy code that I don't want to be responsible for.

For example, for my part I've made a lot of changes in my version of Edison which are not backwards-compatible, which I believe are an improvement, but I still don't know if I've got it "right" yet. I don't want to break everyone's Edison code only to have it break again next release. Of course if the changes turn out not to require incompatibilities, there's nothing stopping me from submitting them, but had I been hacking fptools/libraries to begin with, I might have been more hesitant about playing with existing libraries in the first place.

The long answer I won't go into in detail, but part of the problem is that being a fptools/libraries developer basically means having a GHC development environment. That requires an investment which I'm personally not able to make at the moment.
Really, it's not that hard - the only reasons I would actively argue against something going into fptools/libraries are: if there is duplication of functionality between libraries that will only serve to confuse users, or if there is substantial disagreement about whether a particular API is the "right thing".
...both of which describe exactly the situation that I'm in at the moment!
I see the process of standardisation as separate; at some point after the libraries have matured in fptools/libraries for some time, we will standardise some of them in a Haskell 98 addendum. This will probably be an ongoing process, with more libraries becoming standardised as they mature.
This raises an interesting question, because almost all newer libraries (certainly the ones I write) use non-98 language features. I suspect that this is true of many new libraries, and certainly most of the "interesting" ones.

Cheers,
Andrew Bromage

[trimmed recipient list]
The long answer I won't go into in detail, but part of the problem is that being a fptools/libraries developer basically means having a GHC development environment. That requires an investment which I'm personally not able to make at the moment.
Whilst I don't think it'll change your mind, there is a considerably easier route to being a libraries developer: use Hugs.

Hugs is easy to work with: we tend to avoid complex Makefiles, we don't have umpteen different 'way's to build our libraries, recompiling is really fast, etc. Ross Paterson has been working away at making a lot of the new hierarchical libraries work with Hugs; I've added support for the latest ffi spec to Hugs; and Sigbjorn Finne has been plugging gaps in Hugs' library support where Hugs-specific code is required. The result is that it is now quite feasible to use Hugs when working on the hierarchical libraries. [Of course, I'm talking about the CVS copy of Hugs here - but this is little hardship since you'd certainly be working with the CVS copy of the libraries.]

Some caveats:

1) This doesn't take away from Andrew's point about not wanting too much experimentation in the supposedly stable libraries.
2) Of course, Hugs doesn't have anything comparable to GHC's profiling infrastructure and a few other cool GHC support tools.
3) What I say is true for almost any library you want to develop. Any library, that is, except Unicode - Hugs still only supports 8-bit Chars. :-(

--
Alastair Reid        alastair@reid-consulting-uk.ltd.uk
Reid Consulting (UK) Limited        http://www.reid-consulting-uk.ltd.uk/alastair/

Hi Ashley,
Ashley Yakeley
A SourceForge project for the internationalisation effort is active at http://sourceforge.net/projects/haskell-i18n/
I've added my Unicode character properties code. Check it out (cvs co).
Great, thanks! However, since it isn't possible to check out the toplevel directory:

% cvs -z3 -d:pserver:anonymous@cvs.haskell-i18n.sourceforge.net:/cvsroot/haskell-i18n co haskell-i18n
cvs server: cannot find module `haskell-i18n' - ignored
cvs [checkout aborted]: cannot expand modules

(I managed to check out Sources, though), it might be better to follow the usual SF convention and rename "Sources/" to "haskell-i18n/" for consistency. The LICENSE file should then go into "haskell-i18n/".

Cheers,
Jens
participants (9)
-
Alastair Reid
-
Andrew J Bromage
-
Ashley Yakeley
-
Glynn Clements
-
Jens Petersen
-
ketil@ii.uib.no
-
Martin Norbäck
-
Simon Marlow
-
Sven Moritz Hallberg