
Hi,

I would like to suggest a new basic type in Haskell. What if we had something like this (with any other quoting character):

«Je ne parle pas français. Meu nome é Maurício. ¿Hablas español?»

This would be of type Utf8. I think it is not a bad idea now, since Haskell source code is supposed to be UTF-8. The internal representation of this datatype would be a null-terminated UTF-8 byte vector. No standard operations would be defined on that type, i.e., it would be a "communication standard" between everybody, but module writers could build their own basic operations on it using Foreign. (I think it would be difficult to settle on default operations, since there are so many things you can do with UTF-8.)

Pros:

* There would be no doubt that you can use UTF-8 when using this, since there's no conversion involved.
* Cleaner code for UTF-8 operations, maybe. There are many UTF-8 modules today with different goals in mind; I think it would be nice if they could share this common basic type and a common underlying implementation.

Cons:

* Probably many. I have no deep understanding of Haskell.

Thanks for your attention,
Maurício
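As a rough illustration of the proposal above, here is a minimal sketch of an opaque Utf8 type backed by a null-terminated buffer and reachable only through Foreign. It is not part of the original message: the names `utf8FromString`, `withUtf8`, `utf8ToString` and `byteLength` are hypothetical, and the `String` conversions exist only to make the sketch testable. It relies on `GHC.Foreign` from base, so it assumes GHC.

```haskell
import Foreign.C.String (CString)
import Foreign.C.Types (CChar)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree)
import Foreign.Marshal.Array (lengthArray0)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (utf8)

-- Opaque handle to a null-terminated UTF-8 buffer (hypothetical type).
-- A real module would not export the constructor.
newtype Utf8 = Utf8 (ForeignPtr CChar)

-- Encode a String as UTF-8 into freshly malloc'ed memory.
-- Caveat: the input must not contain NUL characters.
utf8FromString :: String -> IO Utf8
utf8FromString s = do
  p <- GHC.newCString utf8 s
  Utf8 <$> newForeignPtr finalizerFree p

-- The only "standard" interface: temporary access to the raw bytes.
withUtf8 :: Utf8 -> (CString -> IO a) -> IO a
withUtf8 (Utf8 fp) = withForeignPtr fp

-- Decode the buffer back into a String.
utf8ToString :: Utf8 -> IO String
utf8ToString u = withUtf8 u (GHC.peekCString utf8)

-- Number of bytes before the terminating NUL.
byteLength :: Utf8 -> IO Int
byteLength u = withUtf8 u (lengthArray0 0)
```

Everything beyond `withUtf8` would live in separate modules, which is the point of the proposal: the core type promises nothing except "these bytes are meant to be UTF-8".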

From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Mauricio
I would like to suggest a new basic type in Haskell. What if we had something like this (with any other quoting character):
«Je ne parle pas français. Meu nome é Maurício. ¿Hablas español?»
This would be of type Utf8. I think now it is not a bad idea, since Haskell source code is supposed to be utf-8. The internal representation of this datatype would be a null terminated utf-8 byte vector. ...
Stream fusion on Haskell Unicode strings - Tom Harper
http://www.wellquite.org/non-blog/AngloHaskell2008/tom%20harper.pdf

I don't know what its status is. The original implementation used UTF-16 rather than UTF-8.

Alistair

I would like to suggest a new basic type in Haskell. What if we had something like this (with any other quoting character):
«Je ne parle pas français. (...) ¿Hablas español?»
This would be of type Utf8. I think now it is not a bad idea, since Haskell source code is supposed to be utf-8. The internal representation of this datatype would be a null terminated utf-8 byte vector. ...
Stream fusion on Haskell Unicode strings - Tom Harper http://www.wellquite.org/non-blog/AngloHaskell2008/tom%20harper.pdf (...)
Actually, what I suggest is quite different, in points I see as worthwhile:

* His focus is on speed and memory; my goal is more elegant and safer code.
* His approach consolidates Prelude. My approach allows complete elimination of Prelude. If we had a Utf8 basic type, we could have modules with many different basic types, and many different ideas on how to 'read «something» :: <sometype>'. In the future, we could write a module implementing some not-yet-invented numeral type, which another module would allow to be read from Chinese characters.
* He wants to preserve many properties of [Char]. I think the Utf8 type should have no standard properties at all. See the next point on why this would avoid some unsafe code.
* He insists on the idea of text as something built over Char. Well, I'm probably alone here, but I think this was nice in its time, and today we could have better approaches. Except for source code, text is a block of information, not a sequence of anything. I explicitly would like a type we could not map over, because mapping does not make sense: text is built from so many things that there's no basic unit we can apply functions to. Even something like "printing a table of all characters and their Unicode numbers" is impossible, since a lot of Unicode is not printable. "Are these blocks of text equal?" does not work like that either, since different sequences of bytes can have the same meaning. If you want some piece of text to obey specific properties, you should have to extract it into a proper type.

Sorry if this is insane for some reason.

Thanks,
Maurício
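The point about equality can be made concrete with a small sketch (not from the original message). The letter "é" can be written as one code point (U+00E9) or as "e" followed by a combining acute accent (U+0301); both render the same, yet they compare as different strings. The function `composeEAcute` is a hypothetical toy that handles only this one pair; real Unicode normalisation is far larger and is not in base.

```haskell
import Data.Char (ord)

-- "é" as a single precomposed code point, U+00E9.
precomposed :: String
precomposed = "\233"

-- "é" as 'e' (U+0065) followed by a combining acute accent (U+0301).
decomposed :: String
decomposed = "e\769"

-- Expose the underlying code points of a string.
codePoints :: String -> [Int]
codePoints = map ord

-- Toy normaliser (illustration only): folds the single pair used above
-- into its precomposed form and leaves everything else untouched.
composeEAcute :: String -> String
composeEAcute ('e' : '\769' : rest) = '\233' : composeEAcute rest
composeEAcute (c : rest)            = c : composeEAcute rest
composeEAcute []                    = []
```

Here `precomposed == decomposed` is `False`, while `composeEAcute decomposed == precomposed` is `True`, so a meaningful equality test has to be supplied by a module that knows which notion of "same text" it wants.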

So this proposal is more than a UTF8 type, since it encompasses a move away from text as lists. What interfaces would we have to text in this proposal? -- _jsn

So this proposal is more than a UTF8 type, since it encompasses a move away from text as lists. What interfaces would we have to text in this proposal?
Normal users would import modules with specific interfaces, like functions or instances. One such module could be Streams, like those suggested in the previous article. Others could offer functionality I don't know of; maybe there's some useful interface for Japanese or Greek users that we (non-Japanese, non-Greek) can't imagine.

My first attempt would be PortugueseText, with a type that could only be built from Portuguese "primitives" or read from Utf8 (with possible errors), and of course converted back to Utf8. That type would always convert to Utf8 with correct diacritics, and sort according to the latest Portuguese orthographic agreements. Mapping over syllables could be allowed, since that makes sense in syllabic languages. Quotes, questions, parentheses etc. could be done with functions like 'quote «Ser ou não ser»'.

Another could be SimpleEnglishTextAsList, which could offer something close to what we have today, with functions for uppercasing, lowercasing and well-behaved (unambiguous) sorting.

Writers of very basic modules would have to touch Utf8 using Foreign. So maybe the only standard interface would be a pointer (ForeignPtr?) to a null-terminated block of memory. This would make Foreign a new Prelude, maybe. In the end, this is just a basic block of memory that we can fill with UTF-8 text, and that serves as a common denominator.

Best,
Maurício
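A hedged sketch of what a PortugueseText-style module might look like, to show the "build with possible errors, then operate safely" shape described above. Nothing here comes from an existing library: `parsePortuguese`, `TextError`, `allowed` and `render` are hypothetical names, the accepted alphabet is a deliberately crude stand-in for real Portuguese rules, and a plain `String` stands in for the decoded contents of a Utf8 value.

```haskell
-- In a real module the constructor would stay private, so every value
-- must pass through parsePortuguese.
newtype PortugueseText = PortugueseText String

-- Position and identity of the first character that was rejected.
data TextError = UnexpectedChar Int Char
  deriving (Eq, Show)

-- Toy alphabet (assumption): ASCII letters, a little punctuation, and the
-- accented letters used in Portuguese, written as decimal escapes.
allowed :: Char -> Bool
allowed c = c `elem` (['a' .. 'z'] ++ ['A' .. 'Z'] ++ " .,;:!?-" ++ accented)
  where
    accented =
      "\225\224\226\227\231\233\234\237\243\244\245\250"
        ++ "\193\192\194\195\199\201\202\205\211\212\213\218"

-- Smart constructor: fails on the first character outside the alphabet.
parsePortuguese :: String -> Either TextError PortugueseText
parsePortuguese s =
  case [(i, c) | (i, c) <- zip [0 ..] s, not (allowed c)] of
    []         -> Right (PortugueseText s)
    (i, c) : _ -> Left (UnexpectedChar i c)

-- Convert back to plain text (standing in for conversion to Utf8).
render :: PortugueseText -> String
render (PortugueseText s) = s

-- Wrap the text in guillemets (U+00AB and U+00BB).
quote :: PortugueseText -> PortugueseText
quote (PortugueseText s) = PortugueseText ("\171" ++ s ++ "\187")
```

For example, `fmap (render . quote) (parsePortuguese "Ser ou n\227o ser")` yields the quoted phrase, while input containing a character outside the toy alphabet yields a `Left` with the offending position.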

Unlike native Strings, this would have the potential for a runtime parse error at every character. -- _jsn
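To make that concern concrete, here is a small sketch (not from the thread) of a structural UTF-8 check over raw bytes. Any lead byte may be followed by missing or malformed continuation bytes, so a consumer of an unchecked buffer can hit an error at any offset. The name `validate` is hypothetical, and the check is deliberately incomplete: it does not reject overlong encodings, surrogates, or code points above U+10FFFF.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Left: offset of the first malformed sequence.
-- Right: number of code points seen.
validate :: [Word8] -> Either Int Int
validate = go 0 0
  where
    go _ n [] = Right n
    go off n (b : bs)
      | b < 0x80           = go (off + 1) n' bs  -- single-byte (ASCII)
      | b .&. 0xE0 == 0xC0 = cont 1              -- lead byte of a 2-byte sequence
      | b .&. 0xF0 == 0xE0 = cont 2              -- lead byte of a 3-byte sequence
      | b .&. 0xF8 == 0xF0 = cont 3              -- lead byte of a 4-byte sequence
      | otherwise          = Left off            -- stray continuation or invalid byte
      where
        n' = n + 1
        -- Require exactly k continuation bytes after the lead byte.
        cont k =
          let (tailBytes, rest) = splitAt k bs
          in if length tailBytes == k && all isCont tailBytes
               then go (off + 1 + k) n' rest
               else Left off
    -- Continuation bytes have the bit pattern 10xxxxxx.
    isCont x = x .&. 0xC0 == 0x80
```

A module could run such a check once when it builds its own text type, so that later operations no longer need to handle decoding failures.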
participants (3): Bayley, Alistair; Jason Dusek; Mauricio