Re: Haskell Platform Proposal: add the 'text' library

From: Roman Leshchinskiy
On 17/10/2010, at 21:25, Bryan O'Sullivan wrote:
On Mon, Oct 11, 2010 at 1:12 PM, Malcolm Wallace wrote:
The breakSubstring functionality is semantically: breakSubstring x = break (==x), although there may be a more efficient implementation. Proposal: rename Text.break to Text.breakSubstring, and Text.breakBy to Text.break.
So far, I've been proceeding on the basis that I'd like naming to be consistent and descriptive, and to have more commonly used functions get shorter names than their less commonly used (but possibly more general) cousins. For instance, breakSubstring is descriptive, and it's consistent with bytestring, but it's much longer than break, even though breaking on a fixed string is more common. In this case, length and frequency of use trump the other considerations in my mind.
FWIW, I take almost exactly the opposite approach with vector. I try to follow the list/array interface as closely as possible even in the presence of more frequently used but subtly different operations. My rationale is that typing a few extra characters is vastly preferable to having to search through the docs to find out what this particular library calls this particular function.
I think there's a significant difference between vector and text, namely a Vector is conceptually the same as a list/1D array, while a Text is not. I think this difference is enough to warrant a break from the list API. John

On 19/10/2010, at 15:22, John Lato wrote:
I think there's a significant difference between vector and text, namely a Vector is conceptually the same as a list/1D array, while a Text is not. I think this difference is enough to warrant a break from the list API.
Are you sure? From its interface Text looks exactly like a list of Chars to me. Roman

On 19 October 2010 22:08, Roman Leshchinskiy wrote:
On 19/10/2010, at 15:22, John Lato wrote:
I think there's a significant difference between vector and text, namely a Vector is conceptually the same as a list/1D array, while a Text is not. I think this difference is enough to warrant a break from the list API.
Are you sure? From its interface Text looks exactly like a list of Chars to me.
Right, that's a very common misunderstanding of Unicode. A Unicode code point (type Char) does not correspond 1:1 with the human notion of a character. It would be nice if it did, but unfortunately it is not something we can ignore. Because of this it is better not to think of operations on individual Chars but on short sequences of Chars. In any case, when processing text (even ASCII where Chars do match characters) many of the most common operations that you want are substring not element based. So this is why a list/vector API is not necessarily appropriate for text. Duncan
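Duncan's point about code points not matching characters can be seen with nothing but base; a minimal sketch (plain String, using Data.Char) contrasting the precomposed and decomposed spellings of "å":

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Two spellings of the same visible text "å":
composed, decomposed :: String
composed   = "\xE5"     -- U+00E5 LATIN SMALL LETTER A WITH RING ABOVE
composed   = composed
decomposed = "a\x030A"  -- U+0061 'a' followed by U+030A COMBINING RING ABOVE

main :: IO ()
main = do
  print (length composed)              -- one code point
  print (length decomposed)            -- two code points, one "character"
  print (composed == decomposed)       -- False, though they render identically
  print (map generalCategory decomposed)  -- the second Char is a NonSpacingMark
```

Both strings print as "å" in a terminal, yet element-wise operations see them as different lists of different lengths, which is exactly why substring-oriented operations matter.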

On Wed, Oct 20, 2010 at 12:35:33AM +0100, Duncan Coutts wrote:
On 19 October 2010 22:08, Roman Leshchinskiy wrote:
On 19/10/2010, at 15:22, John Lato wrote:
I think there's a significant difference between vector and text, namely a Vector is conceptually the same as a list/1D array, while a Text is not. I think this difference is enough to warrant a break from the list API.
Are you sure? From its interface Text looks exactly like a list of Chars to me.
Right, that's a very common misunderstanding of Unicode. A Unicode code point (type Char) does not correspond 1:1 with the human notion of a character. It would be nice if it did, but unfortunately it is not something we can ignore. Because of this it is better not to think of operations on individual Chars but on short sequences of Chars. In any case, when processing text (even ASCII where Chars do match characters) many of the most common operations that you want are substring not element based.
I believe Roman is referring to the Text API, which does indeed look a lot like the list API specialized to Char, with relatively few exceptions. The above would be an argument against including any of the functions with Char parameters, but a high proportion of the API's functions do take Char parameters.

On 10/19/10 7:59 PM, Ross Paterson wrote:
On Wed, Oct 20, 2010 at 12:35:33AM +0100, Duncan Coutts wrote:
On 19 October 2010 22:08, Roman Leshchinskiy wrote:
On 19/10/2010, at 15:22, John Lato wrote:
I think there's a significant difference between vector and text, namely a Vector is conceptually the same as a list/1D array, while a Text is not. I think this difference is enough to warrant a break from the list API.
Are you sure? From its interface Text looks exactly like a list of Chars to me.
Right, that's a very common misunderstanding of Unicode. A Unicode code point (type Char) does not correspond 1:1 with the human notion of a character. It would be nice if it did, but unfortunately it is not something we can ignore. Because of this it is better not to think of operations on individual Chars but on short sequences of Chars. In any case, when processing text (even ASCII where Chars do match characters) many of the most common operations that you want are substring not element based.
I believe Roman is referring to the Text API, which does indeed look a lot like the list API specialized to Char, with relatively few exceptions. The above would be an argument against including any of the functions with Char parameters, but a high proportion of the API's functions do take Char parameters.
<musing>
I almost wonder if it would be worth it to define a new type, Character, which does correspond 1:1 to the human notion of a "character" (being intentionally vague about what exactly that means). Then we could have that Text is a vector/list/sequence of Characters, and give it the appropriate interface for being thought of that way. Of course, under the covers, Character would just be a newtype of Text[1] and so the bulk of the text/text-icu implementation would need no changes.

At least, it seems like that might make it possible for us to get out of this impasse about the text library matching vector/list/sequence APIs when Text is not a vector/list/array of Char. Also, it helps to codify what we mean by "a short sequence of Chars", which could possibly allow for some simplifying assumptions in the algorithms being used (since often there are better (X,X)->Y algorithms available when we know one of the X is much smaller than the other).
</musing>

[1] Using a type alias seems like it'd be too easy to break the API idealization.

-- Live well, ~wren

On 10/19/10 22:36, wren ng thornton wrote:
<musing> I almost wonder if it would be worth it to define a new type, Character, which does correspond 1:1 to the human notion of a "character" (being intentionally vague about what exactly that means). Then we could have that Text is a vector/list/sequence of Characters, and give it the appropriate interface for being thought of that way.
I believe Perl 6 is going this way; while there is a single base type Str and role String, there are three different things it can "mean" (call them subtypes): bytes, Unicode code points, and graphemes (the latter corresponding to the proposed Character). Or possibly only two of those; IIRC it was recently proposed that the byte version be moved to the already existing Buf type/Buffer role intended for binary data, roughly equivalent to ByteString.

If a given string is accessed as code points, it can't then be treated as graphemes unless re-assigned to, and vice versa, but assigning it to another Str allows that Str to be accessed as graphemes instead. (I think. The Perl 6 spec is still a moving target, as evidenced by the thing about byte access; it's entirely possible that this changed again and I missed it. But there was definitely thought put into the distinction between bytes, code points, and graphemes.)

Brandon S Allbery KF8NH

On 20/10/2010, at 00:35, Duncan Coutts wrote:
On 19 October 2010 22:08, Roman Leshchinskiy wrote:
On 19/10/2010, at 15:22, John Lato wrote:
I think there's a significant difference between vector and text, namely a Vector is conceptually the same as a list/1D array, while a Text is not. I think this difference is enough to warrant a break from the list API.
Are you sure? From its interface Text looks exactly like a list of Chars to me.
Right, that's a very common misunderstanding of Unicode. A Unicode code point (type Char) does not correspond 1:1 with the human notion of a character.
True. But I didn't say it's a list of characters. It's a list of Chars, i.e., of Unicode code points.
It would be nice if it did, but unfortunately it is not something we can ignore. Because of this it is better not to think of operations on individual Chars but on short sequences of Chars. In any case, when processing text (even ASCII where Chars do match characters) many of the most common operations that you want are substring not element based.
So this is why a list/vector API is not necessarily appropriate for text.
But text *does* have a list API. It doesn't get any "listier" than uncons, foldr and unfoldr. From my point of view, if these functions make sense for the data type then it's a list. If it's not a list then these functions don't make sense and shouldn't be provided. Roman
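Roman's observation can be checked against the library itself; a minimal sketch, assuming the text package is available, exercising exactly those "listy" functions:

```haskell
import qualified Data.Text as T

-- Build "abc" with unfoldr, mirroring Data.List.unfoldr's shape.
step :: Char -> Maybe (Char, Char)
step c
  | c > 'c'   = Nothing
  | otherwise = Just (c, succ c)

main :: IO ()
main = do
  let t = T.pack "abc"
  print (T.uncons t)               -- Just ('a',"bc")
  print (T.foldr (:) [] t)         -- "abc"
  print (T.unfoldr step 'a' == t)  -- True
```

All three are exported by Data.Text with the same shapes as their Data.List counterparts (modulo Text in place of [Char]), which is the heart of Roman's argument.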

On Wed, 2010-10-20 at 07:52 +0100, Roman Leshchinskiy wrote:
True. But I didn't say it's a list of characters. It's a list of Chars, i.e., of Unicode code points.
But text *does* have a list API. It doesn't get any "listier" than uncons, foldr and unfoldr. From my point of view, if these functions make sense for the data type then it's a list. If it's not a list then these functions don't make sense and shouldn't be provided.
So yes it is a list of code points, there's no getting away from that. But that does not mean that the operations you want on it are the same as you want on say a list of records, where each element is independent. It's more like a sequence of DNA where you very often want to consider short sequences of code points, rather than individual ones. Duncan

On October 19, 2010 19:35:33 Duncan Coutts wrote:
Right, that's a very common misunderstanding of Unicode. A Unicode code point (type Char) does not correspond 1:1 with the human notion of a character. It would be nice if it did, but unfortunately it is not something we can ignore. Because of this it is better not to think of operations on individual Chars but on short sequences of Chars. In any case, when processing text (even ASCII where Chars do match characters) many of the most common operations that you want are substring not element based.
I read the Wikipedia article on code points, but still do not feel I have a firm grasp of what exactly you are referring to. If you have a few minutes, would you mind providing a short example to clarify this (e.g., a specific code point that gives issues with a 1:1 model, and what those issues are)? Thanks! -Tyson

On Wed, Oct 20, 2010 at 5:11 PM, Tyson Whitehead wrote:
I read the wikipedia article on code points, but still do not feel I have a firm grasp as to what exactly you are referring to.
If you have a few minutes, would you mind providing a short example to clarify this with a specific example (e.g., a specific code point that gives issues with a 1:1 model and what those issues are).
Have a look at combining characters: http://en.wikipedia.org/wiki/Combining_character For example, a Danish user would consider the single Unicode code point A-RING the same as the two code points A + COMBINING RING. If you process a string code point by code point you might mistakenly confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING). Johan
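A small illustration of Johan's point, sketched with plain String and Data.List.isInfixOf rather than the text API (the Danish word is just an assumed example):

```haskell
import Data.List (isInfixOf)

-- "blåt" spelled with the decomposed form:
-- 'a' followed by U+030A COMBINING RING ABOVE.
danish :: String
danish = "bl" ++ "a\x030A" ++ "t"

main :: IO ()
main = do
  -- A naive code-point search finds "a" inside "å": a false positive.
  print ("a" `isInfixOf` danish)     -- True
  -- The precomposed "å" (U+00E5) is *not* found, even though a reader
  -- would say the text clearly contains an "å".
  print ("\xE5" `isInfixOf` danish)  -- False
```

The same mismatch arises with Text's substring functions when the haystack and needle use different canonical spellings.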

On Wed, Oct 20, 2010 at 05:28:15PM +0200, Johan Tibell wrote:
If you process a string code point by code point you might mistakenly confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING).
But isn't that what text does here?

    Data.Text Data.Text.IO> let t = pack "z\x0061\x030A\x0061z"
    Data.Text Data.Text.IO> putStrLn t
    zåaz
    Data.Text Data.Text.IO> putStrLn (replace (pack "a") (pack "y") t)
    zẙyz

Thanks Ian

On Wed, Oct 20, 2010 at 5:59 PM, Ian Lynagh wrote:
On Wed, Oct 20, 2010 at 05:28:15PM +0200, Johan Tibell wrote:
If you process a string code point by code point you might mistakenly confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING).
But isn't that what text does here?:
    Data.Text Data.Text.IO> let t = pack "z\x0061\x030A\x0061z"
    Data.Text Data.Text.IO> putStrLn t
    zåaz
    Data.Text Data.Text.IO> putStrLn (replace (pack "a") (pack "y") t)
    zẙyz
I think the right thing to do here is to perform normalization first but I'm not sure. Some Unicode expert should chime in about what the right semantics of 'replace' is. Johan

On Wed, Oct 20, 2010 at 9:52 AM, Johan Tibell wrote:
I think the right thing to do here is to perform normalization first but I'm not sure.
Hi, friendly neighbourhood Unicode expert here. Yes, in the case Ian cites, you want to perform normalization before doing the replacement. The behaviour he demonstrates is normal, expected, and consistent with the standard. For more details, see http://unicode.org/reports/tr15

On Wed, Oct 20, 2010 at 09:57:04AM -0700, Bryan O'Sullivan wrote:
On Wed, Oct 20, 2010 at 9:52 AM, Johan Tibell wrote:
I think the right thing to do here is to perform normalization first but I'm not sure.
Hi, friendly neighbourhood Unicode expert here. Yes, in the case Ian cites, you want to perform normalization before doing the replacement. The behaviour he demonstrates is normal, expected, and consistent with the standard.
OK, so that works with the previous example:

    Data.Text Data.Text.IO Data.Text.ICU> let t = pack "z\x0061\x030A\x0061z"
    Data.Text Data.Text.IO Data.Text.ICU> t
    "za\778az"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn t
    zåaz
    Data.Text Data.Text.IO Data.Text.ICU> normalize NFC t
    "z\229az"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (normalize NFC t)
    zåaz
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (replace (pack "a") (pack "y") (normalize NFC t))
    zåyz

but only because now characters and codepoints are 1:1. If we were using a character for which there is no composed code point, e.g. the (probably non-existent, but I understand there are real examples) p-ring:

    Data.Text Data.Text.IO Data.Text.ICU> let t = pack "zp\x030Apz"
    Data.Text Data.Text.IO Data.Text.ICU> t
    "zp\778pz"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn t
    zp̊pz
    Data.Text Data.Text.IO Data.Text.ICU> normalize NFC t
    "zp\778pz"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (normalize NFC t)
    zp̊pz
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (replace (pack "p") (pack "y") (normalize NFC t))
    zẙyz

then it doesn't work. Johan wrote:
If you process a string code point by code point you might mistakenly confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING).
But when characters and codepoints are 1:1, you /can/ process code point by code point. Am I missing something? Thanks Ian

On Oct 20, 2010, at 19:44, Ian Lynagh wrote:
Johan wrote:
If you process a string code point by code point you might mistakenly confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING).
But when characters and codepoints are 1:1, you /can/ process code point by code point.
Am I missing something?
AFAIK there are scripts that have so many combinations that Unicode does not have a single code point for each character. In Arabic you can have one of 5 vowel signs on each of the 28 letters, but Unicode does not provide 5*28 code points for the combinations. That is probably the reason for having these combining characters.

Mac OS tries to decompose characters into as many code points as possible whereas Windows tries to compose them as much as possible. I don't think there is a good semantics for replace without knowing what (normal) form you're working on. Normally, search/replace and sorting on Unicode are specialized algorithms that cannot be reduced to simple substitutions or permutations. So I suggest to just provide functions on code points and let the user struggle with the rest.

Cheers, Axel

AFAIK there are scripts that have so many combinations that Unicode does not have a single code point for each character. In Arabic you can have one of 5 vowel signs on each of the 28 letters, but Unicode does not provide 5*28 code points for the combinations. That is probably the reason for having these combining characters.
I dunno, they didn't seem to have much trouble jamming 11k or so hangul letters into the standard, as well as the smaller combining forms of them. Not sure about Arabic though.

On October 20, 2010 15:45:44 Axel Simon wrote:
AFAIK there are scripts that have so many combinations that Unicode does not have a single code point for each character. In Arabic you can have one of 5 vowel signs on each of the 28 letters, but Unicode does not provide 5*28 code points for the combinations. That is probably the reason for having these combining characters.
Mac OS tries to decompose characters into as many code points as possible whereas Windows tries to compose them as much as possible. I don't think there is a good semantics for replace without knowing what (normal) form you're working on. Normally, search/replace and sorting on Unicode are specialized algorithms that cannot be reduced to simple substitutions or permutations.
Thanks to everyone for the examples.

Given that not all combined characters can be reduced to a single code point (from your first paragraph), it would seem that MacOS normalization has a conceptual advantage over Windows normalization. Specifically, it is appealing that the normalized string is in some sense less complex in that it only contains elementary code points (ones that can't be further decomposed) and compositions. The other would still contain a mix.

Am I correct then in understanding that, from the view of strings as a vector/list of elementary chars, the elementary chars would actually have to be a code point plus an arbitrary number of additional composition code points in order to correspond well to the human notion of a character? This then doesn't map well onto the existing vector/list style interfaces because this elementary char type is not a simple enumeration to be treated atomically. Operations would actually need to frequently look inside it (e.g., replace base code points irrespective of the compositional code points).

Cheers! -Tyson

"Tyson" == Tyson Whitehead
writes:
Tyson> Given that not all combined characters can be reduced to a Tyson> single code point (from your first paragraph), it would seem Tyson> that MacOS normalization has a conceptual advantage over Tyson> Windows normalization. I don't think it is O/S specific. There are different normalization schemes available to all O/Ses. The four most-standard ones are NFC, NFD, NFKC and NFKD. Here C standards for compmosition, D stands for Decomposition, and K stands for Kompatibility. Different applications have different needs. -- Colin Adams Preston Lancashire () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments
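To make the composed-vs-decomposed distinction concrete, here is a toy sketch covering only the single pair å / a + COMBINING RING ABOVE; the table is purely illustrative, and real code should use a full implementation such as text-icu's normalize:

```haskell
-- Toy canonical decomposition: only knows about 'å'. Illustrative only.
decomposeChar :: Char -> String
decomposeChar '\xE5' = "a\x030A"  -- å -> 'a' + U+030A COMBINING RING ABOVE
decomposeChar c      = [c]

-- NFD-style normalization over this toy table.
toNFD :: String -> String
toNFD = concatMap decomposeChar

-- NFC-style recomposition, again only for this one pair.
toNFC :: String -> String
toNFC ('a' : '\x030A' : rest) = '\xE5' : toNFC rest
toNFC (c : rest)              = c : toNFC rest
toNFC []                      = []

main :: IO ()
main = do
  print (toNFD "bl\xE5t")                        -- decomposed spelling
  print (toNFC "bla\x030At")                     -- recomposed spelling
  print (toNFC (toNFD "bl\xE5t") == "bl\xE5t")   -- True: round-trips
```

The real algorithms additionally handle canonical ordering of multiple combining marks and the compatibility (K) mappings, which is why they belong in a dedicated library.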

On Wed, 2010-10-20 at 11:11 -0400, Tyson Whitehead wrote:
On October 19, 2010 19:35:33 Duncan Coutts wrote:
Right, that's a very common misunderstanding of Unicode. A Unicode code point (type Char) does not correspond 1:1 with the human notion of a character. It would be nice if it did, but unfortunately it is not something we can ignore. Because of this it is better not to think of operations on individual Chars but on short sequences of Chars. In any case, when processing text (even ASCII where Chars do match characters) many of the most common operations that you want are substring not element based.
I read the wikipedia article on code points, but still do not feel I have a firm grasp as to what exactly you are referring to.
If you have a few minutes, would you mind providing a short example to clarify this with a specific example (e.g., a specific code point that gives issues with a 1:1 model and what those issues are).
Combining characters are the major one. These are things like accents, but there are many more of them in some other languages. For most of the European languages there are both all-in-one code points that combine the base character with the extra mark (because those already existed in previous character sets), but for many other languages the canonical form is made up of multiple code points (and not necessarily just 2). So if you're searching for a particular "character" then searching for a single Char is not sufficient, you need to search for a short sequence of Chars. Duncan

Duncan wrote about Unicode searching:
So if you're searching for a particular "character" then searching for a single Char is not sufficient, you need to search for a short sequence of Chars.
Or even, if you haven't normalised the text you are searching in, you need to search for a small set of short sequences of Char. Wolfram
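Wolfram's "small set of short sequences" can be sketched directly, here with plain String and isInfixOf for simplicity (the same idea applies to Text's substring search):

```haskell
import Data.List (isInfixOf)

-- Both canonical spellings of "å": precomposed and decomposed.
aRingForms :: [String]
aRingForms = ["\xE5", "a\x030A"]

-- Searching un-normalized text means trying every spelling.
containsARing :: String -> Bool
containsARing hay = any (`isInfixOf` hay) aRingForms

main :: IO ()
main = do
  print (containsARing "bl\xE5t")     -- True  (precomposed haystack)
  print (containsARing "bla\x030At")  -- True  (decomposed haystack)
  print (containsARing "blat")        -- False (no ring at all)
```

Normalizing both needle and haystack first collapses the set back to a single sequence, which is why normalization and search go hand in hand.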
participants (14)
- Axel Simon
- Brandon S Allbery KF8NH
- Bryan O'Sullivan
- Colin Paul Adams
- Daniel Peebles
- Duncan Coutts
- Ian Lynagh
- Johan Tibell
- John Lato
- kahl@cas.mcmaster.ca
- Roman Leshchinskiy
- Ross Paterson
- Tyson Whitehead
- wren ng thornton