
From Greg Weber, *Fri Mar 23 14:24:24 CET 2012:*
This proposal doesn't have to break any codebases. One possibility is to add the Text type to the standard while keeping String and marking it as deprecated.
I for one would like to go on the record as being against any notion of "deprecating" String. Text is good for many use-cases, but it has very different* asymptotic* behavior for many operations, and cannot serve as a one-size-fits-all replacement for String. I am less strongly against adding Text to the standard, but mostly because I am leery in that it brings a lot of language specification baggage and risks making attempts to explore the design space around Text harder. (e.g. Jasper's GSoC project this last year to investigate using UTF-8 encoded Data.Text would have been a much harder sell if Text was codified in the language standard!) Text has a remarkably (necessarily) large API and for the first time, as far as I can see, the specification would be largely requiring users to use qualified imports to access everything inside of it, which complicates the language standard from a pedagogical standpoint. Isn't it enough that it is part of the platform? -Edward

On Fri, Mar 23, 2012 at 13:05, Edward Kmett
Isn't it enough that it is part of the platform?
As long as the entire Prelude and large chunks of the bootlibs are based around String, String will be preferred. String as a boxed singly-linked list type is therefore a major problem. -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

Like I said, my objection to including Text is a lot less strong than my feelings on any notion of deprecating String.
However, I still see a potentially huge downside from an pedagogical perspective to pushing Text, especially into a place where it will be front and center to new users. String lets the user learn about induction, and encourages a "Haskelly" programming style, where you aren't mucking about with indices and Builders everywhere, which is frankly very difficult to use when building Text. If you cons or append to build up a Text fragment, frankly you're doing it wrong.
The pedagogical concern is quite real, remember many introductory lanuage classes have time to present Haskell and the list data type and not much else. Showing parsing through pattern matching on strings makes a very powerful tool, its harder to show that with Text.
But even when taking apart Text, the choice of UTF16 internally makes it pretty much a worst case for many string manipulation purposes. (e.g. slicing has to spend linear time scanning the string) due to the existence of codepoints outside of plane 0.
The major benefits of Text come from FFI opportunities, but even there if you dig into its internals it has to copy out of the array to talk to foreign functions because it lives in unpinned memory unlike ByteString.
The workarounds for these limitations all require access to the internals, so a Text proposed in an implementation-agnostic manner is less than useful, and one supplied with a rigid set of implementation choices seems to fossilize the current design.
All of these things make me lean towards a position that it is premature to push Text as the one true text representation.
That I am very sympathetic to the position that the standard should ensure that there are Text equivalents for all of the exposed string operations, like read, show, etc, and the various IO primitives, so that a user who is savvy to all of these concerns has everything he needs to make his code perform well.
Sent from my iPad
On Mar 23, 2012, at 1:32 PM, Brandon Allbery
On Fri, Mar 23, 2012 at 13:05, Edward Kmett
wrote: Isn't it enough that it is part of the platform? As long as the entire Prelude and large chunks of the bootlibs are based around String, String will be preferred. String as a boxed singly-linked list type is therefore a major problem.
-- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

OK, so I think we should separate the parts of the proposal a bit.
- Remove type String = [Char]
- Make String an abstract type (it could be named Text to encourage
users to think about whether they are operating on a representation
of text or on a sequence of bytes).
- Specify operations on such an abstract String/Text type.
Personally, I think the standard shouldn't specify too many operations
over such a type to not limit implementors' freedom too much.
- Integrate the rest of the standard library with this new abstract
type. This, I think, is actually the hardest part.
I think most here agree that the main advantage of the current
definition is only pedagogical. Even then Strings are often built in
a very inefficient way by using ++ instead of ShowS + function
composition (which actually is a builder on its own). I don't really
think that having an abstract type is such a big problem for teaching.
You can do string processing by doing (pack . myfunction . unpack)
which is fine for this purpose. Once students are comfortable with
using higher-order functions, you can tell them to use the more
optimised Text-specific combinators. Builders are also a very nice
application of monoids.
The larger problem for the Prelude would be that you can no longer use
the list functions on String/Text. This mainly leads to an issue with
naming things (e.g., length for lists and length for strings).
Similarly, file functions like readFile probably shouldn't return Text
but ByteStrings. But that would mean making ByteString part of the
Prelude as well. So I'm not too sure on these particular issues.
On 23 March 2012 19:30, Edward Kmett
Like I said, my objection to including Text is a lot less strong than my feelings on any notion of deprecating String.
However, I still see a potentially huge downside from an pedagogical perspective to pushing Text, especially into a place where it will be front and center to new users. String lets the user learn about induction, and encourages a "Haskelly" programming style, where you aren't mucking about with indices and Builders everywhere, which is frankly very difficult to use when building Text. If you cons or append to build up a Text fragment, frankly you're doing it wrong.
The pedagogical concern is quite real, remember many introductory lanuage classes have time to present Haskell and the list data type and not much else. Showing parsing through pattern matching on strings makes a very powerful tool, its harder to show that with Text.
But even when taking apart Text, the choice of UTF16 internally makes it pretty much a worst case for many string manipulation purposes. (e.g. slicing has to spend linear time scanning the string) due to the existence of codepoints outside of plane 0.
The major benefits of Text come from FFI opportunities, but even there if you dig into its internals it has to copy out of the array to talk to foreign functions because it lives in unpinned memory unlike ByteString.
The workarounds for these limitations all require access to the internals, so a Text proposed in an implementation-agnostic manner is less than useful, and one supplied with a rigid set of implementation choices seems to fossilize the current design.
All of these things make me lean towards a position that it is premature to push Text as the one true text representation.
That I am very sympathetic to the position that the standard should ensure that there are Text equivalents for all of the exposed string operations, like read, show, etc, and the various IO primitives, so that a user who is savvy to all of these concerns has everything he needs to make his code perform well.
Sent from my iPad
On Mar 23, 2012, at 1:32 PM, Brandon Allbery
wrote: On Fri, Mar 23, 2012 at 13:05, Edward Kmett
wrote: Isn't it enough that it is part of the platform?
As long as the entire Prelude and large chunks of the bootlibs are based around String, String will be preferred. String as a boxed singly-linked list type is therefore a major problem.
-- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime
-- Push the envelope. Watch it bend.

Note that this might be a good time to consider re-factoring the list
operations, for example, making ++ operate on monoids instead of just
lists. I think the 'naming issue' that you mention highlights the need for
better use of type classes in the prelude.
On Fri, Mar 23, 2012 at 1:03 PM, Thomas Schilling
OK, so I think we should separate the parts of the proposal a bit.
- Remove type String = [Char]
- Make String an abstract type (it could be named Text to encourage users to think about whether they are operating on a representation of text or on a sequence of bytes).
- Specify operations on such an abstract String/Text type. Personally, I think the standard shouldn't specify too many operations over such a type to not limit implementors' freedom too much.
- Integrate the rest of the standard library with this new abstract type. This, I think, is actually the hardest part.
I think most here agree that the main advantage of the current definition is only pedagogical. Even then Strings are often built in a very inefficient way by using ++ instead of ShowS + function composition (which actually is a builder on its own). I don't really think that having an abstract type is such a big problem for teaching. You can do string processing by doing (pack . myfunction . unpack) which is fine for this purpose. Once students are comfortable with using higher-order functions, you can tell them to use the more optimised Text-specific combinators. Builders are also a very nice application of monoids.
The larger problem for the Prelude would be that you can no longer use the list functions on String/Text. This mainly leads to an issue with naming things (e.g., length for lists and length for strings). Similarly, file functions like readFile probably shouldn't return Text but ByteStrings. But that would mean making ByteString part of the Prelude as well. So I'm not too sure on these particular issues.
On 23 March 2012 19:30, Edward Kmett
wrote: Like I said, my objection to including Text is a lot less strong than my feelings on any notion of deprecating String.
However, I still see a potentially huge downside from an pedagogical perspective to pushing Text, especially into a place where it will be front and center to new users. String lets the user learn about induction, and encourages a "Haskelly" programming style, where you aren't mucking about with indices and Builders everywhere, which is frankly very difficult to use when building Text. If you cons or append to build up a Text fragment, frankly you're doing it wrong.
The pedagogical concern is quite real, remember many introductory lanuage classes have time to present Haskell and the list data type and not much else. Showing parsing through pattern matching on strings makes a very powerful tool, its harder to show that with Text.
But even when taking apart Text, the choice of UTF16 internally makes it pretty much a worst case for many string manipulation purposes. (e.g. slicing has to spend linear time scanning the string) due to the existence of codepoints outside of plane 0.
The major benefits of Text come from FFI opportunities, but even there if you dig into its internals it has to copy out of the array to talk to foreign functions because it lives in unpinned memory unlike ByteString.
The workarounds for these limitations all require access to the internals, so a Text proposed in an implementation-agnostic manner is less than useful, and one supplied with a rigid set of implementation choices seems to fossilize the current design.
All of these things make me lean towards a position that it is premature to push Text as the one true text representation.
That I am very sympathetic to the position that the standard should ensure that there are Text equivalents for all of the exposed string operations, like read, show, etc, and the various IO primitives, so that a user who is savvy to all of these concerns has everything he needs to make his code perform well.
Sent from my iPad
On Mar 23, 2012, at 1:32 PM, Brandon Allbery
wrote: On Fri, Mar 23, 2012 at 13:05, Edward Kmett
wrote: Isn't it enough that it is part of the platform?
As long as the entire Prelude and large chunks of the bootlibs are based around String, String will be preferred. String as a boxed singly-linked list type is therefore a major problem.
-- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364vm/sms
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime
-- Push the envelope. Watch it bend.
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime

On Fri, Mar 23, 2012 at 16:21, Nate Soares
I think the 'naming issue' that you mention highlights the need for better use of type classes in the prelude.
...which is a rat's nest best avoided, unfortunately, unless the idea is to stifle it entirely. (How long have people been proposing alternatives with no net effect?) -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Fri, Mar 23, 2012 at 4:21 PM, Nate Soares
Note that this might be a good time to consider re-factoring the list operations, for example, making ++ operate on monoids instead of just lists.
Note: we have (<>) for Monoid, which was deliberately chosen rather than generalizing (++) because that had already changed meaning from applying to MonadPlus to the more restricted type during what I tend to refer to as the "great momomorphization revolution of 1998", and not every MonadPlus is compatible with the corresponding monoid (Maybe comes to mind). This also entails moving Data.Monoid into the standard.
I think the 'naming issue' that you mention highlights the need for better use of type classes in the prelude.
The major issue with typeclasses for things like special-purpose containers is that they almost inevitably wind up requiring multiparameter type classes with functional dependencies, or they need type families. This then runs afoul of the fact that since neither one is better than the other for all usecases, neither one seems to be drifting any closer to standardization. -Edward

Hi all, Thomas Schilling wrote:
I think most here agree that the main advantage of the current definition is only pedagogical.
But that in itself is not a small deal. In fact, it's a pretty major advantage. Moreover, the utter simplicity of String = [Char] is a benefit in its own right. Let's not forget that this, in practice, across all Haskell applications, works just fine in the vast majority of cases. I get the sense that the proponents for deprecating, and ultimately get rid of, String = [Char], are suggesting that this would lead to noticeable performance improvements across the board by virtue of preventing programmers from accidentally making a poor choice of data structure for representing string. But I conjecture that the performance impact of switching form e.g. String to Text at the level of complete applications would be negligible in most cases, simply because most Haskell applications are not dominated by heavy-duty string processing. And those that are, probably already uses something like Text, and were written be people who know a thing or two about appropriate choice of data structures anyway. As to teaching:
I don't really think that having an abstract type is such a big problem for teaching. You can do string processing by doing (pack . myfunction . unpack)
Here at Nottingham, we're teaching all our 1st-year undergraduates Haskell. It works, but it is a challenge, and, alas, far from everyone "gets" it. And this is despite the module being taught by one of the leading and most experienced Haskell educators (and text book author), Graham Hutton. Without starting an endless discussion about how to best teach programming languages in general and Haskell in particular to (near) beginners, I dare say that idioms like the one suggested above would do nothing to help. String != [Char] would break no end of code, text books, tutorials, lecture slides, would not help with teaching Haskell, all for very little if any benefit in the grand scheme of things. So let's not go there. On the other hand, a standardised, well thought-out, API for high-performance strings and appropriate mechanisms such as a measure of overloading to make it easy and palatable to use, and that work alongside the present String = [Char], would be a good thing. All the best, /Henrik -- Henrik Nilsson School of Computer Science The University of Nottingham nhn@cs.nott.ac.uk

On 24 March 2012 12:53, Henrik Nilsson
Hi all,
Thomas Schilling wrote:
I think most here agree that the main advantage of the current definition is only pedagogical.
But that in itself is not a small deal. In fact, it's a pretty major advantage.
Moreover, the utter simplicity of String = [Char] is a benefit in its own right. Let's not forget that this, in practice, across all Haskell applications, works just fine in the vast majority of cases.
I get the sense that the proponents for deprecating, and ultimately get rid of, String = [Char], are suggesting that this would lead to noticeable performance improvements across the board by virtue of preventing programmers from accidentally making a poor choice of data structure for representing string. But I conjecture that the performance impact of switching form e.g. String to Text at the level of complete applications would be negligible in most cases, simply because most Haskell applications are not dominated by heavy-duty string processing. And those that are, probably already uses something like Text, and were written be people who know a thing or two about appropriate choice of data structures anyway.
As to teaching:
I don't really think that having an abstract type is such a big problem for teaching. You can do string processing by doing (pack . myfunction . unpack)
Here at Nottingham, we're teaching all our 1st-year undergraduates Haskell. It works, but it is a challenge, and, alas, far from everyone "gets" it. And this is despite the module being taught by one of the leading and most experienced Haskell educators (and text book author), Graham Hutton.
Without starting an endless discussion about how to best teach programming languages in general and Haskell in particular to (near) beginners, I dare say that idioms like the one suggested above would do nothing to help.
String != [Char] would break no end of code, text books, tutorials, lecture slides, would not help with teaching Haskell, all for very little if any benefit in the grand scheme of things.
OK, I agree that breaking text books is a big deal. On the other hand, the lack of a good Text data type forced text books to teach bad approaches to dealing with strings. Haskell should do better. Johan mentioned both semantic and performance problems with Strings. A part he didn't stress is that Strings are also a horribly memory-inefficient way of storing strings. On 64 bit GHC systems a single ASCII character needs 16 bytes of memory (i.e., an overhead of 16x). A non-ASCII character (ord c > 255) actually requires 32 bytes. (This is due to a de-duplication optimisation in the GHC GC). Other implementations may do better, but an abstract type would still be better to enable more freedom for implementors. Correct handling of unicode strings is a Hard Problem and String = [Char] is only better if you ignore all the issues (which is certainly fine a teaching environment). I would be happy to have a simplistic String = [Char] coexist with a Text type if it weren't for the problem that so many things are biased towards String. E.g., error takes a String, Show is used everywhere and produces strings, the pretty printing library uses Strings, Read parses Strings.
On the other hand, a standardised, well thought-out, API for high-performance strings and appropriate mechanisms such as a measure of overloading to make it easy and palatable to use, and that work alongside the present String = [Char], would be a good thing.
As I said, while I'm not a huge fan of having two String types co-exist, I could accept it as a necessary trade-off to keep text books valid and preserve backwards compatibility. (There are also other issues with String. For example, you can't write an instance MyClass String in Haskell2010, and even with GHC extensions it seems wrong and you often end up writing instances that overlap with MyClass [a].) I'm using Data.Text a lot, so I can work around the issue, but unfortunately you run into a lot of issues where the standard library forces the use of String, and that, I believe, is wrong. If changing the standard library is the bigger issue, however, then I'm not sure whether this discussion needs to take place on the haskell-prime list or on the libraries list. / Thomas

Hi all, Thomas Schilling wrote:
OK, I agree that breaking text books is a big deal. On the other hand, the lack of a good Text data type forced text books to teach bad approaches to dealing with strings. Haskell should do better.
As far as I know, none of the introductory Haskell text books has the ambition of teaching serious text processing in Haskell. And what they do for simple text processing for purpose of illustration is no worse than what one typically would do in, say, an introduction to programming using any language, like C or Java. So I don't buy that argument per se. But I do agree, of course, that a good library for text processing, and with adequate language support for making it convenient to use, is important.
Johan mentioned both semantic and performance problems with Strings. A part he didn't stress is that Strings are also a horribly memory-inefficient way of storing strings. On 64 bit GHC systems a single ASCII character needs 16 bytes of memory (i.e., an overhead of 16x). A non-ASCII character (ord c > 255) actually requires 32 bytes. (This is due to a de-duplication optimisation in the GHC GC). Other implementations may do better, but an abstract type would still be better to enable more freedom for implementors.
Sure it's inefficient. I doubt the above is news to anyone on this list. The point, though, is that once we're at the level of applications, in most cases, this inefficiency is negligible. And in the cases where it is not, the programmer will be well aware of this and pick a better representation, or will learn about it the hard way and be forced to pick a better representation. Just as with processing of significant amounts of *any* data. It simply isn't the case that the Haskell world magically would be significantly better of in terms of performance of only everyone was forced to use something like Text instead of String = [Char]. Moreover, the above analysis is unnecessarily pessimistic for one (somewhat important case: string literals. Thanks to Haskell being lazy, it is very easy if one really worry (for an implementor) to arrange that string literals are stored very compactly in a binary, only to be materialized when (and if) actually used. (I did just that years ago in the Freja compiler: memory was significantly smaller in those days, so I did worry :-)
Correct handling of unicode strings is a Hard Problem and String = [Char] is only better if you ignore all the issues (which is certainly fine a teaching environment).
Yes. Unicode is unfortunately (partly but not exclusively out of necessity), very complicated. I doubt one would want to discuss this in depth in any introductory programming course. My point was that String = [Char] is fine as far as it goes. Not that it should be the basis for serious string processing libraries.
I would be happy to have a simplistic String = [Char] coexist with a Text type if it weren't for the problem that so many things are biased towards String. E.g., error takes a String,
Yes. That's a bias. But is it a problem? Here we're just talking about getting a sequence of (possibly unicode) characters to stderr.
Show is used everywhere and produces strings,
Show and Read are mainly used for simplistic serialisation and deserialisation. When ppl really care, they tend to use more refined approaches, e.g. proper scanners and parsers, or binary I/O. So again, while there is certainly a bias, it doesn't seem like a genuine problem in most cases. I can possibly see issues for conversion from and to e.g. built-in numeric types and various string representations, but I can't see why solving those would necessitate getting rid of String = [Char]. Read and Show could be overloaded on the string type, for example (at least given multi-parameter type classes), and/or a bit of compiler optimization ought to be enough to dispatch such uses of "read" and "show" to appropriate primitives of e.g. the Text library anyway.
the pretty printing library uses Strings,
But that is a library issue, not a language issue.
Read parses Strings.
See above. The special status of Read and Show is questionable anyway. Will hopefully be possible at some point to implement those completely as libraries. So I'm not not overly swayed by the argument of language bias in those cases.
As I said, while I'm not a huge fan of having two String types co-exist, I could accept it as a necessary trade-off to keep text books valid and preserve backwards compatibility.
While an undue proliferation of string types would be unfortunate, compared with the plethora of other representational choices one is faced with when it comes to e.g. numeric types, arrays, maps, etc., a couple of string types doesn't seem like a big deal, especially not if one is designated the default choice for any program that will do non-trivial text processing or aims at doing internationalisation properly.
(There are also other issues with String. For example, you can't write an instance MyClass String in Haskell2010, and even with GHC extensions it seems wrong and you often end up writing instances that overlap with MyClass [a].) I'm using Data.Text a lot, so I can work around the issue, but unfortunately you run into a lot of issues where the standard library forces the use of String, and that, I believe, is wrong.
If changing the standard library is the bigger issue, however, then I'm not sure whether this discussion needs to take place on the haskell-prime list or on the libraries list.
Indeed. Maybe all that's really needed at the language level is to standardize overloading of string literals? (In a way that avoids issues like the ones described above.) Best, /Henrik -- Henrik Nilsson School of Computer Science The University of Nottingham nhn@cs.nott.ac.uk

Edward Kmett wrote:
Like I said, my objection to including Text is a lot less strong than my feelings on any notion of deprecating String.
[..]
The pedagogical concern is quite real, remember many introductory lanuage classes have time to present Haskell and the list data type and not much else. Showing parsing through pattern matching on strings makes a very powerful tool, its harder to show that with Text.
[..]
The major benefits of Text come from FFI opportunities, but even there if you dig into its internals it has to copy out of the array to talk to foreign functions because it lives in unpinned memory unlike ByteString.
I agree with Edward Kmett on the virtues of String = [Char] for learning Haskell. I'm teaching beginners regularly and it is simply eye-opening for them that they can use the familiar list operations to solve real world problems which usually involve textual data. Which brings me to the fundamental question behind this proposal: Why do we need Text at all? What are its virtues and how do they compare? What is the trade-off? (I'm not familiar enough with the Text library to answer these.) To put it very pointedly: is a %20 performance increase on the current generation of computers worth the cost in terms of ease-of-use, when the performance can equally be gained by buying a faster computer or more RAM? I'm not sure whether I even agree with this statement, but this is the trade-off we are deciding on. Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Hi all,
On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
Which brings me to the fundamental question behind this proposal: Why do we need Text at all? What are its virtues and how do they compare? What is the trade-off? (I'm not familiar enough with the Text library to answer these.)
To put it very pointedly: is a %20 performance increase on the current generation of computers worth the cost in terms of ease-of-use, when the performance can equally be gained by buying a faster computer or more RAM? I'm not sure whether I even agree with this statement, but this is the trade-off we are deciding on.
Correctness ========== Using list-based operations on Strings are almost always wrong, as soon as you move away from English text. You almost always have to deal with Unicode strings as blobs, considering several code points at once. For example, upcase :: String -> String upcase = map toUpper Is terse, beautiful, and wrong, as several languages map a single lowercase character to two uppercase characters (as I'm sure you're aware.) Perhaps this is OK to ignore when teaching students Haskell, but it really hurts those who want to use Haskell as an engineering language. Performance =========== Depending on the benchmark, the difference can be much bigger than 20%. For example, here's a comparison of decoding UTF-8 byte data into a String vs a Text value: benchmarking Pure/decode/Text mean: 50.22202 us, lb 50.08306 us, ub 50.37669 us, ci 0.950 std dev: 751.1139 ns, lb 666.2243 ns, ub 865.8246 ns, ci 0.950 variance introduced by outliers: 7.553% variance is slightly inflated by outliers benchmarking Pure/decode/String mean: 188.0507 us, lb 187.4970 us, ub 188.6955 us, ci 0.950 std dev: 3.053076 us, lb 2.647318 us, ub 3.606262 us, ci 0.950 variance introduced by outliers: 9.407% variance is slightly inflated by outliers A difference of almost 4x. Many of the Text vs String benchmarks measure the performance of operations ignoring both decoding and encoding, while any real application would have to do both. On top of that, String is more or less as optimized as it can be; benchmarks are almost completely memory bound. Text on the other hand still has potential of (large) improvements, as GHC doesn't general optimal code for tight loops over arrays. For example, we know that GHC generates bad code for decodeUtf8 as used by Text's stream fusion, hurting any code that uses fusion. Furthermore, the memory overhead of Text is smaller, which means that applications that hold on to many string value will use less heap and thus experience smaller "freezes" due major GC collections, which are linear in the heap size. Cheers, Johan

Hi Johan, On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
wrote: Which brings me to the fundamental question behind this proposal: Why do we need Text at all? What are its virtues and how do they compare? What is the trade-off? (I'm not familiar enough with the Text library to answer these.)
To put it very pointedly: is a %20 performance increase on the current generation of computers worth the cost in terms of ease-of-use, when the performance can equally be gained by buying a faster computer or more RAM? I'm not sure whether I even agree with this statement, but this is the trade-off we are deciding on.
Correctness ==========
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List. And in fact, very few functions there /don't/ look like they are directly equivalent to list functions.
, as soon as you move away from English text. You almost always have to deal with Unicode strings as blobs, considering several code points at once. For example,
upcase :: String -> String upcase = map toUpper
This is no more incorrect than upcase = Data.Text.map toUpper There's no reason that there couldn't be a Data.String.toUpper corresponding to Data.Text.toUpper.
Performance ===========
Depending on the benchmark, the difference can be much bigger than 20%. For example, here's a comparison of decoding UTF-8 byte data into a String vs a Text value:
I think Heinrich meant 20% performance in a useful program, not a micro-benchmark. Thanks Ian

On 24 March 2012 20:16, Ian Lynagh
Hi Johan,
On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
wrote: Which brings me to the fundamental question behind this proposal: Why do we need Text at all? What are its virtues and how do they compare? What is the trade-off? (I'm not familiar enough with the Text library to answer these.)
To put it very pointedly: is a %20 performance increase on the current generation of computers worth the cost in terms of ease-of-use, when the performance can equally be gained by buying a faster computer or more RAM? I'm not sure whether I even agree with this statement, but this is the trade-off we are deciding on.
Correctness ==========
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List.
That's exactly what happened as part of the platform inclusion process. In fact, there was quite a bit of bike shedding whether the Text API should be compatible with the list API or not. In the end the decision was made to add all the list functions even if that encouraged running into unicode issues. I'm pretty sure you participated in that discussion.
Performance ===========
Depending on the benchmark, the difference can be much bigger than 20%. For example, here's a comparison of decoding UTF-8 byte data into a String vs a Text value:
I think Heinrich meant 20% performance in a useful program, not a micro-benchmark.
Generating web sites is a huge application area of Haskell and one where a proper text type is in no way a micro optimisation.

On Sat, Mar 24, 2012 at 08:38:23PM +0000, Thomas Schilling wrote:
On 24 March 2012 20:16, Ian Lynagh
wrote: Correctness ==========
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List.
That's exactly what happened as part of the platform inclusion process. In fact, there was quite a bit of bike shedding whether the Text API should be compatible with the list API or not. In the end the decision was made to add all the list functions even if that encouraged running into unicode issues. I'm pretty sure you participated in that discussion.
As far as I remember, a few functions were added to text and bytestring during that, but mostly the discussion was about naming. Even in the first 0.1 release of bytestring: http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html there is a large amount of Data.List covered, e.g. map, transpose, foldl1', minimum, mapAccumR, groupBy. Thanks Ian

To add my tuppence-worth on this, addressed to no-one in particular:
(1) I think getting hung up on UTF-8 correctness is a distraction here. I
can't imagine anyone suggesting that the C/C++ standards removed support
for (char*) because it wasn't UTF-8 correct: sure, you'd recommend people
use a different type when it matters, but the language standard itself
shouldn't be driven by technical issues that don't affect most people most
of the time. I'm sure it's good engineering practice to worry about these
things, but the standard isn't there to encourage good engineering practice.
(2) I'd suggest that a proposal that advocated overloaded string literals
-- of which [Char] was an option -- couldn't be much more confusing from a
pedagogical perspective than the fact that numeric literals are overloaded.
Since that seems to be one of the main biases in favour of [Char] in the
current standard, that might be a possible incremental fix.
Best,
Freddie
On 24 March 2012 22:15, Ian Lynagh
On Sat, Mar 24, 2012 at 08:38:23PM +0000, Thomas Schilling wrote:
On 24 March 2012 20:16, Ian Lynagh
wrote: Correctness ==========
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List.
That's exactly what happened as part of the platform inclusion process. In fact, there was quite a bit of bike shedding whether the Text API should be compatible with the list API or not. In the end the decision was made to add all the list functions even if that encouraged running into unicode issues. I'm pretty sure you participated in that discussion.
As far as I remember, a few functions were added to text and bytestring during that, but mostly the discussion was about naming.
Even in the first 0.1 release of bytestring:
http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html there is a large amount of Data.List covered, e.g. map, transpose, foldl1', minimum, mapAccumR, groupBy.
Thanks Ian
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime

On 24 March 2012 22:33, Freddie Manners
To add my tuppence-worth on this, addressed to no-one in particular:
(1) I think getting hung up on UTF-8 correctness is a distraction here. I can't imagine anyone suggesting that the C/C++ standards removed support for (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a different type when it matters, but the language standard itself shouldn't be driven by technical issues that don't affect most people most of the time. I'm sure it's good engineering practice to worry about these things, but the standard isn't there to encourage good engineering practice.
It doesn't really have anything to do with UTF-8. UTF-8 is just a particular serialisation of a unicode string. Here's a simple illustration of the problems one faces: Let's say you want to search for the string "fix". Now, the problem is that the sequence 'f','i' could be represented both as ['f', 'i'] or as [chr 0xfb01] (the "fi" ligature). The text-icu package provides a function to normalise a string such that only one of these forms can occur in each string. Because the world's languages are rather complex there are many more such cases which need to be handled properly (if you don't want to run into weird corner cases).
(2) I'd suggest that a proposal that advocated overloaded string literals -- of which [Char] was an option -- couldn't be much more confusing from a pedagogical perspective than the fact that numeric literals are overloaded. Since that seems to be one of the main biases in favour of [Char] in the current standard, that might be a possible incremental fix.
I agree that this proposal should probably include the standardisation of the OverloadedStrings extension.
Best, Freddie
On 24 March 2012 22:15, Ian Lynagh
wrote: On Sat, Mar 24, 2012 at 08:38:23PM +0000, Thomas Schilling wrote:
On 24 March 2012 20:16, Ian Lynagh
wrote: Correctness ==========
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List.
That's exactly what happened as part of the platform inclusion process. In fact, there was quite a bit of bike shedding whether the Text API should be compatible with the list API or not. In the end the decision was made to add all the list functions even if that encouraged running into unicode issues. I'm pretty sure you participated in that discussion.
As far as I remember, a few functions were added to text and bytestring during that, but mostly the discussion was about naming.
Even in the first 0.1 release of bytestring:
http://hackage.haskell.org/packages/archive/text/0.1/doc/html/Data-Text.html there is a large amount of Data.List covered, e.g. map, transpose, foldl1', minimum, mapAccumR, groupBy.
Thanks Ian
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime
-- Push the envelope. Watch it bend.

On Sat, Mar 24, 2012 at 3:33 PM, Freddie Manners
To add my tuppence-worth on this, addressed to no-one in particular:
(1) I think getting hung up on UTF-8 correctness is a distraction here. I can't imagine anyone suggesting that the C/C++ standards removed support for (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a different type when it matters, but the language standard itself shouldn't be driven by technical issues that don't affect most people most of the time. I'm sure it's good engineering practice to worry about these things, but the standard isn't there to encourage good engineering practice.
(I assume you mean Unicode correctness. UTF-8 is only one possible encoding. Also I'm not arguing for removing type String = [Char], I arguing why Text is better than String.) C++'s char* is morally equivalent of our ByteString, not Text. There's no standardized C++ Unicode string type, ICU's UnicodeString is perhaps the closest to one. -- Johan

On Sat, Mar 24, 2012 at 6:00 PM, Johan Tibell
C++'s char* is morally equivalent of our ByteString, not Text. There's no standardized C++ Unicode string type, ICU's UnicodeString is perhaps the closest to one.
Hmm, std::u16string, std::u23string, and std::wstring are C++ standard types to process Unicode texts. Anyway, my inclination is that having a proper string in Haskell type would be a Good Thing. Sometimes it is worth breaking the textbook. In our local Haskell system for AVR microcontrollers, we explicitly made String distinct from [Char] -- we cannot afford the memory inefficiency that [Char] entails, just to represent simple strings. -- Gaby

On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis
Hmm, std::u16string, std::u23string, and std::wstring are C++ standard types to process Unicode texts.
Note that at least u16string is too small to encode all of Unicode and wstring might be as 16 bits is not enough to encode all of Unicode. -- Johan

On Sat, Mar 24, 2012 at 7:16 PM, Johan Tibell
On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis
wrote: Hmm, std::u16string, std::u23string, and std::wstring are C++ standard types to process Unicode texts.
Note that at least u16string is too small to encode all of Unicode and wstring might be as 16 bits is not enough to encode all of Unicode.
I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16-bit wide is irrelevant to whether it can be used in a representation of a Unicode text, just like uint8_t (e.g. 'unsigned char') can be used to encode Unicode string despite it being only 8-bit wide. You do not need to make the character type exactly equal to the type of the individual element in the text representation. Now, if you want to make a one-to-one correspondence between individual elements in a std::basic_string and a Unicode character, you would of course go for char32_t, which might be wasteful depending on the circumstances. Text processing languages like Perl have long decided to de-emphasize one-character-at-a-time processing. For most common cases, it is just inefficient. But, I also understand that the efficiency argument may not be strong in the context of Haskell. However, I believe a particular attention must be paid to the correctness of the semantics. Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a localization at the minimum. It is the combination of the two that gives some meaning to text representation and operations. I have been following the discussion, but I don't see anything said about locales. -- Gaby

Can we all agree that * Text can now demonstrate both CPU and RAM performance improvements in benchmarks. Because Text is an opaque type it has a maximum potential for future performance improvements. Declaring a String to be a list limits performance improvements * In a Unicode world, String = [Char] is not always correct: instead for some operations one must operate on the String as a whole. Using a [Char] type makes it much more likely for a programmer to mistakenly operate on individual characters. Using a Text type allows us to choose to not expose character manipulation functions. * The usage of String in the base libraries will continue as long as Text is not in the language standard. This will continue to make writing Haskell code a greater chore than is necessary: converting between types, and working around the inconvenience of defining typeclasses that operate on both String and []. These are important enough to *try* to include Text into the standard, even if there are objections to how it might practically be included.

On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis
I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16-bit wide is irrelevant to whether it can be used in a representation of a Unicode text, just like uint8_t (e.g. 'unsigned char') can be used to encode Unicode string despite it being only 8-bit wide. You do not need to make the character type exactly equal to the type of the individual element in the text representation.
Well, if you have a >21-bit type you can declare its value to be a Unicode code point (which are numbered.) Using a char* that you claim contain utf-8 encoded data is bad for safety, as there is no guarantee that that's indeed the case.
Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a localization at the minimum. It is the combination of the two that gives some meaning to text representation and operations.
text does that via ICU. Some operations would be possible without using the locale, if it wasn't for those Turkish i:s. :/ -- Johan

# Switching to Text by default makes us embarrassed! Problem: we want to write beautiful (and possibly inefficient) code that is easy to explain. If nothing else, this is pedagologically important. The goals of this code are to: * use list processing pattern matching and functions on a string type * avoid embarassing name clashes and the need for qualified names (T.split, etc) The second point is Haskell's festering language design sore rearing its ugly head. Lets note that the current state of Haskell is not any more beautiful than what will happen after this proposal is implemented. It is just that we currently have partly hidden away a deficiency in Haskell by only exporting list functions in the Prelude. So our real goal is to come up with conventions and/or hacks that will allow us to continue to hide this deficiency of Haskell for the purposes of pedagogy. If you can't tell, IMHO the issue we are circumventing is Haskell's biggest issue form a laguage design perspective. It is a shame that SPJ's TDNR proposal was shouted down and no alternative has been given. But I am not going to hold out hope that this issue will be solved any time soon. Just limiting solving this to records has proved very difficult. So onto our hacks for making Text the default string type! ## Option 1: T. prefixing using Text functions still requires the T. prefix For pedagogy, continue to use [Char], but use an OverloadedText extension This is a safe conservative option that puts us in a better place than we are today. It just makes us look strange when we build something into the language that requires a prefix. Of course, we could try to give every Text function a slightly different name than the Prelude list functions, but I think that will make using Haskell more difficult that putting up with prefixes. ## Option 2: TDNR for lists (Prelude) list functions are resolved in a special way. For example, we could have 2 different map functions in scope unqualified: one for lists, and one for Text. The compiler is tasked with resolving whether the type is a list or not and determining the appropriate function. I would much rather add a TDNR construct to the language in a universal way than go down this route. ## Option 3: implicit List typeclass We can operate on Text (and other non-list data structures) using a List typeclass. We have 2 concers: * list pattern matching ('c':string) * requiring the typeclass in the type signature everywhere I think we can extend the compiler to pattern match characters out of Text, so lets move onto the second point. If we don't write type signatures anywhere, we actually won't care about it. However, if we add sparse annotations, we will need a List constraint. listF :: List l => ... This could get tiresome quickly. It makes pedagogy immediately delve into an explanation of typeclasses. A simple solution is to special case the List class. We declare that List is so fundamental to Haskell that requiring the List typeclass is not necessary. The Prelude exports (class List where ...). If a List typeclass function is used, the compiler inserts the List typeclass constraint into a type signature automatically. This option is very attractive because it solves all of our problems at the cost of 1 easy to explain piece of magic. It also makes it possible to unify list behavior across different data types without the hassle of typeclass insertions everywhere.

On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber
# Switching to Text by default makes us embarrassed!
Text processing /is/ quick to embarrassment :-)
Problem: we want to write beautiful (and possibly inefficient) code that is easy to explain. If nothing else, this is pedagologically important. The goals of this code are to: * use list processing pattern matching and functions on a string type
I may have missed this question so I will ask it (apologies if it is a repeat): Why is it believed that list processing pattern matching is appropriate or the right tool for text processing? -- Gaby

On Sat, Mar 24, 2012 at 7:26 PM, Gabriel Dos Reis
On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber
wrote:
Problem: we want to write beautiful (and possibly inefficient) code that is easy to explain. If nothing else, this is pedagologically important. The goals of this code are to: * use list processing pattern matching and functions on a string type
I may have missed this question so I will ask it (apologies if it is a repeat): Why is it believed that list processing pattern matching is appropriate or the right tool for text processing?
Nobody said it is the right tool for text processing. In fact, I think we all agreed it is the wrong tool for many cases. But it is easy for students to understand since they are already being taught to use lists for everything else. It would be great if you can talk with teachers of Haskell and figure out a better way to teach text processing.
-- Gaby

On Sat, Mar 24, 2012 at 10:19 PM, Greg Weber
On Sat, Mar 24, 2012 at 7:26 PM, Gabriel Dos Reis
wrote: On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber
wrote: Problem: we want to write beautiful (and possibly inefficient) code that is easy to explain. If nothing else, this is pedagologically important. The goals of this code are to: * use list processing pattern matching and functions on a string type
I may have missed this question so I will ask it (apologies if it is a repeat): Why is it believed that list processing pattern matching is appropriate or the right tool for text processing?
Nobody said it is the right tool for text processing. In fact, I think we all agreed it is the wrong tool for many cases.
Hmm, I would have thought that would be enough reasons not to use that method -- "wrong methods" are hard to unlearn and to get rid of.
But it is easy for students to understand since they are already being taught to use lists for everything else.
Perhaps we are underestimating their competences and are complicating their lives unnecessarily...
It would be great if you can talk with teachers of Haskell and figure out a better way to teach text processing.
my suspicion is teachers of Haskell would want designers of Haskell to make the good datatype for text the default :-) :-) -- Gaby

On Sun, Mar 25, 2012 at 11:42 AM, Gabriel Dos Reis < gdr@integrable-solutions.net> wrote:
Perhaps we are underestimating their competences and are complicating their lives unnecessarily...
Have you ever actually taught an introductory languages course? If anything we delude ourselves by overestimating the ability of kids just shortly out of highschool to assimilate an entire new worldview in a couple of weeks while they are distracted by other things. Any additional distraction that makes this harder is a serious pain point. Consequently, in my experience, most instructors don't even go outside of the Prelude, except perhaps to introduce simple custom data types that their students define. The goal in that period is to get the students accustomed to non-strictness, do some list processing, and hope that an understanding of well-founded recursion vs. productive corecursion sticks, because these are the things that you can't teach well in another language and which are useful to the student no matter what tools they wind up using in the future. I would rather extra time be spent trying to get the users up to speed on the really interesting and novel parts of the language, such as typeclasses and monads in particular, than lose at least a quarter of my time fiddling about with text processing, a special case API and qualified imports, because those couple of weeks are going to shape many of those students' opinion of the language forever. -Edward

On Sun, Mar 25, 2012 at 2:08 PM, Edward Kmett
On Sun, Mar 25, 2012 at 11:42 AM, Gabriel Dos Reis
wrote: Perhaps we are underestimating their competences and are complicating their lives unnecessarily...
Have you ever actually taught an introductory languages course?
Yes, and Haskell (if you asked); that is part of my daytime job. -- Gaby

On Sun, Mar 25, 2012 at 2:08 PM, Edward Kmett
If anything we delude ourselves by overestimating the ability of kids just shortly out of highschool to assimilate an entire new worldview in a couple of weeks while they are distracted by other things. Any additional distraction that makes this harder is a serious pain point.
We are doing our students no favor, no good, in being condescending to them pretending that they can't handle teaching material that would actually be close real world experience. If we truly believe that they don't have enough time to learn what would really be useful to them, then we are truly wasting their valuable time teaching them things they would have to unlearn before writing good and correct code. The education would have been a complete failure and waste of resources.
Consequently, in my experience, most instructors don't even go outside of the Prelude,
but is that even a good thing?
except perhaps to introduce simple custom data types that their students define.
I would say that does not do students any good, nor does it do justice to the language. If one believes that Haskell is unlike any other "mainstream" language, then it is an opportunity to show that it can handle beautifully and flawlessly some real world problems whose solutions are more involved in other more popular languages. These days, most of the students are strolling around with "smart phones" that have lot of data in form of texts (email, SMS chats, etc.) What better real world data could you find at a cheaper price to get them experiment with? Restricting oneself to purely "academic" exercises, with no practical benefit whatsoever, would only reinforce students' (mis)perception that the language isn't of any use to them -- and they would probably be indulging more into distractions, and it would be hard to blame them.
The goal in that period is to get the students accustomed to non-strictness, do some list processing, and hope that an understanding of well-founded recursion vs. productive corecursion sticks, because these are the things that you can't teach well in another language and which are useful to the student no matter what tools they wind up using in the future.
I would say that is even more reasons to get them learn something that they would not have to unlearn in order to remain harmless :-)
I would rather extra time be spent trying to get the users up to speed on the really interesting and novel parts of the language, such as typeclasses and monads in particular, than lose at least a quarter of my time fiddling about with text processing, a special case API and qualified imports, because those couple of weeks are going to shape many of those students' opinion of the language forever.
More reasons not to show them anything that would reinforce the idea that language should not be taken seriously and is complete waste of time. You would be surprised to learn how bored students are *because* we obsess too much on trying simplify their lives, while they are craving for us to make it more interesting, more challenging. -- Gaby

On Sun, Mar 25, 2012 at 3:47 PM, Gabriel Dos Reis < gdr@integrable-solutions.net> wrote:
We are doing our students no favor, no good, in being condescending to them pretending that they can't handle teaching material that would actually be close real world experience. If we truly believe that they don't have enough time to learn what would really be useful to them, then we are truly wasting their valuable time teaching them things they would have to unlearn before writing good and correct code. The education would have been a complete failure and waste of resources.
When people teach Haskell, it typically isn't to give them "real world experience", but to teach them an interesting programming language and all the great computer science it leads to. Types, laziness, higher-order abstractions are the hard bits to learn, not a string-processing API. If people want to learn how to deal with unicode correctly, I can think of several better places to learn about it than a Haskell course. I don't think it's condescending or impractical to focus on the things that make Haskell unique, rather than teaching a unicode-correct API that could conceivably be written in any other language. Learning that real human text cannot be treated just an independent list of characters is something that takes minutes to hours at most: someone tells you that there are all sorts of exceptions to the list-of-chars paradigm, and then you read an article or two on the language-specific difficulties, learn to use specialized API functions, and then you get on with what you were actually trying to do. So I think saying that ignoring unicode-correct strings a complete failure and waste of resources is a bit hyperbolic, honestly.

On Sun, Mar 25, 2012 at 4:03 PM, Daniel Peebles
On Sun, Mar 25, 2012 at 3:47 PM, Gabriel Dos Reis
wrote: We are doing our students no favor, no good, in being condescending to them pretending that they can't handle teaching material that would actually be close real world experience. If we truly believe that they don't have enough time to learn what would really be useful to them, then we are truly wasting their valuable time teaching them things they would have to unlearn before writing good and correct code. The education would have been a complete failure and waste of resources.
When people teach Haskell, it typically isn't to give them "real world experience", but to teach them an interesting programming language and all the great computer science it leads to.
Yes, but you have to frame it in the context of interesting problems, otherwise it reduces to a series of dry, pointless, uninspiring series functions named f, g, h :-)
Types, laziness, higher-order abstractions are the hard bits to learn, not a string-processing API.
If people want to learn how to deal with unicode correctly, I can think of several better places to learn about it than a Haskell course.
I don't think anybody suggested that a Haskell course should be a substitute for Unicode course. However, I maintain that it isn't an excuse to purposefully teach something that the students have to unlearn.
I don't think it's condescending or impractical to focus on the things that make Haskell unique, rather than teaching a unicode-correct API that could conceivably be written in any other language.
Why should a Unicode-correct API would have to be written in any other language and not Haskell?
Learning that real human text cannot be treated just an independent list of characters is something that takes minutes to hours at most: someone tells you that there are all sorts of exceptions to the list-of-chars paradigm, and then you read an article or two on the language-specific difficulties, learn to use specialized API functions, and then you get on with what you were actually trying to do.
Which brings us back to square zero: Is there any fundamental reason why the language can't provide a good Unicode-correct API to illustrate solutions to text processing problems (many of them interesting) and that can be used in introductory classes instead of having to say the above. Students, like most children, learn by imitation. They will replicate whatever they are shown (for a long period of time, if not forever.) If it is true that students don't time, why should they have to waste that scarce resource listening to the list-of-char paradigm in the first place?
So I think saying that ignoring unicode-correct strings a complete failure and waste of resources is a bit hyperbolic, honestly.
It may be an inconvenient truth, but not hyperbole [ I see you are trying to be polite :-) ] Look at the almost permanent damage done by the culture that equated 'char*' to strings. It may be inconvenient to say, but [Char] isn't any better -- in fact, I'll go further and say: it is spreading the same damage but only with a different syntax. The damage is semantics, no amount of syntax clothing will undo it. I realize that may appear as a strong statement, but think about it. -- Gaby

Hi all, Gabriel Dos Reis wrote:
Look at the almost permanent damage done by the culture that equated 'char*' to strings. It may be inconvenient to say, but [Char] isn't any better -- in fact, I'll go further and say: it is spreading the same damage but only with a different syntax. The damage is semantics, no amount of syntax clothing will undo it. I realize that may appear as a strong statement, but think about it.
Reasonable people might choose to disagree with that. In any case, this is hardly the place to to discuss how to best teach Haskell or programming in general. Nor is the Haskell standard a vehicle to prescribe how Haskell should be taught or for what reasons Haskell should be taught: that can only be decided by individual educators based in their experience and given a specific teaching context. Given intimate knowledge of our specific teaching context here at Nottingham, I can say that removing String = [Char] from the language wouldn't be helpful to us. And we do respect our students. Best, /Henrik -- Henrik Nilsson School of Computer Science The University of Nottingham nhn@cs.nott.ac.uk

On Sun, Mar 25, 2012 at 6:54 PM, Henrik Nilsson
In any case, this is hardly the place to to discuss how to best teach Haskell or programming in general.
Sure, I haven't seen any disagreement with that. Note however that the "pedagogical" arguments was brought in as support for the [Char] definition. It is only natural that it being challenged on that ground.
Nor is the Haskell standard a vehicle to prescribe how Haskell should be taught or for what reasons Haskell should be taught:
I have not seen any assertion to that effect.
that can only be decided by individual educators based in their experience and given a specific teaching context.
True, but should the language definition default to a string type that is one the most unsuited for text processing in the 21st century where global multilingualism abounds? Even C has qualms about that.
Given intimate knowledge of our specific teaching context here at Nottingham, I can say that removing String = [Char] from the language wouldn't be helpful to us.
I have no doubt believing that if all texts my students have to process are US ASCII, [Char] is more than sufficient. So, I have sympathy for your position. However, I doubt [Char] would be adequate if I ask them to shared texts from their diverse cultures. Should the language definition make it much harder to share such experience in classroom when the primary argument for [Char] is pedagogy? -- Gaby

On Sun, Mar 25, 2012 at 5:39 PM, Gabriel Dos Reis
On Sun, Mar 25, 2012 at 6:54 PM, Henrik Nilsson
wrote: In any case, this is hardly the place to to discuss how to best teach Haskell or programming in general.
Sure, I haven't seen any disagreement with that.
As interesting as this discussion is, I think this agreement is the perfect segway to take it off list. Debating the usefulness of String = [Char] does not seem to be productive here anyways. What would be helpful is if alternatives were offered, tested out, and shared among the teaching community. If this requires any changes to the language (other than what is being discussed now), please let us know. The Prelude is still going to export list functions whether they are used on [Char] or not. So we are still in the position of requiring qualified imports for Text functions or needing to change the Text package or something about the language to avoid conflicts with list functions.

Hi all,
True, but should the language definition default to a string type that is one the most unsuited for text processing in the 21st century where global multilingualism abounds? Even C has qualms about that.
Even if we accept these allegation (and I think they are greatly exaggerated), a reasonable position would be yes, for reasons of simplicity and backwards compatibility, given that just the change of default definition of "String" in itself is hardly going to do anything to make it easier to write software that truly can cope with any language. The latter requires tools, libraries, and knowledge well beyond what reasonably should go into any programming language standard.
I have no doubt believing that if all texts my students have to process are US ASCII, [Char] is more than sufficient. So, I have sympathy for your position. However, I doubt [Char] would be adequate if I ask them to shared texts from their diverse cultures. Should the language definition make it much harder to share such experience in classroom when the primary argument for [Char] is pedagogy?
The primary argument is to not break something that works well for most purposes, including teaching, at a huge cost of backwards compatibility for marginal if any real benefits. What would be helpful here is if you could clarify exactly what would stop someone using a library like Text in an exercise like the one you outline above if String = [Char] remains in the language. Best, /Henrik -- Henrik Nilsson School of Computer Science The University of Nottingham nhn@cs.nott.ac.uk

The primary argument is to not break something that works well for most purposes, including teaching, at a huge cost of backwards compatibility for marginal if any real benefits.
I'm persuaded by this argument. And I'm glad that teachers are speaking up in this debate - it's hard to get a balanced discussion on an obscure mailing list. So I'm far from convinced that [Char] is a bad default for the String type. But it's important that as far as possible Text should not be a second class citizen, so I'd support adding OverloadedStrings to the language, and maybe looking at overloading some of the String APIs in the standard libraries. Remember that FilePath is not part of the debate, since neither [Char] nor Text are correct representations of FilePath. If we want to do an evaluation of the pedagogical value of [Char] vs. Text, I suggest writing something like a regex matcher in both and comparing the two. One more thing: historically, performance considerations have been given a fairly low priority in the language design process for Haskell, and rightly so. That doesn't mean performance has been ignored altogether (for example, seq), but it is almost never the case that a concession in other language design principles (e.g. consistency, simplicity) is made for performance reasons alone. We should remember, when thinking about changes to Haskell, that Haskell is the way it is because of this uncompromising attitude, and we should be glad that Haskell is not burdened with (many) legacy warts that were invented to work around performance problems that no longer exist. I'm not saying that this means we should ignore Text as a performance hack, just that performance should not come at the expense of good language design. Cheers, Simon

Hi all, Simon Marlow wrote:
So I'm far from convinced that [Char] is a bad default for the String type. But it's important that as far as possible Text should not be a second class citizen, so I'd support adding OverloadedStrings to the language, and maybe looking at overloading some of the String APIs in the standard libraries.
I agree completely.
One more thing: historically, performance considerations have been given a fairly low priority in the language design process for Haskell, and rightly so. [...] we should be glad that Haskell is not burdened with (many) legacy warts that were invented to work around performance problems that no longer exist. I'm not saying that this means we should ignore Text as a performance hack, just that performance should not come at the expense of good language design.
Well said. And as Isaac Dupree reminded us:
How is Text for small strings currently (e.g. one English word, if not one character)? Can we reasonably recommend it for that? This recent question suggests it's still not great: http://stackoverflow.com/questions/9398572/memory-efficient-strings- in-haskell
even if performance was the sole goal, the "right choice" (as always) is hardly clear cut. Best, /Henrik -- Henrik Nilsson School of Computer Science The University of Nottingham nhn@cs.nott.ac.uk

On Mon, Mar 26, 2012 at 4:57 AM, Simon Marlow
Remember that FilePath is not part of the debate, since neither [Char] nor Text are correct representations of FilePath.
Yes.
If we want to do an evaluation of the pedagogical value of [Char] vs. Text, I suggest writing something like a regex matcher in both and comparing the two.
One more thing: historically, performance considerations have been given a fairly low priority in the language design process for Haskell, and rightly so. That doesn't mean performance has been ignored altogether (for example, seq), but it is almost never the case that a concession in other language design principles (e.g. consistency, simplicity) is made for performance reasons alone. We should remember, when thinking about changes to Haskell, that Haskell is the way it is because of this uncompromising attitude, and we should be glad that Haskell is not burdened with (many) legacy warts that were invented to work around performance problems that no longer exist. I'm not saying that this means we should ignore Text as a performance hack, just that performance should not come at the expense of good language design.
For pedagogical purposes (which seems to be the primary argument for String = [Char]), I am far less concerned about performance than correctness. After going through the discussion this morning again, looking at various arguments, I am not really sure that Haskell isn't burdened with legacy warts ;-) -- Gaby

On 03/26/2012 02:39 AM, Gabriel Dos Reis wrote:
True, but should the language definition default to a string type that is one the most unsuited for text processing in the 21st century where global multilingualism abounds? Even C has qualms about that. ... I have no doubt believing that if all texts my students have to process are US ASCII, [Char] is more than sufficient. So, I have sympathy for your position. However, I doubt [Char] would be adequate if I ask them to shared texts from their diverse cultures.
Uh, while a C char is (usually) just a byte (2^8 bits of information, like Word8 in Haskell), a Haskell Char is a Unicode character (2^21 bits of information). A single C char cannot contain arbitrary Unicode character, while a Haskell Char can, and does. Hence [Char] is (efficiency issues aside) perfectly adequate for dealing with texts written in arbitrary languages. Best regards Christian [I first accidentally send this just to Gabriel, sorry.] -- |------- Dr. Christian Siefkes ------- christian@siefkes.net ------- | Homepage: http://www.siefkes.net/ | Blog: http://www.keimform.de/ | Peer Production Everywhere: http://peerconomy.org/wiki/ |---------------------------------- OpenPGP Key ID: 0x346452D8 -- A bug is a test case you haven't written yet.

On Mon, Mar 26, 2012 at 5:08 AM, Christian Siefkes
On 03/26/2012 02:39 AM, Gabriel Dos Reis wrote:
True, but should the language definition default to a string type that is one the most unsuited for text processing in the 21st century where global multilingualism abounds? Even C has qualms about that. ... I have no doubt believing that if all texts my students have to process are US ASCII, [Char] is more than sufficient. So, I have sympathy for your position. However, I doubt [Char] would be adequate if I ask them to shared texts from their diverse cultures.
Uh, while a C char is (usually) just a byte (2^8 bits of information, like Word8 in Haskell), a Haskell Char is a Unicode character (2^21 bits of information).
It is not the precision of Char or char that is the issue here. It has been clarified at several points that Char is not a Unicode character, but a Unicode code point. Not every Unicode code point represents a Unicode code character, and not every sequence of Unicode code points represents a character or a sequence of Unicode character.
A single C char cannot contain arbitrary Unicode character, while a Haskell Char can, and does. Hence [Char] is (efficiency issues aside) perfectly adequate for dealing with texts written in arbitrary languages.
See above. -- Gaby

On 03/26/2012 01:26 PM, Gabriel Dos Reis wrote:
It is not the precision of Char or char that is the issue here. It has been clarified at several points that Char is not a Unicode character, but a Unicode code point. Not every Unicode code point represents a Unicode code character, and not every sequence of Unicode code points represents a character or a sequence of Unicode character.
What do you mean? Every Unicode character corresponds to one code point, and every code point in the range 0 to 0x10FFFF (excluding the range 0xD800 to 0xDFFF which is reserved for surrogate pairs in UTF-16, and a handful of "noncharacters", see http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Special_code_poin... ) corresponds to one character. Maybe your criticism is that Char does not explicitly prevent these special code points from being assigned? While true, that seems a relatively minor matter. Moreover, a future revision of the Haskell standard could easily declare that a assigning a "forbidden" character results in an error/bottom if that is so desired. Best regards Christian -- |------- Dr. Christian Siefkes ------- christian@siefkes.net ------- | Homepage: http://www.siefkes.net/ | Blog: http://www.keimform.de/ | Peer Production Everywhere: http://peerconomy.org/wiki/ |---------------------------------- OpenPGP Key ID: 0x346452D8 -- Linux is like living in a tipi: no windows, no gates, Apache inside.

On 26 March 2012 13:29, Christian Siefkes
On 03/26/2012 01:26 PM, Gabriel Dos Reis wrote:
It is not the precision of Char or char that is the issue here. It has been clarified at several points that Char is not a Unicode character, but a Unicode code point. Not every Unicode code point represents a Unicode code character, and not every sequence of Unicode code points represents a character or a sequence of Unicode character.
What do you mean? Every Unicode character corresponds to one code point, and every code point in the range 0 to 0x10FFFF (excluding the range 0xD800 to 0xDFFF which is reserved for surrogate pairs in UTF-16, and a handful of "noncharacters", see http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Special_code_poin... ) corresponds to one character.
I think it's best not to use the term "Unicode character" since it's highly ambiguous, to quote from http://www.icu-project.org/docs/papers/forms_of_unicode/: """ We have seen that characters, glyphs, code points, and code units are all different. Unfortunately the term character is vastly overloaded. At various times people can use it to mean any of these things: - An image on paper (glyph) - What an end-user thinks of as a character (grapheme) - What a character encoding standard encodes (code point) - A memory storage unit in a character encoding (code unit) Because of this, ironically, it is best to avoid the use of the term character entirely when discussing character encodings, and stick to the term code point. """ The section http://www.icu-project.org/docs/papers/forms_of_unicode/#h0 is also important to keep in mind.
Maybe your criticism is that Char does not explicitly prevent these special code points from being assigned? While true, that seems a relatively minor matter. Moreover, a future revision of the Haskell standard could easily declare that a assigning a "forbidden" character results in an error/bottom if that is so desired.
Best regards Christian
-- |------- Dr. Christian Siefkes ------- christian@siefkes.net ------- | Homepage: http://www.siefkes.net/ | Blog: http://www.keimform.de/ | Peer Production Everywhere: http://peerconomy.org/wiki/ |---------------------------------- OpenPGP Key ID: 0x346452D8 -- Linux is like living in a tipi: no windows, no gates, Apache inside.
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime
-- Push the envelope. Watch it bend.

On Mon, Mar 26, 2012 at 7:29 AM, Christian Siefkes
On 03/26/2012 01:26 PM, Gabriel Dos Reis wrote:
It is not the precision of Char or char that is the issue here. It has been clarified at several points that Char is not a Unicode character, but a Unicode code point. Not every Unicode code point represents a Unicode code character, and not every sequence of Unicode code points represents a character or a sequence of Unicode character.
What do you mean? Every Unicode character corresponds to one code point,
Yes, but this correspondence is not a bijection -- a great source of confusion that permeates lot of discussions about Unicode characters and texts, including this one (and a previous regarding the Haskell Report.) Very much heart breaking :-(
and every code point in the range 0 to 0x10FFFF (excluding the range 0xD800 to 0xDFFF which is reserved for surrogate pairs in UTF-16, and a handful of "noncharacters", see http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Special_code_poin... ) corresponds to one character.
Maybe your criticism is that Char does not explicitly prevent these special code points from being assigned? While true, that seems a relatively minor matter. Moreover, a future revision of the Haskell standard could easily declare that a assigning a "forbidden" character results in an error/bottom if that is so desired.
It is not just a matter of clarification that certain things are forbidden. I believe it would be a great mistake to qualify it as minor. How do you handle normalization if you expose the texts as sequence of unrelated code points that can be freely taken apart and combined? - Gaby

"Gabriel" == Gabriel Dos Reis
writes:
Gabriel> On Mon, Mar 26, 2012 at 5:08 AM, Christian Siefkes Gabriel> It is not the precision of Char or char that is the issue Gabriel> here. It has been clarified at several points that Char is Gabriel> not a Unicode character, but a Unicode code point. That's not what the standard says: 6.1.2 Characters and Strings The character type Char is an enumeration whose values represent Unicode characters [2]. [2] Unicode Consortium. Unicode standard. http://unicode.org/standard/standard.html. -- Colin Adams Preston Lancashire () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments

I am starting up the proposal.
http://hackage.haskell.org/trac/haskell-prime/ticket/143
http://hackage.haskell.org/trac/haskell-prime/wiki/OpaqueText
Unfortunately I haven't had any time to work on this for the last week
and won't for 2 more weeks.
Your help is appreciated. I think the first step is to move list
functions into a separate module for the Text package to see if we can
get rid of name conflicts with the Prelude.
On Sat, Mar 31, 2012 at 11:49 PM, Colin Paul Adams
"Gabriel" == Gabriel Dos Reis
writes: Gabriel> On Mon, Mar 26, 2012 at 5:08 AM, Christian Siefkes
Gabriel> It is not the precision of Char or char that is the issue Gabriel> here. It has been clarified at several points that Char is Gabriel> not a Unicode character, but a Unicode code point.
That's not what the standard says:
6.1.2 Characters and Strings
The character type Char is an enumeration whose values represent Unicode characters [2].
[2] Unicode Consortium. Unicode standard. http://unicode.org/standard/standard.html.
-- Colin Adams Preston Lancashire () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime

On Mon, Mar 26, 2012 at 06:08, Christian Siefkes
On 03/26/2012 02:39 AM, Gabriel Dos Reis wrote:
True, but should the language definition default to a string type that is one the most unsuited for text processing in the 21st century where global multilingualism abounds? Even C has qualms about that. ... I have no doubt believing that if all texts my students have to process are US ASCII, [Char] is more than sufficient. So, I have sympathy for your position. However, I doubt [Char] would be adequate if I ask them to shared texts from their diverse cultures.
Uh, while a C char is (usually) just a byte (2^8 bits of information, like Word8 in Haskell), a Haskell Char is a Unicode character (2^21 bits of information). A single C char cannot contain arbitrary Unicode character, while a Haskell Char can, and does. Hence [Char] is (efficiency issues aside) perfectly adequate for dealing with texts written in arbitrary languages.
...as long as you ignore combining characters and the like. I claim ignoring them in this way is just continuing the same "good enough for my language" attitude that has plagued text handling ever since someone got the notion that maybe text processing should consider more than just ISO 8859/1 and got roundly pooh-poohed by the community. -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Sun, Mar 25, 2012 at 5:19 AM, Greg Weber
On Sat, Mar 24, 2012 at 7:26 PM, Gabriel Dos Reis
wrote: On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber
wrote: Problem: we want to write beautiful (and possibly inefficient) code that is easy to explain. If nothing else, this is pedagologically important. The goals of this code are to: * use list processing pattern matching and functions on a string type
I may have missed this question so I will ask it (apologies if it is a repeat): Why is it believed that list processing pattern matching is appropriate or the right tool for text processing?
Nobody said it is the right tool for text processing. In fact, I think we all agreed it is the wrong tool for many cases. But it is easy for students to understand since they are already being taught to use lists for everything else. It would be great if you can talk with teachers of Haskell and figure out a better way to teach text processing.
I think a helpful question might be whether [Char] is mainly used to teach about lists, or whether it's mainly used to teach about how to do Unicode text processing correctly. If it's mainly used to teach about lists, pattern matching, etc., as I suspect, then the fine details of Unicode don't matter so much, you could even work with ASCII-only strings and it would work equally well for teaching about lists. How to do Unicode text processing correctly is a topic that seems like it would become important much later, when someone's going to write code that's meant to be used in a production environment. Most students in an introductory university course probably don't get close to that point. If you do want to teach about how to do Unicode text processing correctly (which, for the record, is an important issue irrespective of which programming language you're using) then presumably you want to teach about Text, but hopefully your students will be more advanced by then and it won't be so much of a problem. I'm not really sure what that recommends in terms of policy. Mainly what you need is "it should be possible to work with lists of characters" and "it should be possible to work with Text", which we more-or-less have already. The important bits seem to be OverloadedStrings and ideally some way to avoid a pervasive API bias towards the wrong type (the tradeoffs there are probably more tricky). (So... basically what Simon M. said.)

Can anyone explain how the tangent discussion of the finer points of Unicode and the value of teaching [Char] is relevant to the proposal under discussion? We aren't going to completely eliminate String and break most existing Haskell code as Simon said. String is just a list anyways, and lists are here to stay in Haskell. I would like to get back to working on the proposal and determining how Text can be added to the language. Thank you, Greg weber

On Mon, Mar 26, 2012 at 9:21 AM, Greg Weber
Can anyone explain how the tangent discussion of the finer points of Unicode and the value of teaching [Char] is relevant to the proposal under discussion? We aren't going to completely eliminate String and break most existing Haskell code as Simon said. String is just a list anyways, and lists are here to stay in Haskell.
I would like to get back to working on the proposal and determining how Text can be added to the language.
The discussion started because of the question of whether Text should support list processing functions at all, and if so how. That is a very legitimate question related to the Text proposal, at least if you are concerned about correct semantics. Once you are there, the discussion about Unicode characters is unavoidable, and is very much within the scope of discussing Text. I may have missed the proposal to eliminate list from Haskell, though. -- Gaby

I would like to get back to working on the proposal and determining how Text can be added to the language.
The discussion started because of the question of whether Text should support list processing functions at all, and if so how. That is a very legitimate question related to the Text proposal, at least if you are concerned about correct semantics. Once you are there, the discussion about Unicode characters is unavoidable, and is very much within the scope of discussing Text.
Can we take a break from arguing then, and can you create a wiki page that explains how you think Text should behave? This conversation is getting extremely long and repetitive. Can someone please show me how to create a wiki page on the Haskell proposals site (or suggest a different appropriate site)?

On Mon, Mar 26, 2012 at 8:35 AM, Gábor Lehel
On Sun, Mar 25, 2012 at 5:19 AM, Greg Weber
wrote: On Sat, Mar 24, 2012 at 7:26 PM, Gabriel Dos Reis
wrote: On Sat, Mar 24, 2012 at 9:09 PM, Greg Weber
wrote: Problem: we want to write beautiful (and possibly inefficient) code that is easy to explain. If nothing else, this is pedagologically important. The goals of this code are to: * use list processing pattern matching and functions on a string type
I may have missed this question so I will ask it (apologies if it is a repeat): Why is it believed that list processing pattern matching is appropriate or the right tool for text processing?
Nobody said it is the right tool for text processing. In fact, I think we all agreed it is the wrong tool for many cases. But it is easy for students to understand since they are already being taught to use lists for everything else. It would be great if you can talk with teachers of Haskell and figure out a better way to teach text processing.
I think a helpful question might be whether [Char] is mainly used to teach about lists, or whether it's mainly used to teach about how to do Unicode text processing correctly. If it's mainly used to teach about lists, pattern matching, etc., as I suspect, then the fine details of Unicode don't matter so much, you could even work with ASCII-only strings and it would work equally well for teaching about lists.
I agree that if the purpose is to teach list and list pattern matching, it does not matter much what the element type is as long as it follows reasonable constraints. However, as someone observed earlier, the Haskell Report is not a vehicle to prescribe how Haskell should be taught or for what reasons Haskell should be taught. That argument, while it was made to support String = [Char] for pedagogical purposes, is in fact a good argument against.
How to do Unicode text processing correctly is a topic that seems like it would become important much later, when someone's going to write code that's meant to be used in a production environment. Most students in an introductory university course probably don't get close to that point. If you do want to teach about how to do Unicode text processing correctly (which, for the record, is an important issue irrespective of which programming language you're using) then presumably you want to teach about Text, but hopefully your students will be more advanced by then and it won't be so much of a problem.
The Haskell Report claims very prominently that it uses the Unicode character set. The question is whether it should be using it correctly at all, and if so should it even try to pretend that its default string type use those characters correctly. I do not subscribe to the notion that simple correct text processing is something students would have to learn only in "advanced" classes dedicated to Unicode. In the region of this side of the Atlantic Ocean where I teach, the student population is very diverse and I do think it would responsible to stand in front of students and say: You are all welcome; this class is open to all cultures and we are committed to diversity and equal opportunity. However, for the purpose of simplicity and pedagogy, we would refrain from looking at texts from this and other students. If you are really interested, you should take an advanced class. I hope you enjoy the class. Furthermore, I am not convinced it is a good strategy to try hard to reflect the notion that text processing is hard, either in the language or in its presentation (e.g. it is advanced topic, you need to be advanced before we talk about it.)
I'm not really sure what that recommends in terms of policy. Mainly what you need is "it should be possible to work with lists of characters" and "it should be possible to work with Text", which we more-or-less have already. The important bits seem to be OverloadedStrings and ideally some way to avoid a pervasive API bias towards the wrong type (the tradeoffs there are probably more tricky). (So... basically what Simon M. said.)
-- Gaby

On 26 Mar 2012, at 15:30, Gabriel Dos Reis wrote:
The Haskell Report claims very prominently that it uses the Unicode character set. The question is whether it should be using it correctly at all, and if so should it even try to pretend that its default string type use those characters correctly.
In the region of this side of the Atlantic Ocean where I teach, the student population is very diverse
Prelude> putStrLn (take 5 "Fröhßen") Fröhß Prelude> putStrLn "фхцчшщъыьэюя" фхцчшщъыьэюя Prelude> putStrLn "देवनागरी" देवनागरी Prelude> putStrLn "아햇글" 아햇글 Prelude> putStrLn (take 2 "아햇글") 아햇 Whilst there are some places that things do not work out fully correctly - several such warts have been mentioned already in this thread - on the whole, I think Haskell's [Char] is fundamentally not as culturally restrictive as you make out. Regards, Malcolm

On Mon, Mar 26, 2012 at 7:48 AM, Malcolm Wallace
In the region of this side of the Atlantic Ocean where I teach, the student population is very diverse
Prelude> putStrLn (take 5 "Fröhßen") Fröhß
ghci> putStrLn "Fro\x0308hßen" Fröhßen ghci> putStrLn (take 5 "Fro\x0308hßen") Fröh Your example works because your input happens to be in a normal form. -- Johan

On 26 Mar 2012, at 16:20, Johan Tibell wrote:
On Mon, Mar 26, 2012 at 7:48 AM, Malcolm Wallace
wrote: In the region of this side of the Atlantic Ocean where I teach, the student population is very diverse
Prelude> putStrLn (take 5 "Fröhßen") Fröhß
ghci> putStrLn "Fro\x0308hßen" Fröhßen ghci> putStrLn (take 5 "Fro\x0308hßen") Fröh
Your example works because your input happens to be in a normal form.
Yes indeed. And I think it would be perfectly reasonable for the String (= [Char]) API to have a function "normalise :: String -> String" which would let the user deal with this issue as they see fit. After all, if you are aware of the difference between combining characters and normalised characters, then you will want to make your own decision about what semantics you want from operations like "take". Regards, Malcolm

On Mon, Mar 26, 2012 at 8:34 AM, Malcolm Wallace
Yes indeed. And I think it would be perfectly reasonable for the String (= [Char]) API to have a function "normalise :: String -> String" which would let the user deal with this issue as they see fit. After all, if you are aware of the difference between combining characters and normalised characters, then you will want to make your own decision about what semantics you want from operations like "take".
Normalization isn't quite enough unfortunately, as it does solve e.g. upcase = map toUppper You need all-at-once functions on strings (which we could add.) I'm just pointing out that most (all?) list functions do the wrong thing when used on Strings.

On 03/26/2012 05:50 PM, Johan Tibell wrote:
Normalization isn't quite enough unfortunately, as it does solve e.g.
upcase = map toUppper
You need all-at-once functions on strings (which we could add.) I'm just pointing out that most (all?) list functions do the wrong thing when used on Strings.
Hm, do you have any other examples besides toUpper/toLower? Also, that example is not really an argument against using list functions on strings (which, by any reasonable definition, seem to be "sequences of characters" -- whether that sequence is represented as a list, an array, or something else, seems more like an implementation detail to me). Rather, it indicates the fact that Char.toUpper may have to wrong type. If its type was Char -> String instead of Char -> Char, it could handle things like toUppper 'ß' == "SS" correctly. Then stuff like upcase = concatMap toUppper would work fine. As it is, the problem seems to be with Char, not with [Char]. Best regards Christian -- |------- Dr. Christian Siefkes ------- christian@siefkes.net ------- | Homepage: http://www.siefkes.net/ | Blog: http://www.keimform.de/ | Peer Production Everywhere: http://peerconomy.org/wiki/ |---------------------------------- OpenPGP Key ID: 0x346452D8 -- Failure is just success rounded down. -- Ryan North

On Mon, Mar 26, 2012 at 9:42 AM, Christian Siefkes
On 03/26/2012 05:50 PM, Johan Tibell wrote:
Normalization isn't quite enough unfortunately, as it does solve e.g.
upcase = map toUppper
You need all-at-once functions on strings (which we could add.) I'm just pointing out that most (all?) list functions do the wrong thing when used on Strings.
Hm, do you have any other examples besides toUpper/toLower?
length, cons, head, tail, filter, folds, anything that works on an element-by-element basis.
Also, that example is not really an argument against using list functions on strings (which, by any reasonable definition, seem to be "sequences of characters" -- whether that sequence is represented as a list, an array, or something else, seems more like an implementation detail to me).
I agree on the second part. As someone pointed out earlier, we should be careful in using the word character as the Unicode code point doesn't correspond well to the commonly used concept of a character. What we have today is really: type String = [CodePoint] What you would normally think of as a character might consists of several code points.
Rather, it indicates the fact that Char.toUpper may have to wrong type. If its type was Char -> String instead of Char -> Char, it could handle things like toUppper 'ß' == "SS" correctly. Then stuff like
upcase = concatMap toUppper
would work fine.
Yes.
As it is, the problem seems to be with Char, not with [Char].
[Char] is a semantically OK representation of a Unicode string, using an array like text does is simply an optimization. However, using the list function defined by the Prelude is not a good idea if you want to process a Unicode string correctly. -- Johan

On 03/26/2012 06:58 PM, Johan Tibell wrote:
On Mon, Mar 26, 2012 at 9:42 AM, Christian Siefkes
wrote: On 03/26/2012 05:50 PM, Johan Tibell wrote:
Normalization isn't quite enough unfortunately, as it does solve e.g.
upcase = map toUppper
You need all-at-once functions on strings (which we could add.) I'm just pointing out that most (all?) list functions do the wrong thing when used on Strings.
Hm, do you have any other examples besides toUpper/toLower?
length, cons, head, tail, filter, folds, anything that works on an element-by-element basis.
Hm, but aren't these all matters of Unicode normalization? Your argument seems to go in circles, since above you wrote: "Normalization isn't quite enough unfortunately".
I agree on the second part. As someone pointed out earlier, we should be careful in using the word character as the Unicode code point doesn't correspond well to the commonly used concept of a character.
Indeed, as Thomas pointed out, the term "Unicode character" is not well-defined, hence we should prefer the term "Unicode code point" which seems to come closest. What else should length return, if not the number of code points in a string (possibly after normalization)? What else should head or take x return, if not the first (x) code point(s) of a string (possibly after normalization)? Best regards Christian -- |------- Dr. Christian Siefkes ------- christian@siefkes.net ------- | Homepage: http://www.siefkes.net/ | Blog: http://www.keimform.de/ | Peer Production Everywhere: http://peerconomy.org/wiki/ |---------------------------------- OpenPGP Key ID: 0x346452D8 -- What everybody echoes or in silence passes by as true today may turn out to be falsehood tomorrow. -- Henry D. Thoreau, Walden

On Mon, Mar 26, 2012 at 1:12 PM, Christian Siefkes
On 03/26/2012 06:58 PM, Johan Tibell wrote:
On Mon, Mar 26, 2012 at 9:42 AM, Christian Siefkes
wrote: On 03/26/2012 05:50 PM, Johan Tibell wrote:
Normalization isn't quite enough unfortunately, as it does solve e.g.
upcase = map toUppper
You need all-at-once functions on strings (which we could add.) I'm just pointing out that most (all?) list functions do the wrong thing when used on Strings.
Hm, do you have any other examples besides toUpper/toLower?
length, cons, head, tail, filter, folds, anything that works on an element-by-element basis.
Hm, but aren't these all matters of Unicode normalization? Your argument seems to go in circles, since above you wrote: "Normalization isn't quite enough unfortunately".
Unicode contains a set of precomposed characters, like ö, that can be
normalized to a single code point, but this is not true of every
combination of characters.
Prelude> "ā́"
"a\772\769"
As far as I know, there is no representation of ā́ which uses a single
code point.
--
Dave Menendez

On Mon, Mar 26, 2012 at 11:42 AM, Christian Siefkes
Also, that example is not really an argument against using list functions on strings (which, by any reasonable definition, seem to be "sequences of characters" -- whether that sequence is represented as a list, an array, or something else, seems more like an implementation detail to me).
The correctness problems isn't that a list is used to represent a sequence. The problem is that that representational detail (and I agree with you it is an implementation of sequence) is made part of the API on strings. -- Gaby

Hi all, Johan Tibell wrote:
Normalization isn't quite enough unfortunately, as it does solve e.g.
upcase = map toUppper
You need all-at-once functions on strings (which we could add.) I'm just pointing out that most (all?) list functions do the wrong thing when used on Strings.
So, is the argument to deprecate Char, then? As long as Haskell allows Chars to be handled in isolation, it would seem impossible to prevent naive users from accidentally stumbling over the complexities of Unicode? And, to be honest, even with a well thought-out Text API, I don't think it is going to be possible to hide the complexities of Unicode. For example, just take a quick look at http://en.wikipedia.org/wiki/Unicode_equivalence There are canonical equivalence and compatibility, and each has two normal forms (fully composed and fully decomposed), and "each of these four normal forms can be used in text processing". As an example of the difference between "equivalent" and "compatible", the ligature "ff" is "compatible - but not canonically equivalent" to a sequence of two characters latin "f", meaning they "may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others". Is it realistic to think that if only Haskell used Text and not String = [Char], a naive user/beginner would be able to write correct code for all manner of text processing tasks without needing to understand a great deal about Unicode? I'm sorry, but I'm rather sceptical. So I reiterate that I see little if any gain, be it in terms of making life simpler for beginners, making Haskell more "multi cultural", or giving Haskell applications in general a performance boost, in deprecating String = [Char] and mandating the use of Text. But the costs would be massive. Best, /Henrik -- Henrik Nilsson School of Computer Science The University of Nottingham nhn@cs.nott.ac.uk

On Mon, Mar 26, 2012 at 9:59 AM, Henrik Nilsson
So, is the argument to deprecate Char, then? As long as Haskell allows Chars to be handled in isolation, it would seem impossible to prevent naive users from accidentally stumbling over the complexities of Unicode?
I haven't proposed anything at all. Someone asked why one should prefer Text to String. I showed that the former is more correct (given the currently available APIs) and much faster.
There are canonical equivalence and compatibility, and each has two normal forms (fully composed and fully decomposed), and "each of these four normal forms can be used in text processing".
As an example of the difference between "equivalent" and "compatible", the ligature "ff" is "compatible - but not canonically equivalent" to a sequence of two characters latin "f", meaning they "may be treated the same way in some applications (such as sorting and indexing), but not in others; and may be substituted for each other in some situations, but not in others".
Is it realistic to think that if only Haskell used Text and not String = [Char], a naive user/beginner would be able to write correct code for all manner of text processing tasks without needing to understand a great deal about Unicode?
I'm sorry, but I'm rather sceptical.
Why? We can hide most of these details behind the Text API. We can pick which encoding and normal form is used internally and then have the externally provided API for e.g. sorting do the right thing.
So I reiterate that I see little if any gain, be it in terms of making life simpler for beginners, making Haskell more "multi cultural", or giving Haskell applications in general a performance boost, in deprecating String = [Char] and mandating the use of Text. But the costs would be massive.
I agree and thus I don't propose we do something like that. The way this will go down is that part of the Prelude and other base modules will eventually be replaced by more modern packages (e.g. see system-fileio) and the use of String will decline. Unfortunately it's a bit of a painful transition as today we need to convert back and forth between the two string types quite a lot. -- Johan

On Mon, Mar 26, 2012 at 08:20:45AM -0700, Johan Tibell wrote:
On Mon, Mar 26, 2012 at 7:48 AM, Malcolm Wallace
wrote: In the region of this side of the Atlantic Ocean where I teach, the student population is very diverse
Prelude> putStrLn (take 5 "Fröhßen") Fröhß
ghci> putStrLn "Fro\x0308hßen" Fröhßen ghci> putStrLn (take 5 "Fro\x0308hßen") Fröh
Your example works because your input happens to be in a normal form.
I am very unicode-ignorant, so apologies if I have misunderstood something, but doesn't Text do the same thing? Prelude T> import Data.Text.IO as T Prelude T T> T.putStrLn (T.take 5 (T.pack "Fro\x0308hßen")) Fröh Maybe your point is that neither "take" function should be used with unicode strings, but I don't see how advocating the Text type is going to help with that. Thanks Ian

On Mon, Mar 26, 2012 at 10:12 AM, Ian Lynagh
I am very unicode-ignorant, so apologies if I have misunderstood something, but doesn't Text do the same thing?
Prelude T> import Data.Text.IO as T Prelude T T> T.putStrLn (T.take 5 (T.pack "Fro\x0308hßen")) Fröh
Maybe your point is that neither "take" function should be used with unicode strings, but I don't see how advocating the Text type is going to help with that.
We already covered this. Text inherited a list-based API, even if that sometimes doesn't make sense. To work with Unicode you need more specific functions for different tasks. Text only implements a few so far, like case conversion and case-less comparison, and asks you to use text-icu for the rest. -- Johan

Can the unicode experts here propose a Text API whose functions work
for all Unicode (start by removing list functions)? If there is such a
satisfactory API and it does not conflict with the Prelude we could
use it unqualified.
On Mon, Mar 26, 2012 at 10:21 AM, Johan Tibell
On Mon, Mar 26, 2012 at 10:12 AM, Ian Lynagh
wrote: I am very unicode-ignorant, so apologies if I have misunderstood something, but doesn't Text do the same thing?
Prelude T> import Data.Text.IO as T Prelude T T> T.putStrLn (T.take 5 (T.pack "Fro\x0308hßen")) Fröh
Maybe your point is that neither "take" function should be used with unicode strings, but I don't see how advocating the Text type is going to help with that.
We already covered this. Text inherited a list-based API, even if that sometimes doesn't make sense.
To work with Unicode you need more specific functions for different tasks. Text only implements a few so far, like case conversion and case-less comparison, and asks you to use text-icu for the rest.
-- Johan
_______________________________________________ Haskell-prime mailing list Haskell-prime@haskell.org http://www.haskell.org/mailman/listinfo/haskell-prime

On Mon, Mar 26, 2012 at 13:12, Ian Lynagh
Maybe your point is that neither "take" function should be used with unicode strings, but I don't see how advocating the Text type is going to help with that.
I think we established earlier that the list-like operations on Text are a backward compatibility wart. Either they should go away, or they should be modified to operate on some other level than codepoints. Probably the way the ecosystem should work is that [Char] (or possibly a packed version thereof, sort of like lazy ByteStrings with Word32 instead of Word8 as the fundamental unit) is the codepoint view and Text is the grapheme view; both are necessary at various times, but the grapheme view is the more natural one for text /per se/. -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Mon, Mar 26, 2012 at 6:21 PM, Brandon Allbery
On Mon, Mar 26, 2012 at 13:12, Ian Lynagh
wrote: Maybe your point is that neither "take" function should be used with unicode strings, but I don't see how advocating the Text type is going to help with that.
I think we established earlier that the list-like operations on Text are a backward compatibility wart. Either they should go away, or they should be modified to operate on some other level than codepoints. Probably the way the ecosystem should work is that [Char] (or possibly a packed version thereof, sort of like lazy ByteStrings with Word32 instead of Word8 as the fundamental unit) is the codepoint view and Text is the grapheme view; both are necessary at various times, but the grapheme view is the more natural one for text /per se/.
Does this mean we've firmly established that there currently is *no* completely satisfactory method of dealing with Unicode in existence today? In that case, even if it /will be/ a good idea one day, can't we agree that it's not the right time to deprecate String = [Char]? The language has a good history, I understand, of not standardising that which is not implemented and in common use, so if we'd like to change Text before introducing it to the language, I say let's do that separately. No-one's yet argued against OverloadedStrings. I think there /is/ an argument to be made, that it introduces ambiguity and could break existing programs (probably we can extend defaulting to take care of this, but I think there are people who'd be happier if we killed defaulting too). Too much polymorphism /can/ be a bad thing. But I think there's a serious chance we can make that happen, and make Text a bit more pleasant to work with. A passing thought: nearly anything can be made an instance of IsString, via something read-like. This prospect upsets me a little :) Maybe it would make sense to introduce additional methods, as in Num, to make sure that some sense is maintained (perhaps toString?)

On Mon, Mar 26, 2012 at 5:56 PM, Ben Millwood
On Mon, Mar 26, 2012 at 13:12, Ian Lynagh
wrote: Maybe your point is that neither "take" function should be used with unicode strings, but I don't see how advocating the Text type is going to help with that.
I think we established earlier that the list-like operations on Text are a backward compatibility wart. Either they should go away, or they should be modified to operate on some other level than codepoints. Probably the way the ecosystem should work is that [Char] (or possibly a packed version thereof, sort of like lazy ByteStrings with Word32 instead of Word8 as
On Mon, Mar 26, 2012 at 6:21 PM, Brandon Allbery
wrote: the fundamental unit) is the codepoint view and Text is the grapheme view; both are necessary at various times, but the grapheme view is the more natural one for text /per se/.
Does this mean we've firmly established that there currently is *no* completely satisfactory method of dealing with Unicode in existence today? In that case, even if it /will be/ a good idea one day, can't we agree that it's not the right time to deprecate String = [Char]? The language has a good history, I understand, of not standardising that which is not implemented and in common use, so if we'd like to change Text before introducing it to the language, I say let's do that separately.
Agreed. Haskell-prime has in the past gone out of its way not to standardize anything that hasn't actually been implemented somewhere.
No-one's yet argued against OverloadedStrings. I think there /is/ an argument to be made, that it introduces ambiguity and could break existing programs (probably we can extend defaulting to take care of this, but I think there are people who'd be happier if we killed defaulting too). Too much polymorphism /can/ be a bad thing. But I think there's a serious chance we can make that happen, and make Text a bit more pleasant to work with.
I would agree that OverloadedStrings definitely needs to work with defaulting better! I've been bitted by this in my own code. In the past I have longed for a trivial class class Defaulting a which merely served to meet the conditions for extended default from http://www.haskell.org/ghc/docs/6.6/html/users_guide/ch03s04.html#extended-d... * *
*All* of the classes Ci are single-parameter type classes. At least one of the classes Ci is numeric, *or is Show, Eq, or Ord*.
by adding Defaulting to the list of extra classes the language permits. In the special case of ghc where we have polymorphic kinds now, the Defaulting class could allow for defaulting rules to then be applied trivially for higher kinds, and when you have new classes you really want defaulting to apply, like in this case IsString, you could just add Defaulting as a superclass constraint. A passing thought: nearly anything can be made an instance of
IsString, via something read-like. This prospect upsets me a little :) Maybe it would make sense to introduce additional methods, as in Num, to make sure that some sense is maintained (perhaps toString?)
I don't like the idea of adding a toString to IsString. One of the more compelling usecases is in DSLs. Just because I can make an instance IsString Expr where fromString = StringLiteral to get a syntaxless embedding of string literals into my language for some expression type, doesn't mean that every expression is a string literal. Ultimately, abuses of IsString for parsing should probably just be treated with the same general distaste as the use of Show instances that don't show source representations. -Edward

No-one's yet argued against OverloadedStrings. I think there /is/ an argument to be made, that it introduces ambiguity and could break existing programs (probably we can extend defaulting to take care of
Definitely, I have ones that would need some :: sprinkled in.
this, but I think there are people who'd be happier if we killed defaulting too). Too much polymorphism /can/ be a bad thing. But I
Also me :) I'd like that 'default' keyword back too, it's too useful for variables names to waste on something so obscure. That said, I like OverloadedStrings.

Ben Millwood wrote:
No-one's yet argued against OverloadedStrings. I think there /is/ an argument to be made, that it introduces ambiguity and could break existing programs (probably we can extend defaulting to take care of this, but I think there are people who'd be happier if we killed defaulting too).
For the record, I have a more fundamental objection to OverloadedStrings. When library authors write partial functions for the fromString method - and we see in practice that they do so - it creates the shocking situation in which invalid syntax of a literal is only caught at run time. Instead, Haskell' should provide a pragma, or other syntactic mechanism, by which compilers can allow specific types other than String for string literals at *compile* time in a monomorphic context. The implementation of that syntax would be compiler dependent, of course. For GHC, Template Haskell could help. Another approach would be for a compiler to have a built-in way of embedding, say, Text literals directly in object code. For people who like OverloadedStrings despite this problem, it could be available via a pragma. But never by default - it should be possible to get fundamental string types like Text and ByteString for string literals without having to turn on OverloadedStrings. In GHC, quasi-quoters are the right way to provide convenient user-defined syntax for string-like types other than the canonical string types. Theoretically, polymorphism for numeric literals is just as bad. But in practice, it seems to be much more rare for people to implement fromIntegral and fromRational with dangerous functions. -Yitz

On Mon, Mar 26, 2012 at 13:12, Ian Lynagh
Maybe your point is that neither "take" function should be used with unicode strings, but I don't see how advocating the Text type is going to help with that.
I think we established earlier that the list-like operations on Text are a backward compatibility wart. Either they should go away, or they should be modified to operate on some other level than codepoints. Probably the way the ecosystem should work is that [Char] (or possibly a packed version thereof, sort of like lazy ByteStrings with Word32 instead of Word8 as the fundamental unit) is the codepoint view and Text is the grapheme view; both are necessary at various times, but the grapheme view is the more natural one for text /per se/. -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Sat, Mar 24, 2012 at 8:51 PM, Johan Tibell
On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis
wrote: I think there is a confusion here. A Unicode character is an abstract entity. For it to exist in some concrete form in a program, you need an encoding. The fact that char16_t is 16-bit wide is irrelevant to whether it can be used in a representation of a Unicode text, just like uint8_t (e.g. 'unsigned char') can be used to encode Unicode string despite it being only 8-bit wide. You do not need to make the character type exactly equal to the type of the individual element in the text representation.
Well, if you have a >21-bit type you can declare its value to be a Unicode code point (which are numbered.)
That is correct. Because not all Unicode points represent characters, and not all Unicode code point sequences represent valid characters, even if you have that >21-bit type T, the list type [T] would still not be a good string type.
Using a char* that you claim contain utf-8 encoded data is bad for safety, as there is no guarantee that that's indeed the case.
Indeed, and that is why a Text should be an abstract datatype, hiding the concrete implementation away from the user.
Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient as far as text processing goes; you also need a localization at the minimum. It is the combination of the two that gives some meaning to text representation and operations.
text does that via ICU. Some operations would be possible without using the locale, if it wasn't for those Turkish i:s. :/
yeah, 7 bits should be enough for every character ;-) -- Gaby

On Sat, Mar 24, 2012 at 5:33 PM, Freddie Manners
To add my tuppence-worth on this, addressed to no-one in particular:
(1) I think getting hung up on UTF-8 correctness is a distraction here. I can't imagine anyone suggesting that the C/C++ standards removed support for (char*) because it wasn't UTF-8 correct: sure, you'd recommend people use a different type when it matters, but the language standard itself shouldn't be driven by technical issues that don't affect most people most of the time. I'm sure it's good engineering practice to worry about these things, but the standard isn't there to encourage good engineering practice.
C++ does not consider 'char*' as the type of a string. It has a standard template std::basic_string that can be instantiated on char (giving std::string) or encoding type (of unicode characters) char16_t, char32_t, and wchar_t giving rise to u16string, u32string, and wstring. It has a large number of functions to manipulate a string as a sequence (Haskell's statu quo) or as a text thanks to an elaborated localization machinery. -- Gaby, back to lurking mode

On Sat, Mar 24, 2012 at 16:16, Ian Lynagh
On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List. And in fact, very few functions there /don't/ look like they are directly equivalent to list functions.
I was under the impression they have been very carefully designed to do the right thing with characters represented by multiple codepoints, which is something the String version *cannot* do. It would help if Bryan were involved with this discussion, though. (I'm cc:ing him on this.) Since the whole point of Data.Text is to handle stuff like this properly I would be surprised if your assertion that
upcase :: String -> String
upcase = map toUpper
This is no more incorrect than upcase = Data.Text.map toUpper
is correct. -- brandon s allbery allbery.b@gmail.com wandering unix systems administrator (available) (412) 475-9364 vm/sms

On Sat, Mar 24, 2012 at 2:31 PM, Brandon Allbery
I was under the impression they have been very carefully designed to do the right thing with characters represented by multiple codepoints, which is something the String version *cannot* do. It would help if Bryan were involved with this discussion, though. (I'm cc:ing him on this.) Since the whole point of Data.Text is to handle stuff like this properly I would be surprised if your assertion that
upcase :: String -> String upcase = map toUpper
This is no more incorrect than upcase = Data.Text.map toUpper
is correct.
This is simply not possible given the Unicode specification. There's no code point that corresponds to the two characters used to represent an upcased version of the essets. I think the list based API predates Bryan. -- Johan

On Sat, Mar 24, 2012 at 05:31:48PM -0400, Brandon Allbery wrote:
On Sat, Mar 24, 2012 at 16:16, Ian Lynagh
wrote: On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List. And in fact, very few functions there /don't/ look like they are directly equivalent to list functions.
I was under the impression they have been very carefully designed to do the right thing with characters represented by multiple codepoints, which is something the String version *cannot* do. It would help if Bryan were involved with this discussion, though. (I'm cc:ing him on this.) Since the whole point of Data.Text is to handle stuff like this properly I would be surprised if your assertion that
upcase :: String -> String
upcase = map toUpper
This is no more incorrect than upcase = Data.Text.map toUpper
is correct.
I don't see how it could do any better, given both use toUpper :: Char -> Char to do the hard work. That's why there is also a Data.Text.toUpper :: Text -> Text Based on a very quick skim I think that there are only 3 such functions in Data.Text (toCaseFold, toLower, toUpper), although the 3 justification functions may handle double-width characters properly. Anyway, my main point is that I don't think that either text or String should make it any easier for people to get things right. It's true that currently only text makes correct case-conversions easy, but only because no-one's written Data.String.to* yet. Thanks Ian

On 24 March 2012 22:27, Ian Lynagh
On Sat, Mar 24, 2012 at 05:31:48PM -0400, Brandon Allbery wrote:
On Sat, Mar 24, 2012 at 16:16, Ian Lynagh
wrote: On Sat, Mar 24, 2012 at 11:50:10AM -0700, Johan Tibell wrote:
Using list-based operations on Strings are almost always wrong
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List. And in fact, very few functions there /don't/ look like they are directly equivalent to list functions.
I was under the impression they have been very carefully designed to do the right thing with characters represented by multiple codepoints, which is something the String version *cannot* do. It would help if Bryan were involved with this discussion, though. (I'm cc:ing him on this.) Since the whole point of Data.Text is to handle stuff like this properly I would be surprised if your assertion that
upcase :: String -> String
upcase = map toUpper
This is no more incorrect than upcase = Data.Text.map toUpper
is correct.
I don't see how it could do any better, given both use toUpper :: Char -> Char to do the hard work. That's why there is also a Data.Text.toUpper :: Text -> Text
Based on a very quick skim I think that there are only 3 such functions in Data.Text (toCaseFold, toLower, toUpper), although the 3 justification functions may handle double-width characters properly.
Anyway, my main point is that I don't think that either text or String should make it any easier for people to get things right. It's true that currently only text makes correct case-conversions easy, but only because no-one's written Data.String.to* yet.
The reason Text uses UTF16 internally is so that it can be used with the ICU library (written in C, I think) which implements all the difficult things (http://hackage.haskell.org/package/text-icu). Reimplementing all that in Haskell would be a significant undertaking. You could do the same for String, but that would have to encode and re-encode on each invokation. BTW, I checked the version history of the text package and most of the list functions existed already in Tom Harper's version that text was based on in 2009. If you look at the documentation you can see that many of the list-like functions treat some invalid characters specially, so they are different.

On Sat, Mar 24, 2012 at 1:16 PM, Ian Lynagh
Data.Text seems to think that many of them are worth reimplementing for Text. It looks like someone's systematically gone through Data.List. And in fact, very few functions there /don't/ look like they are directly equivalent to list functions.
I'm not sure why the list-inspired functions are there. It doesn't really matter. It doesn't change the fact that from a Unicode perspective they give the wrong result in most situations.
This is no more incorrect than upcase = Data.Text.map toUpper
No and that's why Bryan added a correct case-modification, case folding, etc to text.
There's no reason that there couldn't be a Data.String.toUpper corresponding to Data.Text.toUpper.
That's true. But this isn't the point we were discussing. We were discussing whether the simplification of treating strings as a list is a good thing (from an educational perspective.) I pointer out that from a correctness perspective it's wrong.
I think Heinrich meant 20% performance in a useful program, not a micro-benchmark.
I that's what he meant and given that "useful program" isn't defined, so the 20% number is completely arbitrary. -- Johan

On 03/24/2012 02:50 PM, Johan Tibell wrote:
[...] Furthermore, the memory overhead of Text is smaller, which means that applications that hold on to many string value will use less heap and thus experience smaller "freezes" due major GC collections, which are linear in the heap size.
How is Text for small strings currently (e.g. one English word, if not one character)? Can we reasonably recommend it for that? This recent question suggests it's still not great: http://stackoverflow.com/questions/9398572/memory-efficient-strings-in-haske...

On Sat, Mar 24, 2012 at 3:45 PM, Isaac Dupree
How is Text for small strings currently (e.g. one English word, if not one character)? Can we reasonably recommend it for that? This recent question suggests it's still not great: http://stackoverflow.com/questions/9398572/memory-efficient-strings-in-haske...
It's definitely not as good as it could be with the common case being 2 bytes per code point and then some fixed overhead. The UTF-8 GSoC project last summer was an attempt to see if we could do better, but unfortunately GHC does a worse job streaming out of a byte array containing utf-8 than out of a byte array containing utf-16 (due to bad branch layout.) This resulted in some performance gains and some performance losses, with some more wins and losses. As there are other engineering benefits in favor of utf-16 (e.g. being able to use ICU efficiently) we opted for not switching the decoding. If we can get GHC to the point where it compiles an utf-8 based Text really well, we could reconsider this decision. There's also a design trade-off in Text that favors better asymptotic complexity for some operations (e.g. taking substrings) that adds 2 words of overhead to every string. -- Johan
participants (23)
-
Ben Millwood
-
Brandon Allbery
-
Christian Siefkes
-
Colin Paul Adams
-
Daniel Peebles
-
David Menendez
-
Edward Kmett
-
Evan Laforge
-
Freddie Manners
-
Gabriel Dos Reis
-
Greg Weber
-
Gábor Lehel
-
Heinrich Apfelmus
-
Henrik Nilsson
-
Henrik Nilsson
-
Ian Lynagh
-
Isaac Dupree
-
Johan Tibell
-
Malcolm Wallace
-
Nate Soares
-
Simon Marlow
-
Thomas Schilling
-
Yitzchak Gale