
Hello Strings in haskell seem to be one major source of problems. I try to outline some of the problems I have faced and possible solutions. Size Handling large amounts of text as haskell strings is currently not possible as the representation (list of chars) is very inefficient. Serialization Most of the time when serializing a string, we want to handle it as an array of Word8. With lists one has to account for cycles and infinite length which make the encoding much more verbose and slow. The problem is not that some strings may not be trivially serializable, but rather that it is hard to find the easy cases. Typeclass instances It is currently hard to define typeclass instances for strings as String ( = [Char]) overlaps with [a]. Implementations provide solutions for this, but there should not be a need for workarounds in the first place. Show/Read The current Show/Read implementation makes it impossible to use with large strings. A read implementation needs to traverse the file looking for the terminating '"' and handling escape codes. A better solution would be to have an efficient (size prefixed) representation, maybe in a separate Serializable typeclass. But this would need the ablity to derive Serializable instances for algebraic datatypes automatically to avoid lots of useless code. Possible solutions The optimal solution should be compact and fast. A list of chunks is one solution - it would make it possible to e.g. mmap files (modulo encoding) and support fast concatenation. In addition one would need to support infinite and cyclic structures. Adding an alternative which corresponds to the current string abstraction would be sufficient. type CharT = Word8 data Str = S [(Ptr CharT, Int)] | I [CharT] A major problem is breaking old code. The main problem is that many functions are defined only on lists which forces strings to be lists. Making the functions polymorphic would solve a lot of problems and not only for Strings. There should be no performance penalty (at least in theory) when the compiler knows which instance is used at compile time. - Einar Karttunen

On Mon, 20 Sep 2004, Einar Karttunen wrote:
Size
Handling large amounts of text as haskell strings is currently not possible as the representation (list of chars) is very inefficient.
Efficiency is always a reason to mess everything. But the inefficiency applies to lists of every data type, so why optimizing only Strings, why not optimizing Lists in general, or better all similar data structures, as far as possible? Why not doing it in a transparent way by an optimizer in the compiler? This is certainly the more complicated task, but the more promising target for the long term. I very like to apply List functions to Strings, so the definition String = [Char] seems to me the most natural definition.
Typeclass instances
It is currently hard to define typeclass instances for strings as String ( = [Char]) overlaps with [a]. Implementations provide solutions for this, but there should not be a need for workarounds in the first place.
That's a problem, I also like to hear opinions about that. E.g. Show instance of String doesn't output ['b','l','a'] but "bla".

On 20.09 12:59, Henning Thielemann wrote:
Handling large amounts of text as haskell strings is currently not possible as the representation (list of chars) is very inefficient.
Efficiency is always a reason to mess everything. But the inefficiency applies to lists of every data type, so why optimizing only Strings, why not optimizing Lists in general, or better all similar data structures, as far as possible? Why not doing it in a transparent way by an optimizer in the compiler? This is certainly the more complicated task, but the more promising target for the long term. I very like to apply List functions to Strings, so the definition String = [Char] seems to me the most natural definition.
Optimizing all lists would be nice but choosing allways the correct behaviour would be quite hard. Of course if such an optimization would exists Strings would benefit from it. Making strings an abstract type would not preclude using such optimizations. But Strings could be optimized even before the optimization existed. The list of chars seems natural when thinking in terms of transformations, but it is not very natural when trying to interact with external world.
It is currently hard to define typeclass instances for strings as String ( = [Char]) overlaps with [a]. Implementations provide solutions for this, but there should not be a need for workarounds in the first place.
That's a problem, I also like to hear opinions about that. E.g. Show instance of String doesn't output ['b','l','a'] but "bla".
This is because Show has a special case for lists: class Show showsPrec :: Int -> a -> ShowS show :: a -> String showList :: [a] -> Shows This is not very elegant and does not help when using a boilerplate style traversal. - Einar Karttunen

G'day all.
Quoting Henning Thielemann
Efficiency is always a reason to mess everything.
OTOH, when efficiency matters, it REALLY matters. (The flip side of this is that "efficiency" doesn't always mean what you think it means.) The problem is that the current representation causes problems with text-heavy applications. Assume for a moment that Char is a 16-bit code point. Then under GHC, on a 32 bit architecture, [Char] uses four times as much memory as an unboxed array of Char. While I'm not for scrounging cycles when it's not necessary to do so, if a text-heavy Haskell program can only handle 25% of the data of the equivalent program on the same computer written in another language (say, ML), then that's a problem for Haskell. And text processing is a very large, important area these days.
But the inefficiency applies to lists of every data type, so why optimizing only Strings, why not optimizing Lists in general, or better all similar data structures, as far as possible?
That's an excellent question, and it is one of the motivations behind the various abstract container projects, which appear to have stalled. There's something to be said for the philosophy that a String is a kind of container, and you shouldn't care what kind of container, so long as it obeys these axioms.
I very like to apply List functions to Strings, so the definition String = [Char] seems to me the most natural definition.
So why would a function converting a String to a (lazy) list be inappropriate? Cheers, Andrew Bromage

[ haskell newbie alert ]
On 2004-09-20, Henning Thielemann
On Mon, 20 Sep 2004, Einar Karttunen wrote:
Size
Handling large amounts of text as haskell strings is currently not possible as the representation (list of chars) is very inefficient.
Efficiency is always a reason to mess everything. But the inefficiency
Well put.
applies to lists of every data type, so why optimizing only Strings, why not optimizing Lists in general, or better all similar data structures, as
Python has an interesting approach. Strings are not implemented as lists of characters, but they are an object of a generic type (I forget what it's called; iterable maybe) that can be accessed like a list. (Lists, of course, are also objects of that type.) I assume typeclasses could lead to the same situation in OCaml. It's not a list, but it looks and acts like a list... But anyway, having strings as lists of chars leads to some things that are convenient: * (++) is both a list and a string concatenation operator * Pattern matching works well with strings (that's my #1 gripe about strings in OCaml) * Thanks to the laziness of Haskell lists, things such as "lines" are possible -- eliminating the hassle of loops/recursion to read file data or the ugliness of reading an entire file at once. These are significant advantages to me.

G'day all.
Quoting John Goerzen
* (++) is both a list and a string concatenation operator
This could easily be handle by a typeclass.
* Pattern matching works well with strings (that's my #1 gripe about strings in OCaml)
* Thanks to the laziness of Haskell lists, things such as "lines" are possible -- eliminating the hassle of loops/recursion to read file data or the ugliness of reading an entire file at once.
These are good arguments for making Strings, however they're implemented, _convertable_ to lazy lists. Much like all of the other containers which Haskell currently implements. Cheers, Andrew Bromage

On Mon, Sep 20, 2004 at 01:11:34PM +0300, Einar Karttunen wrote:
Size
Handling large amounts of text as haskell strings is currently not possible as the representation (list of chars) is very inefficient.
You know about the PackedString functions, right? http://www.haskell.org/ghc/docs/6.0/html/base/Data.PackedString.html Peace, Dylan

On 20.09 15:05, Dylan Thurston wrote:
You know about the PackedString functions, right?
http://www.haskell.org/ghc/docs/6.0/html/base/Data.PackedString.html
Yes, but they are quite broken. I am using FastPackedString from darcs for many purposes, which is like PackedString in many ways. PackedStrings use full unicode codepoints (4*size in bytes), but having a char > 256 in them does not work if one wants to do IO. This means essentially that the wasted space cannot be used in any meaningfull way. Also concatenating PackedStrings is not very nice: concatPS pss = packString (concat (map unpackPS pss)) And most important they need a conversion (unpackPS), before using them with external libraries which expect Strings. - Einar Karttunen
participants (5)
-
ajb@spamcop.net
-
dpt@bostoncoop.net
-
Einar Karttunen
-
Henning Thielemann
-
John Goerzen