Efficient string construction

(I've done a basic Google search on this with no results. Apologies if this has been asked before.)

I am coding a web application in which the content is a Unicode string built up over multiple functions and maintained in a State structure. I gather that the String module is inefficient and that Data.Text would be a better choice.

Is it more efficient to build up a list of Text objects over time and combine them together with a single Data.Text.concat for the final output or to run Data.Text.append for each new string so that I am maintaining a single Text object rather than a list? As Data.Text.append requires copying both strings each time, my gut feeling is that concat would be much more efficient, but Haskell has surprised me before, so I wanted to check.

Kevin
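(For concreteness, a minimal sketch of the two strategies being compared, assuming Control.Monad.State from mtl as the State structure; the type and helper names below are purely illustrative, not from the original application.)

    import qualified Data.Text as T
    import Control.Monad.State

    -- Strategy 1: collect chunks in a list, concatenate once at the end.
    type ChunkBuilder = State [T.Text]

    emitChunk :: T.Text -> ChunkBuilder ()
    emitChunk t = modify (t :)

    runChunkBuilder :: ChunkBuilder () -> T.Text
    runChunkBuilder b = T.concat (reverse (execState b []))

    -- Strategy 2: append to a single Text on every step.
    type AppendBuilder = State T.Text

    emitAppend :: T.Text -> AppendBuilder ()
    emitAppend t = modify (`T.append` t)

    runAppendBuilder :: AppendBuilder () -> T.Text
    runAppendBuilder b = execState b T.empty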

On Thursday 03 June 2010 16:03:11, Kevin Jardine wrote:
(I've done a basic Google search on this with no results. Apologies if this has been asked before.)
I am coding a web application in which the content is a Unicode string built up over multiple functions and maintained in a State structure.
I gather that the String module is inefficient and that Data.Text would be a better choice.
Is it more efficient to build up a list of Text objects over time and combine them together with a single Data.Text.concat for the final output or to run Data.Text.append for each new string so that I am maintaining a single Text object rather than a list?
As Data.Text.append requires copying both strings each time, my gut feeling is that concat would be much more efficient, but Haskell has surprised me before, so I wanted to check.
Kevin
I'd say, use Data.Text.Lazy and its 'fromChunks' function if you produce the string chunkwise. That avoids copying. Perhaps Data.ByteString[.Lazy].UTF8 is an even better choice than Data.Text (depends on what you do).
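(A minimal sketch of that suggestion, assuming the strict chunks are accumulated newest-first in the State field; the lazy Text just strings the existing chunks together without copying their contents.)

    import qualified Data.Text as T
    import qualified Data.Text.Lazy as TL

    -- Chunks collected newest-first, e.g. in a State field.
    finish :: [T.Text] -> TL.Text
    finish chunks = TL.fromChunks (reverse chunks)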

--- On Thu, 6/3/10, Daniel Fischer wrote:
Perhaps Data.ByteString[.Lazy].UTF8 is an even better choice than Data.Text (depends on what you do).
I thought that I had the differences between the three libraries figured out but I guess not now from what you say. I had thought that String was a simple but memory inefficient model, that Text was for, well text, and that bytestrings were for binary data (eg. images, audio files and applications that required a true view on each text byte).

So why is there a UTF8 implementation for bytestrings? Does that not duplicate what Text is trying to do? If so, why the duplication? When is each library more appropriate?

Kevin

On Thursday 03 June 2010 17:26:36, Kevin Jardine wrote:
--- On Thu, 6/3/10, Daniel Fischer wrote:
Perhaps Data.ByteString[.Lazy].UTF8 is an even better choice than Data.Text (depends on what you do).
I thought that I had the differences between the three libraries figured out but I guess not now from what you say.
I had thought that String was a simple but memory inefficient model, that Text was for, well text, and that bytestrings were for binary data (eg. images, audio files and applications that required a true view on each text byte).
Well, not necessarily. String can be quite memory efficient. As a stupid example,

    length (replicate 10000000 'a')

will need less memory than the equivalents using ByteString or Text. Less stupidly, if the String is lazily produced and consumed from head to last, String is memory efficient. And it's not necessarily much slower than ByteString or Text. In fact, String is sometimes faster than Text (cf. e.g. http://www.haskell.org/pipermail/haskell-cafe/2010-May/078220.html and following).

When you have to deal with text that is ASCII or latin1 (or some other encoding with a byte <-> char correspondence), plain ByteStrings are usually by far the fastest method. But that's of course a severe restriction.
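(For comparison, a sketch of the three equivalents; this assumes the Char8 interface for ByteString and a text API whose replicate takes a Text to repeat, hence the singleton.)

    import qualified Data.ByteString.Char8 as B
    import qualified Data.Text as T

    lenString, lenBytes, lenText :: Int
    lenString = length (replicate 10000000 'a')
    lenBytes  = B.length (B.replicate 10000000 'a')
    lenText   = T.length (T.replicate 10000000 (T.singleton 'a'))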
So why is there a UTF8 implementation for bytestrings? Does that not duplicate what Text is trying to do? If so, why the duplication?
I think Data.ByteString.UTF8 predates Data.Text.
When is each library more appropriate?
Generally, ByteString for binary data or text, when you know it's safe and you need the speed. For text, either String or Data.Text may be the better choice. IIRC, Data.Text uses utf-16 (or some other 16-bit encoding), so if you receive utf-8 encoded text, Data.ByteString.UTF8 can be the better choice. I haven't much experience with either Data.Text or Data.ByteString.UTF8, so I can't say much about their relative merits.
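(A small sketch of the two routes for UTF-8 input; Data.ByteString.UTF8 is from the utf8-string package, decodeUtf8 from the text package. decodeUtf8 throws on malformed input, so real code may prefer a checked variant.)

    import qualified Data.ByteString as B
    import qualified Data.ByteString.UTF8 as U    -- utf8-string package
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE     -- text package

    -- Keep the UTF-8 bytes as they are and decode characters on demand.
    asString :: B.ByteString -> String
    asString = U.toString

    -- Convert once into Text's internal representation (UTF-16 at the
    -- time of this thread) and work with Data.Text from then on.
    asText :: B.ByteString -> T.Text
    asText = TE.decodeUtf8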

On Thu, Jun 3, 2010 at 9:16 AM, Daniel Fischer wrote:
String can be quite memory efficient. As a stupid example,
length (replicate 10000000 'a')
will need less memory than the equivalents using ByteString or Text.
Actually, this will be fused with Data.Text, and should execute more quickly and in less space than String.

On Friday 04 June 2010 00:41:58, Bryan O'Sullivan wrote:
On Thu, Jun 3, 2010 at 9:16 AM, Daniel Fischer
wrote: String can be quite memory efficient. As a stupid example,
length (replicate 10000000 'a')
will need less memory than the equivalents using ByteString or Text.
Actually, this will be fused with Data.Text, and should execute more quickly and in less space than String.
Right, forgot about fusion. However, that requires the code to be compiled with optimisations, I think (well, one should never compile ByteString or Text code without). In which case

    {-# RULES "length/replicate" forall n x. length (replicate n x) = max 0 n #-}

would be at least as good as the Data.Text thing ;)

Or, to be more fair, if you use Data.List.Stream, it should be fused too and be as efficient as Data.Text.
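(A sketch of where such a rule would live, as a top-level pragma in the user's module; it only fires with optimisations enabled, and GHC may still inline replicate before the rule can match, so treat this as illustrative.)

    module Main where

    -- Rewrite length-of-replicate on lists to a constant; needs -O to fire.
    {-# RULES
    "length/replicate" forall n x. length (replicate n x) = max 0 n
      #-}

    main :: IO ()
    main = print (length (replicate (10000000 :: Int) 'a'))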

Daniel Fischer wrote:
So why is there a UTF8 implementation for bytestrings? Does that not duplicate what Text is trying to do? If so, why the duplication?
I think Data.ByteString.UTF8 predates Data.Text.
One difference is that Data.Text uses UTF-16 internally, not UTF-8.
When is each library more appropriate?
Much data is overwhelmingly ASCII, but with an option for non-ASCII in comments, labels, or similar. E.g., for biological sequence data, files can be large (the human genome is about 3GB) and non-ASCII characters can only occur in sequence headers, which constitute a minuscule fraction of the total data. So I use ByteString for this.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
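(As an illustration of that use case, a rough FASTA-style sketch using lazy ByteStrings; the file name is made up and real parsing would need more care.)

    import qualified Data.ByteString.Lazy.Char8 as L

    -- Count residue characters, skipping the '>' header lines where
    -- non-ASCII text may occur.
    main :: IO ()
    main = do
      contents <- L.readFile "genome.fa"
      let isHeader line = not (L.null line) && L.head line == '>'
          residues = sum [ L.length line | line <- L.lines contents
                                         , not (isHeader line) ]
      print residues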

You might also look at Data.Rope from the rope library, which provides an O(1) append for strict bytestring chunks, and the ability to decode UTF-8 chars from the result.

http://hackage.haskell.org/packages/archive/rope/0.6.1/doc/html/Data-Rope.ht...

I'd also be happy to work with you if the current API falls short of your needs.

-Edward Kmett
participants (5):
- Bryan O'Sullivan
- Daniel Fischer
- Edward Kmett
- Ketil Malde
- Kevin Jardine