Re: Haskell Platform Proposal: add the 'text' library

8 Sep 2010

      On Tue, Sep 7, 2010 at 6:21 PM, Duncan Coutts
wrote:
...
Here's an alternative argument: suppose we change the representation
of strict text to be a tree of chunks (e.g. finger tree). We could
achieve effecient concatenation. This representation would be
impossible while preserving semantics of a lazy tail. A tree impl that
has any kind of balance needs to know the overall length so cannot
have a lazy tail.
A minor addendum to this point, something based on a bootstrapped skew
binomial list of bytestrings can give you a lazy tail and O(log n) indexing
and drop, but it loses the cheap append which was the original motivation,
so while I find it occasionally useful as a quickly indexable, potentially
infinite list, with interesting asymptotics, I'd hesitate to propose it for
any sort of standard library.
...
Did you consider keeping the number of characters in the Text directly?
Is there a reason it couldn't be done?
There's little point. Knowing the length does not usually help you
...
save any other O(n) operations. It'd also only help for strict text,
not lazy. Just like lists, asking for the length is usually not a good
idea.
Actually it _can_ help quite a bit to know the number of characters (not
just words) present in the input:

It can tell you of the absence of surrogate pairs if the length (in
characters) is the same as the number of words. This then lets you do
indexing as an O(1) operation. I used this in a simple fingertree of utf-8
encoded bytestrings to enable faster indexing into leaves that didn't have
utf-8 tailbytes present. Since a lot of UTF-8 text has huge swathes of ASCII
content, this turned out to be a pretty big win for me. Since the vast
majority of UTF-16 text probably contains all Plane 0 (UCS2) content, you
can get similar wins. The cost is that you either have to have a sentinel
'don't know' value or pay to scan any arrays that you directly convert to
Text.

As an aside:

I want to like and use Data.Text, but I haven't been able to. Too many of
the operations devolve to O(n) that could be O(1) if I had some way to slice
based on cursors into the data type, and the current API locks me out of the
internals, so I can't write those functions myself. With ByteString at
least, if I must, I can go grab the Data.Bytestring.Internal module and have
at it. I'm all for a clean API, but a clean API that hurts asymptotics
invites a lot of re-invention on the fringes of the community.

But really the last nail in the coffin for me, is that often I want to be
able to just mmap data in from disk and use it, and rarely is my data on
disk stored in UTF-16.

Those nits aside, overall, Text provides a clean API for what it does, and
I'm completely on board with its inclusion in the platform (but I would
really really appreciate having access via a scary deprecated
please-don't-use-me Internal module to its guts!)

-Edward Kmett