
On Tue, Sep 7, 2010 at 6:21 PM, Duncan Coutts
Here's an alternative argument: suppose we change the representation of strict text to be a tree of chunks (e.g. finger tree). We could achieve effecient concatenation. This representation would be impossible while preserving semantics of a lazy tail. A tree impl that has any kind of balance needs to know the overall length so cannot have a lazy tail.
A minor addendum to this point, something based on a bootstrapped skew binomial list of bytestrings can give you a lazy tail and O(log n) indexing and drop, but it loses the cheap append which was the original motivation, so while I find it occasionally useful as a quickly indexable, potentially infinite list, with interesting asymptotics, I'd hesitate to propose it for any sort of standard library.
Did you consider keeping the number of characters in the Text directly? Is there a reason it couldn't be done?
There's little point. Knowing the length does not usually help you
save any other O(n) operations. It'd also only help for strict text, not lazy. Just like lists, asking for the length is usually not a good idea.
Actually it _can_ help quite a bit to know the number of characters (not just words) present in the input: It can tell you of the absence of surrogate pairs if the length (in characters) is the same as the number of words. This then lets you do indexing as an O(1) operation. I used this in a simple fingertree of utf-8 encoded bytestrings to enable faster indexing into leaves that didn't have utf-8 tailbytes present. Since a lot of UTF-8 text has huge swathes of ASCII content, this turned out to be a pretty big win for me. Since the vast majority of UTF-16 text probably contains all Plane 0 (UCS2) content, you can get similar wins. The cost is that you either have to have a sentinel 'don't know' value or pay to scan any arrays that you directly convert to Text. As an aside: I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) that could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself. With ByteString at least, if I must, I can go grab the Data.Bytestring.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community. But really the last nail in the coffin for me, is that often I want to be able to just mmap data in from disk and use it, and rarely is my data on disk stored in UTF-16. Those nits aside, overall, Text provides a clean API for what it does, and I'm completely on board with its inclusion in the platform (but I would really really appreciate having access via a scary deprecated please-don't-use-me Internal module to its guts!) -Edward Kmett