
On 8 September 2010 19:43, Edward Kmett wrote:
Did you consider keeping the number of characters in the Text directly? Is there a reason it couldn't be done?
There's little point. Knowing the length does not usually help you avoid any other O(n) operations. It would also only help for strict Text, not lazy. Just as with lists, asking for the length is usually not a good idea.
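To make that concrete, here is a minimal sketch (the SizedText wrapper is hypothetical, not part of the text package): caching the count makes length O(1), but nothing else gets cheaper, because code points are still variable-width in the underlying array.

    import qualified Data.Text as T

    -- Hypothetical wrapper that caches the character count
    -- alongside a strict Text.
    data SizedText = SizedText !Int !T.Text

    fromText :: T.Text -> SizedText
    fromText t = SizedText (T.length t) t   -- one O(n) scan up front

    sizedLength :: SizedText -> Int
    sizedLength (SizedText n _) = n         -- length is now O(1)...

    -- ...but slicing still scans, because the i-th character's
    -- position in a variable-width encoding is unknown without
    -- walking the array.
    sizedTake :: Int -> SizedText -> SizedText
    sizedTake i (SizedText n t) =
      SizedText (max 0 (min i n)) (T.take i t)  -- still O(i)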
Actually it _can_ help quite a bit to know the number of characters (not just words) present in the input:
It can tell you that surrogate pairs are absent: if the length in characters equals the number of 16-bit words, indexing becomes an O(1) operation. I used this in a simple fingertree of UTF-8 encoded bytestrings to enable faster indexing into leaves that contained no UTF-8 continuation bytes. Since a lot of UTF-8 text has huge swathes of ASCII content, this turned out to be a pretty big win for me. Since the vast majority of UTF-16 text probably contains only Plane 0 (UCS-2) content, you can get similar wins. The cost is that you either have to have a sentinel 'don't know' value or pay to scan any arrays that you directly convert to Text.
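Roughly, the leaf-level trick looks like this for UTF-8 (my reconstruction for illustration, not Edward's actual code; Leaf and charOffset are made-up names, and the bytes are assumed to be valid UTF-8):

    import qualified Data.ByteString as B
    import Data.Bits ((.&.))
    import Data.Word (Word8)

    -- A leaf: UTF-8 bytes plus a cached character count.
    data Leaf = Leaf !Int !B.ByteString

    -- Continuation bytes have the bit pattern 10xxxxxx.
    isCont :: Word8 -> Bool
    isCont b = b .&. 0xC0 == 0x80

    mkLeaf :: B.ByteString -> Leaf
    mkLeaf bs = Leaf (B.foldl' step 0 bs) bs
      where
        -- count every byte that starts a character
        step n b = if isCont b then n else n + 1

    -- Byte offset of the i-th character (assumes 0 <= i < count).
    -- If the char count equals the byte count, the leaf is pure
    -- ASCII, so the offset is just i: O(1). Otherwise fall back
    -- to an O(n) scan past continuation bytes.
    charOffset :: Int -> Leaf -> Int
    charOffset i (Leaf chars bs)
      | chars == B.length bs = i
      | otherwise            = go 0 0
      where
        go off seen
          | isCont (B.index bs off) = go (off + 1) seen
          | seen == i               = off
          | otherwise               = go (off + 1) (seen + 1)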
Text does not need to fit every use case. Our impression is that most applications do not need string indexing, and for the few that do, a specialised structure is perfectly reasonable (e.g. using arrays of code points, or trees of arrays of code points). Text should be a replacement for String: a type used for general string manipulation, and a common type used in interfaces for passing strings between components. In the general range of use cases it should offer reasonable memory use and performance. It does not need to be the perfect choice for the buffer of a text editor or other special internal use cases.

It's an important point in the API design of Text that indexing is discouraged because, given a compact variable-length internal representation, you cannot make any useful promises about the performance of indexing. As you point out, you can sometimes do better than O(n), but generally you cannot, and it is not good to encourage a style that often performs well but occasionally is terrible.
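To illustrate the point: using only the standard Data.Text API, an index-based loop is accidentally quadratic, while the equivalent fold is linear.

    import qualified Data.Text as T
    import Data.Char (isUpper)

    -- Discouraged: each 'T.index' is O(i), so this loop is O(n^2).
    countUpperIx :: T.Text -> Int
    countUpperIx t = length [ i | i <- [0 .. T.length t - 1]
                                , isUpper (T.index t i) ]

    -- Encouraged: a single O(n) traversal.
    countUpper :: T.Text -> Int
    countUpper = T.foldl' (\n c -> if isUpper c then n + 1 else n) 0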
As an aside:
I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) when they could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself.
Can you be more specific? The Text API is designed so that a substring is the "cursor": you don't need any other index.
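For instance, here is a sketch of reading two whitespace-separated fields in that style, carrying the unconsumed suffix instead of an integer position (ordinary Data.Text functions; twoFields is a made-up name):

    import qualified Data.Text as T
    import Data.Char (isSpace)

    -- The "cursor" is just the unconsumed suffix; each step is a
    -- substring operation, with no integer indices anywhere.
    twoFields :: T.Text -> (T.Text, T.Text, T.Text)
    twoFields t0 =
      let (field1, rest1) = T.break isSpace t0
          rest1'          = T.dropWhile isSpace rest1
          (field2, rest2) = T.break isSpace rest1'
      in (field1, field2, rest2)  -- rest2 is the cursor for the caller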
With ByteString at least, if I must, I can go grab the Data.ByteString.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community.
But really, the last nail in the coffin for me is that I often want to be able to just mmap data in from disk and use it, and my data on disk is rarely stored in UTF-16.
It's pretty important that the internal representation not be fixed by being exposed in the API. If future benchmarks demonstrate that some different encoding offers a better balance of performance and memory use, then we need to be able to switch without breaking everyone's programs. Note that even if the internal encoding happened to match your external encoding, the whole lot would still need to be forced into memory to validate the encoding, so you would be able to share memory with the page cache but not avoid any IO.
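To make that concrete, a minimal sketch, assuming the internal encoding happened to be UTF-16 (using mmapFileByteString from the mmap package and decodeUtf16LE from Data.Text.Encoding):

    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf16LE)
    import System.IO.MMap (mmapFileByteString)  -- 'mmap' package

    -- Mapping the file is cheap and shares the page cache, but
    -- producing a Text still requires a validating O(n) pass over
    -- every byte, which pages the whole file in anyway.
    -- (decodeUtf16LE throws on invalid input.)
    loadText :: FilePath -> IO T.Text
    loadText path = do
      bytes <- mmapFileByteString path Nothing
      return (decodeUtf16LE bytes)  -- full scan when forced

Duncan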