
On Wed, Sep 8, 2010 at 3:28 PM, Duncan Coutts wrote:
> Text does not need to fit every use case. Our impression is that most applications do not need string indexing, and for the few that do, a specialised structure is perfectly reasonable (e.g. using arrays of code points, or trees of arrays of code points).
I understand that viewpoint, and it is a perfectly reasonable scope to
choose. I was simply pointing out that it has kept me from being able to use Text for a number of applications. =)
> Text should be a replacement for String: a type used for general string manipulation, and a common type used in interfaces for passing strings between components. In the general range of use cases it should offer reasonable memory use and performance. It does not need to be the perfect choice for the buffer of a text editor or other special internal use cases.
> It's an important point in the API design of Text that indexing is discouraged because, given a compact variable-length internal representation, you cannot make any useful promises about the performance of indexing. As you point out, you can sometimes do better than O(n), but generally you cannot, and it's not good to encourage a style that often performs well but occasionally is terrible.
> As an aside:
>> I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) that could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself.
> Can you be more specific? The Text API design is that substring is the "cursor" and you don't need any other index.
I understand that design decision, but it doesn't work for my use case. An example of something I do with fingertrees of bytestrings that I can't do with text can be found in the second set of slides here: http://comonad.com/reader/2009/iteratees-parsec-and-monoid/

With iteratees, when you need more input, the new buffer is glued onto the end of the existing buffer. If your buffer is a bytestring (or text), that can be pretty painful due to the O(n) append, and the potentially large amount of read-ahead in certain classes of parsers. Even if you go to lazy bytestrings/text, it's kind of a crappy solution, because you are consistently appending the buffers on the wrong end of the list, so the asymptotics suffer.

To that end, I wound up using a variant on an iteratee which, instead of directly concatenating, stored the bytestring fragments in a fingertree (or skew binary random access list). This fixed the asymptotics on long look-ahead, in exchange for bloating my memory footprint and killing the space usage guarantees provided by Oleg's iteratees.

So now I have a fingertree of buffers (be they bytestrings or text) being held on to by my iteratee. I can choose to retain all of the buffers read (rather than just the unparsed 'right hand side') in the fingertree. This lets me backtrack. Now I can implement a parsec parser where the current 'input' is just a cursor location -- an offset into the fingertree of buffers. This means that parsec's built-in try functionality works, and that I can run a parsec parser 'online'. The most common operation is to read the next character from _some_ index/input, but it doesn't always just drop a character as it goes.

Furthermore, I have these cursors as meaningful references into the input buffer. I can say a lot about them.
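To make the append problem concrete, here is a minimal sketch (names are my own, and I use Data.Sequence from containers as the finger tree) of the accumulating buffer set: feeding another chunk is O(1) at the right end, where appending onto a single growing ByteString would cost O(n) in the accumulated prefix each time.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))

-- Hypothetical buffer set: each fragment paired with its starting
-- offset in the logical stream, kept in a finger tree.
data Buffers = Buffers
  { fragments :: Seq (Int, B.ByteString) -- (start offset, fragment)
  , total     :: !Int                    -- bytes accumulated so far
  }

emptyBuffers :: Buffers
emptyBuffers = Buffers Seq.empty 0

-- Feeding a new chunk snocs it onto the right end of the finger tree
-- in O(1) amortized time; by contrast, B.append on the accumulated
-- input is O(n + m) per chunk.
feed :: Buffers -> B.ByteString -> Buffers
feed (Buffers fs n) c = Buffers (fs |> (n, c)) (n + B.length c)

-- A cursor is just a byte offset into the logical stream.
type Cursor = Int
```

A parser's input state then reduces to a Cursor, and backtracking (parsec's try) becomes restoring an earlier offset rather than retaining a separate copy of the input.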
I can slice between two cursors from the same set of buffers, and I can even know whether they came from the same bytestring/text fragment based on the index and where each fragment starts and ends, and if so, no copying is involved at all. Since this is happening inside of a monad (Parsec), there could be a bunch of these cursors alive at any point. My memory usage dropped off _dramatically_ when I made this improvement.

There doesn't exist a good equivalent under the API that Text makes available to me, and I don't have the ability to build one over the exposed API. I can implement the common 'parsec' input pattern, but not invert control of it using the iteratee machinery to let me run it online, not without walking backwards and forwards over the same bits of Text repeatedly and doing all sorts of manual tracking that blows up my asymptotics.

Do I believe that the Text API should expose this kind of functionality? I'd hazard not. In the above, each of my cursor positions is well formed; it always falls between two characters. Any slice made between any two valid cursors for the buffer set succeeds. To provide those guarantees you need rank-2 types and general scariness that does not belong in a Platform library.

However, my first attempt at implementing this WAS on top of the Text API, until I ran into a situation where the design goal of encapsulation forced me to a full stop and required me to go back and re-engineer on top of ByteString. ByteString has a very similar API, but is at least currently willing to expose an Internal module as a dodge.

I realize there is a desire, from a library encapsulation perspective, to hide the internals, to make them easier to change, and to make it so the public facade isn't so brittle. I simply wish to point out that there is a cost to such abstraction: namely, that people who would otherwise happily be using a library and singing its praises can be completely locked out of it.
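As a sketch of the slicing operation described above (a deliberately simplified version of my own, with the buffer set as a plain ascending list of (start, fragment) pairs rather than a finger tree): each fragment's contribution comes from B.take and B.drop, which are O(1) on strict ByteStrings and share storage with the original chunk, so a slice whose two cursors land in the same fragment copies nothing.

```haskell
import qualified Data.ByteString.Char8 as B

-- Assumed representation: fragments with their start offsets in the
-- logical stream, in ascending order.
type Buffers = [(Int, B.ByteString)]
type Cursor  = Int

-- Slice the logical stream between two cursors. B.take and B.drop are
-- O(1), non-copying views into a strict ByteString; only a slice that
-- spans several fragments pays for the final concatenation.
slice :: Buffers -> Cursor -> Cursor -> B.ByteString
slice bufs from to = B.concat
  [ B.take (min to end - lo) (B.drop (lo - start) frag)
  | (start, frag) <- bufs
  , let end = start + B.length frag
  , end > from
  , start < to
  , let lo = max start from
  ]
```

Since the cursors are just Ints, any number of them can be alive inside the parsing monad at once without retaining anything beyond the shared fragments themselves.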
With ByteString at least, if I must, I can go grab the Data.ByteString.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community.
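For reference, this is the sort of escape hatch I mean: Data.ByteString.Internal exposes the representation via toForeignPtr (a real exported function; the wrapper below is my own illustration), namely a buffer pointer plus an offset and length into it. Slices produced by B.take and B.drop are just new (offset, length) views of the same buffer, which is what makes user-built zero-copy machinery possible when the public API stops short.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Internal as BI

-- Peek at a ByteString's representation: a ForeignPtr to the pinned
-- buffer, plus an offset and length into it. We discard the pointer
-- and return only the (offset, length) view.
view :: B.ByteString -> (Int, Int)
view bs = case BI.toForeignPtr bs of
  (_fp, off, len) -> (off, len)
```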
But really, the last nail in the coffin for me is that I often want to be able to just mmap data in from disk and use it, and rarely is my data on disk stored in UTF-16.
> It's pretty important that the internal representation not be fixed by being exposed in the API. If in future benchmarks demonstrate that some different encoding offers a better balance of performance and memory use, then we need to be able to switch without breaking everyone's programs.
I understand that. However, for me as an end user, the immediate consequence of your long-term intentions is that I am stuck re-inventing the wheel and derive no benefit from an otherwise wonderful library. This consideration does not come without a cost.

> Note that even if the internal encoding happened to match your external encoding, the whole lot would still need to be forced into memory to validate the encoding. So you would be able to share memory with the page cache but not avoid any IO.
No arguments there. Again, there is a wholehearted +1 from me for the adoption of 'text' as part of the platform. I simply felt the need to qualify that +1 with why I find myself, sadly, unable to derive any benefit from its existence.

-Edward Kmett