
On Wed, Sep 8, 2010 at 3:28 PM, Duncan Coutts wrote:
> Text does not need to fit every use case. Our impression is that most applications do not need string indexing, and for the few that do, a specialised structure is perfectly reasonable (e.g. using arrays of code points, or trees of arrays of code points).
I understand that viewpoint, and it is a perfectly reasonable scope to
choose. I was simply pointing out that it has kept me from being able to use Text for a number of applications. =)
> Text should be a replacement for String: a type used for general string manipulation, and a common type used in interfaces for passing strings between components. In the general range of use cases it should offer reasonable memory use and performance. It does not need to be the perfect choice for the buffer of a text editor or other special internal use cases.
> It's an important point in the API design of Text that indexing is discouraged because, given a compact variable-length internal representation, you cannot make any useful promises about the performance of indexing. As you point out, you can sometimes do better than O(n), but generally you cannot, and it's not good to encourage a style that often performs well but occasionally is terrible.
> As an aside:
>> I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) that could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself.
> Can you be more specific? The Text API design is that substring is the "cursor" and you don't need any other index.
I understand that design decision, but it doesn't work for my use case. An example of something I do with fingertrees of bytestrings that I can't do with text can be found in the second set of slides here: http://comonad.com/reader/2009/iteratees-parsec-and-monoid/

With iteratees, when you need more input, the new buffer is glued onto the end of the existing buffer. If your buffer is a bytestring (or text), that can be pretty painful due to the O(n) append, and the potentially large amount of read-ahead in certain classes of parsers. Even if you go to lazy bytestrings/text, it's kind of a crappy solution, because you are consistently appending the buffers on the wrong end of the list, so the asymptotics suffer.

To that end, I wound up using a variant on an iteratee which, instead of directly concatenating, stored the bytestring fragments in a fingertree (or skew binary random access list). This fixed the asymptotics on long look-ahead, in exchange for bloating my memory footprint and killing the space usage guarantees provided by Oleg's iteratees.

So now I have a fingertree of buffers (be they bytestrings or text) being held on to by my iteratee. I can choose to retain all of the buffers read (rather than just the unparsed 'right hand side') in the fingertree. This lets me backtrack. Now I can implement a parsec parser where the current 'input' is just a cursor location -- an offset into the fingertree of buffers. This means that parsec's built-in try functionality works, and that I can run a parsec parser 'online'. The most common operation is to read the next character from _some_ index/input, but it doesn't always just drop a character as it goes.

Furthermore, I have these cursors as meaningful references into the input buffer. I can say a lot about them.
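To make the append problem concrete, here is a minimal sketch (names are my own, and I use Data.Sequence from containers as the finger tree) of the accumulating buffer set: feeding another chunk is O(1) at the right end, where appending onto a single growing ByteString would cost O(n) in the accumulated prefix each time.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Sequence as Seq
import Data.Sequence (Seq, (|>))

-- Hypothetical buffer set: each fragment paired with its starting
-- offset in the logical stream, kept in a finger tree.
data Buffers = Buffers
  { fragments :: Seq (Int, B.ByteString) -- (start offset, fragment)
  , total     :: !Int                    -- bytes accumulated so far
  }

emptyBuffers :: Buffers
emptyBuffers = Buffers Seq.empty 0

-- Feeding a new chunk snocs it onto the right end of the finger tree
-- in O(1) amortized time; by contrast, B.append on the accumulated
-- input is O(n + m) per chunk.
feed :: Buffers -> B.ByteString -> Buffers
feed (Buffers fs n) c = Buffers (fs |> (n, c)) (n + B.length c)

-- A cursor is just a byte offset into the logical stream.
type Cursor = Int
```

A parser's input state then reduces to a Cursor, and backtracking (parsec's try) becomes restoring an earlier offset rather than retaining a separate copy of the input.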
I can slice between two cursors from the same set of buffers, and I can even know whether they came from the same bytestring/text fragment based on the index and where each fragment starts and ends, and if so, no copying is involved at all. Since this is happening inside of a monad (Parsec), there could be a bunch of these cursors alive at any point. My memory usage dropped off _dramatically_ when I made this improvement.

There doesn't exist a good equivalent under the API that Text makes available to me, and I don't have the ability to build one over the exposed API. I can implement the common 'parsec' input pattern, but not invert control of it using the iteratee machinery to let me run it online, not without walking backwards and forwards over the same bits of Text repeatedly and doing all sorts of manual tracking that blows up my asymptotics.

Do I believe that the Text API should expose this kind of functionality? I'd hazard not. In the above, each of my cursor positions is well formed; it always falls between two characters. Any slice made between any two valid cursors for the buffer set succeeds. To provide those guarantees you need rank-2 types and general scariness that does not belong in a Platform library.

However, my first attempt at implementing this WAS on top of the Text API, until I ran into a situation where the design goal of encapsulation forced me to a full stop and required me to go back and re-engineer on top of ByteString. ByteString has a very similar API, but is at least currently willing to expose an Internal module as a dodge.

I realize there is a desire, from a library encapsulation perspective, to hide the internals, to make them easier to change, and to make it so the public facade isn't so brittle. I simply wish to point out that there is a cost to such abstraction: namely, that people who would otherwise happily be using a library and singing its praises can be completely locked out of it.
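As a sketch of the slicing operation described above (a deliberately simplified version of my own, with the buffer set as a plain ascending list of (start, fragment) pairs rather than a finger tree): each fragment's contribution comes from B.take and B.drop, which are O(1) on strict ByteStrings and share storage with the original chunk, so a slice whose two cursors land in the same fragment copies nothing.

```haskell
import qualified Data.ByteString.Char8 as B

-- Assumed representation: fragments with their start offsets in the
-- logical stream, in ascending order.
type Buffers = [(Int, B.ByteString)]
type Cursor  = Int

-- Slice the logical stream between two cursors. B.take and B.drop are
-- O(1), non-copying views into a strict ByteString; only a slice that
-- spans several fragments pays for the final concatenation.
slice :: Buffers -> Cursor -> Cursor -> B.ByteString
slice bufs from to = B.concat
  [ B.take (min to end - lo) (B.drop (lo - start) frag)
  | (start, frag) <- bufs
  , let end = start + B.length frag
  , end > from
  , start < to
  , let lo = max start from
  ]
```

Since the cursors are just Ints, any number of them can be alive inside the parsing monad at once without retaining anything beyond the shared fragments themselves.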
With ByteString at least, if I must, I can go grab the Data.ByteString.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community.
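For reference, this is the sort of escape hatch I mean: Data.ByteString.Internal exposes the representation via toForeignPtr (a real exported function; the wrapper below is my own illustration), namely a buffer pointer plus an offset and length into it. Slices produced by B.take and B.drop are just new (offset, length) views of the same buffer, which is what makes user-built zero-copy machinery possible when the public API stops short.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Internal as BI

-- Peek at a ByteString's representation: a ForeignPtr to the pinned
-- buffer, plus an offset and length into it. We discard the pointer
-- and return only the (offset, length) view.
view :: B.ByteString -> (Int, Int)
view bs = case BI.toForeignPtr bs of
  (_fp, off, len) -> (off, len)
```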
But really, the last nail in the coffin for me is that I often want to be able to just mmap data in from disk and use it, and rarely is my data on disk stored in UTF-16.
> It's pretty important that the internal representation not be fixed by being exposed in the API. If in future benchmarks demonstrate that some different encoding offers a better balance of performance and memory use, then we need to be able to switch without breaking everyone's programs.
I understand that. However, for me as an end user, the immediate consequence of your long-term intentions is that I am stuck re-inventing the wheel and derive no benefit from an otherwise wonderful library. This consideration does not come without a cost.

> Note that even if the internal encoding happened to match your external encoding, the whole lot would still need to be forced into memory to validate the encoding. So you would be able to share memory with the page cache but not avoid any IO.
No arguments there. Again, there is a wholehearted +1 from me for the adoption of 'text' as part of the platform. I simply felt the need to qualify that +1 with why I find myself, sadly, unable to derive any benefit from its existence.

-Edward Kmett