
On 8 September 2010 19:43, Edward Kmett wrote:
Did you consider keeping the number of characters in the Text directly? Is there a reason it couldn't be done?
There's little point. Knowing the length does not usually help you avoid any other O(n) operations. It would also only help for strict Text, not lazy. Just as with lists, asking for the length is usually not a good idea.
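To make that concrete, here is a minimal sketch (the SizedText wrapper is hypothetical, not part of the text package): caching the count makes length O(1), but nothing else gets cheaper, because code points are still variable-width in the underlying array.

    import qualified Data.Text as T

    -- Hypothetical wrapper that caches the character count
    -- alongside a strict Text.
    data SizedText = SizedText !Int !T.Text

    fromText :: T.Text -> SizedText
    fromText t = SizedText (T.length t) t   -- one O(n) scan up front

    sizedLength :: SizedText -> Int
    sizedLength (SizedText n _) = n         -- length is now O(1)...

    -- ...but slicing still scans, because the i-th character's
    -- position in a variable-width encoding is unknown without
    -- walking the array.
    sizedTake :: Int -> SizedText -> SizedText
    sizedTake i (SizedText n t) =
      SizedText (max 0 (min i n)) (T.take i t)  -- still O(i)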
Actually it _can_ help quite a bit to know the number of characters (not just words) present in the input:
It can tell you that surrogate pairs are absent: if the length in characters equals the number of 16-bit words, indexing becomes an O(1) operation. I used this in a simple fingertree of UTF-8 encoded bytestrings to enable faster indexing into leaves that contained no UTF-8 continuation bytes. Since a lot of UTF-8 text has huge swathes of ASCII content, this turned out to be a pretty big win for me. Since the vast majority of UTF-16 text probably contains only Plane 0 (UCS-2) content, you can get similar wins. The cost is that you either have to have a sentinel 'don't know' value or pay to scan any arrays that you directly convert to Text.
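Roughly, the leaf-level trick looks like this for UTF-8 (my reconstruction for illustration, not Edward's actual code; Leaf and charOffset are made-up names, and the bytes are assumed to be valid UTF-8):

    import qualified Data.ByteString as B
    import Data.Bits ((.&.))
    import Data.Word (Word8)

    -- A leaf: UTF-8 bytes plus a cached character count.
    data Leaf = Leaf !Int !B.ByteString

    -- Continuation bytes have the bit pattern 10xxxxxx.
    isCont :: Word8 -> Bool
    isCont b = b .&. 0xC0 == 0x80

    mkLeaf :: B.ByteString -> Leaf
    mkLeaf bs = Leaf (B.foldl' step 0 bs) bs
      where
        -- count every byte that starts a character
        step n b = if isCont b then n else n + 1

    -- Byte offset of the i-th character (assumes 0 <= i < count).
    -- If the char count equals the byte count, the leaf is pure
    -- ASCII, so the offset is just i: O(1). Otherwise fall back
    -- to an O(n) scan past continuation bytes.
    charOffset :: Int -> Leaf -> Int
    charOffset i (Leaf chars bs)
      | chars == B.length bs = i
      | otherwise            = go 0 0
      where
        go off seen
          | isCont (B.index bs off) = go (off + 1) seen
          | seen == i               = off
          | otherwise               = go (off + 1) (seen + 1)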
Text does not need to fit every use case. Our impression is that most applications do not need string indexing, and for the few that do, a specialised structure is perfectly reasonable (e.g. using arrays of code points, or trees of arrays of code points). Text should be a replacement for String: a type used for general string manipulation, and a common type used in interfaces for passing strings between components. In the general range of use cases it should offer reasonable memory use and performance. It does not need to be the perfect choice for the buffer of a text editor or other special internal use cases.

It's an important point in the API design of Text that indexing is discouraged because, given a compact variable-length internal representation, you cannot make any useful promises about the performance of indexing. As you point out, you can sometimes do better than O(n), but generally you cannot, and it is not good to encourage a style that often performs well but occasionally is terrible.
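To illustrate the point: using only the standard Data.Text API, an index-based loop is accidentally quadratic, while the equivalent fold is linear.

    import qualified Data.Text as T
    import Data.Char (isUpper)

    -- Discouraged: each 'T.index' is O(i), so this loop is O(n^2).
    countUpperIx :: T.Text -> Int
    countUpperIx t = length [ i | i <- [0 .. T.length t - 1]
                                , isUpper (T.index t i) ]

    -- Encouraged: a single O(n) traversal.
    countUpper :: T.Text -> Int
    countUpper = T.foldl' (\n c -> if isUpper c then n + 1 else n) 0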
As an aside:
I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) when they could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself.
Can you be more specific? The Text API is designed so that a substring is the "cursor": you don't need any other index.
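For instance, here is a sketch of reading two whitespace-separated fields in that style, carrying the unconsumed suffix instead of an integer position (ordinary Data.Text functions; twoFields is a made-up name):

    import qualified Data.Text as T
    import Data.Char (isSpace)

    -- The "cursor" is just the unconsumed suffix; each step is a
    -- substring operation, with no integer indices anywhere.
    twoFields :: T.Text -> (T.Text, T.Text, T.Text)
    twoFields t0 =
      let (field1, rest1) = T.break isSpace t0
          rest1'          = T.dropWhile isSpace rest1
          (field2, rest2) = T.break isSpace rest1'
      in (field1, field2, rest2)  -- rest2 is the cursor for the caller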
With ByteString at least, if I must, I can go grab the Data.ByteString.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community.
But really, the last nail in the coffin for me is that I often want to be able to just mmap data in from disk and use it, and my data on disk is rarely stored in UTF-16.
It's pretty important that the internal representation not be fixed by being exposed in the API. If future benchmarks demonstrate that some different encoding offers a better balance of performance and memory use, then we need to be able to switch without breaking everyone's programs. Note that even if the internal encoding happened to match your external encoding, the whole lot would still need to be forced into memory to validate the encoding, so you would be able to share memory with the page cache but not avoid any IO.
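To make that concrete, a minimal sketch, assuming the internal encoding happened to be UTF-16 (using mmapFileByteString from the mmap package and decodeUtf16LE from Data.Text.Encoding):

    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf16LE)
    import System.IO.MMap (mmapFileByteString)  -- 'mmap' package

    -- Mapping the file is cheap and shares the page cache, but
    -- producing a Text still requires a validating O(n) pass over
    -- every byte, which pages the whole file in anyway.
    -- (decodeUtf16LE throws on invalid input.)
    loadText :: FilePath -> IO T.Text
    loadText path = do
      bytes <- mmapFileByteString path Nothing
      return (decodeUtf16LE bytes)  -- full scan when forced

Duncan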