On Wed, Sep 8, 2010 at 3:28 PM, Duncan Coutts <duncan.coutts@googlemail.com> wrote:
Text does not need to fit every use case. Our impression is that most
applications do not need string indexing and for the few that do, a
specialised structure is perfectly reasonable (eg using arrays of code
points, or trees of arrays of code points).
Text should be a replacement for String: a type used for general
string manipulation a common type used in interfaces for passing
strings between components. In the general rage of use cases it should
offer reasonable memory use and performance. It does not need to be
the perfect choice for the buffer of a text editor or other special
internal use cases.
It's an important point in the API design of Text that indexing is
discouraged because, given a compact variable-length internal
representation, you cannot make any useful promises about the
performance of indexing. As you point out you can sometimes do better
than O(n) but generally you cannot and it's not good to encourage a
style that often performs well but occasionally is terrible.
> As an aside:Can you be more specific. The Text API design is that substring is the
>
> I want to like and use Data.Text, but I haven't been able to. Too many of
> the operations devolve to O(n) that could be O(1) if I had some way to slice
> based on cursors into the data type, and the current API locks me out of the
> internals, so I can't write those functions myself.
"cursor" and you don't need any other index.
> With ByteString atIt's pretty important that the internal representation not be fixed by
> least, if I must, I can go grab the Data.Bytestring.Internal module and have
> at it. I'm all for a clean API, but a clean API that hurts asymptotics
> invites a lot of re-invention on the fringes of the community.
>
> But really the last nail in the coffin for me, is that often I want to be
> able to just mmap data in from disk and use it, and rarely is my data on
> disk stored in UTF-16.
being exposed in the API. In future if benchmarks demonstrate that
some different encoding offers a better balance of performance and
memory use then we need to be able to switch without breaking
everyone's programs.
Note that even if the internal encoding happened to match your
external encoding, the whole lot would still need to be forced into
memory to validate the encoding. So you would be able to share memory
with the page cache but not avoid any IO.