New subject: Re[2]: Haskell Platform Proposal: add the 'text' library

7 Sep 2010

      = Proposal: Add Data.Text to the Haskell Platform =

Maintainer: Bryan O'Sullivan (submitted with his approval)

== Introduction ==

This is a proposal for the 'text' package to be included in the next
major release of the Haskell platform.

An up-to-date copy of this text is kept at:

    http://trac.haskell.org/haskell-platform/wiki/Proposals/text

Everyone is invited to review this proposal, following the standard
procedure for proposing and reviewing packages.

    http://trac.haskell.org/haskell-platform/wiki/AddingPackages

Review comments should be sent to the libraries mailing list by
October 1 so that we have time to discuss and resolve issues
before the final deadline on November 1.

    http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable 

== Credits ==

Proposal author and package maintainer: Bryan O'Sullivan. The package was
originally written by Tom Harper, and is based on the ByteString and Vector
(fusion) packages.

The following individuals contributed to the review process: Don
Stewart, Johan Tibell

== Abstract ==

The 'text' package provides an efficient packed, immutable Unicode text type
(both strict and lazy), with a powerful loop fusion optimization framework.

The 'Text' type represents Unicode character strings, in a time and
space-efficient manner. This package provides text processing
capabilities that are optimized for performance critical use, both
in terms of large data quantities and high speed.

The 'Text' type provides character-encoding, type-safe case
conversion via whole-string case conversion functions. It also
provides a range of functions for converting Text values to and from
'ByteStrings', using several standard encodings (see the 'text-icu'
package for a much larger variety of encoding functions).
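
As a minimal sketch of the conversions described above, the following
round-trips a 'Text' value through UTF-8 using Data.Text.Encoding. The
helper names roundTrip and utf8Size are made up for this illustration and
are not part of the package. Note that decodeUtf8 raises an exception on
invalid input, so this assumes the bytes are known to be valid UTF-8.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Hypothetical helper: encode a Text value as UTF-8 bytes and decode it
-- back again. For any Text value this should return the input unchanged.
roundTrip :: T.Text -> T.Text
roundTrip = E.decodeUtf8 . E.encodeUtf8

-- Hypothetical helper: number of bytes the UTF-8 encoding occupies.
utf8Size :: T.Text -> Int
utf8Size = B.length . E.encodeUtf8

main :: IO ()
main = do
  let t = T.pack "na\239ve"      -- '\239' is U+00EF, two bytes in UTF-8
  print (roundTrip t == t)
  print (T.length t, utf8Size t) -- characters versus encoded bytes
```

The character count and the byte count differ here because one of the
five characters needs two bytes in UTF-8.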

Efficient, locale-sensitive text IO is also supported.

This module is intended to be imported qualified, to avoid name
clashes with Prelude functions, e.g.

    import qualified Data.Text as T
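
A short sketch of why the qualified import matters: functions such as
T.words, T.unwords and T.length share their names with Prelude functions
on lists. The titleCase function below is a hypothetical example, not
something the package exports.

```haskell
import qualified Data.Text as T

-- Hypothetical example: upper-case the first character of every word.
-- T.words and T.unwords would clash with Prelude's words and unwords
-- if Data.Text were imported unqualified.
titleCase :: T.Text -> T.Text
titleCase = T.unwords . map capitalise . T.words
  where
    capitalise w = case T.uncons w of
      Nothing      -> w
      Just (c, cs) -> T.toUpper (T.singleton c) `T.append` cs

main :: IO ()
main = putStrLn (T.unpack (titleCase (T.pack "hello text world")))
```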

Documentation and tarball are available from the Hackage page:

    http://hackage.haskell.org/package/text

Development repo:

    darcs get http://code.haskell.org/text/

== Rationale ==

While Haskell's Char type is capable of representing Unicode code points, the
String type (a list of such Chars) has some drawbacks that prevent its general
use:

 1. Unicode-unaware case conversion (map toUpper is an unsafe case conversion)
 2. the representation is space inefficient.
 3. the data structure is element-level lazy, whereas a number of
    applications require some level of additional strictness
An intermediate solution to these problems was 'Data.ByteString' (an
efficient byte sequence type that addresses points 2 and 3), which,
when used in conjunction with utf8-string, provides very simple
non-Latin-1 encoding support (though with significant drawbacks in terms
of locale and encoding range).

The 'text' package addresses these shortcomings in a number of ways:

 1. support for whole-string case conversion (thus, type-correct Unicode
    transformations)
 2. a space and time efficient representation, based on unboxed Word16
    arrays
 3. either fully strict, or chunk-level lazy data types (in the style of
    Data.ByteString)
 4. full support for locale-sensitive, encoding-aware IO.
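
The relationship between the strict and the chunk-level lazy types in
point 3 can be sketched as follows. The helper names fromPieces and
collapse are hypothetical; the sketch only assumes the fromChunks and
toChunks functions of Data.Text.Lazy.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Hypothetical helper: build a lazy Text whose chunks are the given
-- strict Text values (empty chunks are dropped by fromChunks).
fromPieces :: [T.Text] -> TL.Text
fromPieces = TL.fromChunks

-- Hypothetical helper: force a lazy Text into a single strict value by
-- concatenating its chunks.
collapse :: TL.Text -> T.Text
collapse = T.concat . TL.toChunks

main :: IO ()
main = do
  let lazy = fromPieces [T.pack "ab", T.pack "cd"]
  print (length (TL.toChunks lazy))      -- number of strict chunks
  print (collapse lazy == T.pack "abcd")
```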

The 'text' library has rapidly become popular for a number of
applications, and is used by more than 50 other Hackage packages. As of
Q2 2010, 'text' is ranked 27th of 2200 libraries (roughly the top 1% by
popularity), and is particularly widely used in web programming. It is
used by:

 * the blaze html pretty printing library
 * the hstringtemplate file templating library
 * *all* popular web frameworks: happstack, snap, salvia and yesod
 * the hexpat and libxml xml parsers

The design is based on experience from Data.Vector and Data.ByteString:

 * the underlying type is based on unpinned, packed arrays on the Haskell heap
   with an ST interface for memory effects.
 * pipelines of operations are optimized via conversion to and from the
   'stream' abstraction[1]
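
As a sketch of the kind of pipeline this optimization targets, the
hypothetical function below chains three 'Text' operations. Whether the
intermediate 'Text' values are actually eliminated depends on the
library version and on GHC's optimization settings; the result is the
same either way.

```haskell
import Data.Char (isAlpha)
import qualified Data.Text as T

-- Hypothetical example: count the letters of a Text after upper-casing.
-- Written naively, each stage would build an intermediate Text; a fusion
-- framework aims to run such a pipeline as a single loop instead.
countLetters :: T.Text -> Int
countLetters = T.length . T.filter isAlpha . T.toUpper

main :: IO ()
main = print (countLetters (T.pack "a1b2 c"))
```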

== The API ==

The API is broken into several logical pieces, which are
self-explanatory:

 * combinators for operating on strict, abstract 'text's.
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text....

 * an equivalent API for chunk-level lazy 'text's.
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...

 * encoding transformations, to and from bytestrings:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...

 * support for conversion to Ptr Word16:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...

 * locale-aware IO layer:
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...
        http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...
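
A minimal sketch of the IO layer, assuming the writeFile, readFile and
putStr functions of Data.Text.IO. The helper saveAndReload and the file
name are made up for this illustration. The example sticks to ASCII
content, since the encoding used for other characters depends on the
current locale.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Hypothetical helper: write a Text value to a file, then read the
-- whole file back as a strict Text.
saveAndReload :: FilePath -> T.Text -> IO T.Text
saveAndReload path t = do
  TIO.writeFile path t
  TIO.readFile path

main :: IO ()
main = do
  -- "text-io-demo.txt" is an arbitrary file name chosen for the example.
  t <- saveAndReload "text-io-demo.txt" (T.pack "hello\n")
  TIO.putStr t
```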

== Design decisions ==

 * IO and pure combinators are in separate modules.
 * Both a fully strict and a partially strict type are provided.
 * The underlying optimization framework is stream fusion (not build/foldr), and is hidden from the user.
 * Unpinned arrays are used, to prevent fragmentation.
 * Large numbers of additional encodings are delegated to the text-icu package.
 * An 'IsString' instance is provided for String literals.
 * The implementation is OS and architecture neutral (portable).
 * The implementation uses a number of language extensions:

    CPP
    MagicHash
    UnboxedTuples
    BangPatterns
    Rank2Types
    RecordWildCards
    ScopedTypeVariables
    ExistentialQuantification
    DeriveDataTypeable

 * The implementation is entirely Haskell (no additional C code or libraries).
 * The package provides a QuickCheck/HUnit testsuite, and coverage data.
 * The package adds no new dependencies to the HP.
 * The package builds with Cabal's Simple build type.
 * There is no existing functionality for packed unicode text in the HP.
 * The package has complexity annotations.
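
The 'IsString' instance listed above means that, with GHC's
OverloadedStrings extension enabled, string literals can be used
directly as 'Text' values. A small sketch, in which the name greeting is
made up for the example:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- With OverloadedStrings, this literal is a Text value; no explicit
-- call to T.pack is needed.
greeting :: T.Text
greeting = "hello, world"

main :: IO ()
main = do
  print (T.length greeting)
  print ("hello" `T.isPrefixOf` greeting)
```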

== Open issues ==

The text-icu package is not part of this proposal.

== Notes ==

The implementation consists of 30 modules, and relies on cabal's package
hiding mechanism to expose only 5 modules. The implementation is around
8000 lines of text total.

The public modules expose none of these (?).

The Python standard library provides both a string and a unicode
sequence type. These are somewhat analogous to the
ByteString/String/Text split.

= References =

[1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts,
     Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.
