Haskell Platform Proposal: add the 'text' library

= Proposal: Add Data.Text to the Haskell Platform = Maintainer: Bryan O'Sullivan (submitted with his approval) == Introduction == This is a proposal for the 'text' package to be included in the next major release of the Haskell platform. An up to date copy of this text is kept at: http://trac.haskell.org/haskell-platform/wiki/Proposals/text Everyone is invited to review this proposal, following the standard procedure for proposing and reviewing packages. http://trac.haskell.org/haskell-platform/wiki/AddingPackages Review comments should be sent to the libraries mailing list by October 1 so that we have time to discuss and resolve issues before the final deadline on November 1. http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable == Credits == Proposal author and package maintainer: Bryan O'Sullivan, originally by Tom Harper, based on ByteString and Vector (fusion) packages. The following individuals contributed to the review process: Don Stewart, Johan Tibell == Abstract == The 'text' package provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework. The 'Text' type represents Unicode character strings, in a time and space-efficient manner. This package provides text processing capabilities that are optimized for performance critical use, both in terms of large data quantities and high speed. The 'Text' type provides character-encoding, type-safe case conversion via whole-string case conversion functions. It also provides a range of functions for converting Text values to and from 'ByteStrings', using several standard encodings (see the 'text-icu' package for a much larger variety of encoding functions). Efficient locale-sensitive support for text IO is also supported. This module is intended to be imported qualified, to avoid name clashes with Prelude functions, e.g. import qualified Data.Text as T Documentation and tarball from the hackage page: http://hackage.haskell.org/package/text Development repo: darcs get http://code.haskell.org/text/ == Rationale == While Haskell's Char type is capable of reprenting Unicode code points, the String sequence of such Chars has some drawbacks that prevent is general use: 1. unicode-unaware case conversion (map toUpper is an unsafe case conversion) 2. the representation is space inefficient. 3. the data structure is element-level lazy, whereas a number of applications require either some level of additional strictness An intermediate solution to these was via 'Data.ByteString' (an efficient byte sequence type, that addresses points 2 and 3), which, when used in conjunction with utf8-string, provides very simple non-latin1 encoding support (though with significant drawbacks in terms of locale and encoding range). The 'text' package addresses these shortcomings in a number of way: 1. support whole-string case conversion (thus, type correct unicode transformations) 2. a space and time efficient representation, based on unboxed Word16 arrays 3. either fully strict, or chunk-level lazy data types (in the style of Data.ByteString) 4. full support for locale-sensitive, encoding-aware IO. The 'text' library has rapidly become popular for a number of applications, and is used by more than 50 other Hackage packages. As of Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular), in particular, in web programming. It is used by: * the blaze html pretty printing library * the hstringtemplate file templating library * *all* popular web frameworks: happstack, snap, salvia and yesod web frameworks * the hexpat and libxml xml parsers The design is based on experience from Data.Vector and Data.ByteString: * the underlying type is based on unpinned, packed arrays on the Haskell heap with an ST interface for memory effects. * pipelines of operations are optimized via converstion to and from the 'stream' abstraction[1] == The API == The API is broken into several logical pieces, which are self-explanatory: * combinators for operating on strict, abstract 'text's. http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text.... * an equivalent API for chunk-element lazy 'text's. http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-... * encoding transformations, to and from bytestrings: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-... * support for conversion to Ptr Word16: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-... * locale-aware IO layer: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-... http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-... == Design decisions == * IO and pure combinators are in separate modules. * Both a fully strict, and partially-strict type are provided. * The underlying optimization framework is stream fusion, (not build/foldr), and is hidden from the user. * Unpinned arrays are used, to prevent fragmentation. * Large numbers of additional encodings are delegated to the text-icu package. * An 'IsString' instance is provided for String literals. * The implementation is OS and architecture neutral (portable). * The implementation uses a number of language extensions: CPP MagicHash UnboxedTuples BangPatterns Rank2Types RecordWildCards ScopedTypeVariables ExistentialQuantification DeriveDataTypeable * The implementation is entirely Haskell (no additional C code or libraries). * The package provides a QuickCheck/HUnit testsuite, and coverage data. * The package adds no new dependencies to the HP. * The package builds with the Simple cabal way. * There is no existing functionality for packed unicode text in the HP. * The package has complexity annotations. == Open issues == The text-icu package is not part of this propposal. == Notes == The implementation consists of 30 modules, and relies on cabal's package hiding mechanism to expose only 5 modules. The implementation is around 8000 lines of text total. The public modules expose none of these (?). The Python standard library provides both a string and a unicode sequence type. These are somewhat analogous to the ByteString/String/Text split. = References = [1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts, Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany.

On 7 September 2010 16:26, Don Stewart
= Proposal: Add Data.Text to the Haskell Platform =
== The API ==
The API is broken into several logical pieces, which are self-explanatory:
I just want to point out to people who would like to review the API that Bryan recently released version 0.8 which contains some minor API changes from version 0.7. So you should look at the 0.8 haddock documentation rather than the 0.7 versions linked in the initial proposal. Don: since you're making the proposal, can you update the wiki version of the proposal with links to the latest API docs. http://trac.haskell.org/haskell-platform/wiki/Proposals/text Duncan

I see that the text package provides its own encoding/decoding
functions. This overlaps with the Unicode API offered from the base
package. The API in base is oriented towards encoding/decoding of text
when doing file IO but definitely the conversion utils should be
reused. I implemented myself conversion functions using the internal
API:
http://code.haskell.org/gf/src/compiler/GF/Text/Coding.hs
This is String <-> ByteString conversion but it could work with Text as well.
This is mainly implementation issue but if we add text to Haskell
Platform then it will be harder to change the API later if that is
needed in order to reuse the API from base. For instance in base there
is a notion of TextEncoding which I don't see in text.
Regards,
Krasimir
2010/9/7 Don Stewart
= Proposal: Add Data.Text to the Haskell Platform =
Maintainer: Bryan O'Sullivan (submitted with his approval)
== Introduction ==
This is a proposal for the 'text' package to be included in the next major release of the Haskell platform.
An up to date copy of this text is kept at:
http://trac.haskell.org/haskell-platform/wiki/Proposals/text
Everyone is invited to review this proposal, following the standard procedure for proposing and reviewing packages.
http://trac.haskell.org/haskell-platform/wiki/AddingPackages
Review comments should be sent to the libraries mailing list by October 1 so that we have time to discuss and resolve issues before the final deadline on November 1.
http://trac.haskell.org/haskell-platform/wiki/ReleaseTimetable
== Credits ==
Proposal author and package maintainer: Bryan O'Sullivan, originally by Tom Harper, based on ByteString and Vector (fusion) packages.
The following individuals contributed to the review process: Don Stewart, Johan Tibell
== Abstract ==
The 'text' package provides an efficient packed, immutable Unicode text type (both strict and lazy), with a powerful loop fusion optimization framework.
The 'Text' type represents Unicode character strings, in a time and space-efficient manner. This package provides text processing capabilities that are optimized for performance critical use, both in terms of large data quantities and high speed.
The 'Text' type provides character-encoding, type-safe case conversion via whole-string case conversion functions. It also provides a range of functions for converting Text values to and from 'ByteStrings', using several standard encodings (see the 'text-icu' package for a much larger variety of encoding functions).
Efficient locale-sensitive support for text IO is also supported.
This module is intended to be imported qualified, to avoid name clashes with Prelude functions, e.g.
import qualified Data.Text as T
Documentation and tarball from the hackage page:
http://hackage.haskell.org/package/text
Development repo:
darcs get http://code.haskell.org/text/
== Rationale ==
While Haskell's Char type is capable of reprenting Unicode code points, the String sequence of such Chars has some drawbacks that prevent is general use:
1. unicode-unaware case conversion (map toUpper is an unsafe case conversion) 2. the representation is space inefficient. 3. the data structure is element-level lazy, whereas a number of applications require either some level of additional strictness
An intermediate solution to these was via 'Data.ByteString' (an efficient byte sequence type, that addresses points 2 and 3), which, when used in conjunction with utf8-string, provides very simple non-latin1 encoding support (though with significant drawbacks in terms of locale and encoding range).
The 'text' package addresses these shortcomings in a number of way:
1. support whole-string case conversion (thus, type correct unicode transformations) 2. a space and time efficient representation, based on unboxed Word16 arrays 3. either fully strict, or chunk-level lazy data types (in the style of Data.ByteString) 4. full support for locale-sensitive, encoding-aware IO.
The 'text' library has rapidly become popular for a number of applications, and is used by more than 50 other Hackage packages. As of Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular), in particular, in web programming. It is used by:
* the blaze html pretty printing library * the hstringtemplate file templating library * *all* popular web frameworks: happstack, snap, salvia and yesod web frameworks * the hexpat and libxml xml parsers
The design is based on experience from Data.Vector and Data.ByteString:
* the underlying type is based on unpinned, packed arrays on the Haskell heap with an ST interface for memory effects. * pipelines of operations are optimized via converstion to and from the 'stream' abstraction[1]
== The API ==
The API is broken into several logical pieces, which are self-explanatory:
* combinators for operating on strict, abstract 'text's. http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text....
* an equivalent API for chunk-element lazy 'text's. http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...
* encoding transformations, to and from bytestrings: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...
* support for conversion to Ptr Word16: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...
* locale-aware IO layer: http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-... http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...
== Design decisions ==
* IO and pure combinators are in separate modules. * Both a fully strict, and partially-strict type are provided. * The underlying optimization framework is stream fusion, (not build/foldr), and is hidden from the user. * Unpinned arrays are used, to prevent fragmentation. * Large numbers of additional encodings are delegated to the text-icu package. * An 'IsString' instance is provided for String literals. * The implementation is OS and architecture neutral (portable). * The implementation uses a number of language extensions:
CPP MagicHash UnboxedTuples BangPatterns Rank2Types RecordWildCards ScopedTypeVariables ExistentialQuantification DeriveDataTypeable
* The implementation is entirely Haskell (no additional C code or libraries). * The package provides a QuickCheck/HUnit testsuite, and coverage data. * The package adds no new dependencies to the HP. * The package builds with the Simple cabal way. * There is no existing functionality for packed unicode text in the HP. * The package has complexity annotations.
== Open issues ==
The text-icu package is not part of this propposal.
== Notes ==
The implementation consists of 30 modules, and relies on cabal's package hiding mechanism to expose only 5 modules. The implementation is around 8000 lines of text total.
The public modules expose none of these (?).
The Python standard library provides both a string and a unicode sequence type. These are somewhat analogous to the ByteString/String/Text split.
= References =
[1]: "Stream Fusion: From Lists to Streams to Nothing at All", Coutts, Leshchinskiy and Stewart, ICFP 2007, Freiburg, Germany. _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

On Tue, Sep 7, 2010 at 2:06 PM, Krasimir Angelov
I see that the text package provides its own encoding/decoding functions. This overlaps with the Unicode API offered from the base package.
It doesn't really. As you note, the stuff in base is all tied to I/O on Handles, whereas the functions in the text package are pure.
The API in base is oriented towards encoding/decoding of text when doing file IO but definitely the conversion utils should be reused.
Unfortunately, that's not possible, as it would break backwards compatibility with 6.10, which some industrial users still need.

2010/9/8 Bryan O'Sullivan
It doesn't really. As you note, the stuff in base is all tied to I/O on Handles, whereas the functions in the text package are pure.
It doesn't mean that it is not possible to make them pure. The operations are pure in nature.
Unfortunately, that's not possible, as it would break backwards compatibility with 6.10, which some industrial users still need.
I guess you mean that then the text package will not work with 6.10. I prefer to have some intermediate version of text for compatibility rather than to nail down this lack of synergy forever. I still think that it is premature to add the text package if it is not in synchrony with the existing packages.

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 9/7/10 23:37 , Krasimir Angelov wrote:
2010/9/8 Bryan O'Sullivan
: It doesn't really. As you note, the stuff in base is all tied to I/O on Handles, whereas the functions in the text package are pure.
It doesn't mean that it is not possible to make them pure. The operations are pure in nature.
Unfortunately, that's not possible, as it would break backwards compatibility with 6.10, which some industrial users still need.
I guess you mean that then the text package will not work with 6.10. I
He means that text won't work with 6.10 *if* it's changed to use the conversion routines that only exist in 6.12+ as you seem to be demanding. - -- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.10 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkyHB8sACgkQIn7hlCsL25UiAACePqrK1E+cor7KPUVfE+BSoi20 wggAoKuwOzGDHtz0roRS4b0IbPJrCBG8 =fpmo -----END PGP SIGNATURE-----

2010/9/8 Brandon S Allbery KF8NH
He means that text won't work with 6.10 *if* it's changed to use the conversion routines that only exist in 6.12+ as you seem to be demanding.
Exactly. But it is probably possible to make version of text which with 6.10 uses some copy of the routines and with 6.12 uses the routines in base. The compatibility issue is only temporary i.e. until there are many users of 6.10. The API from text will have to stay forever. For now at least the API should be made compatible with base. For example something like that: encode :: TextEncoding -> Text -> ByteString decode :: TextEncoding -> ByteString -> Text where TextEncoding could be defined in the text package when it is compiled with GHC 6.10 or just reexported from base when it is compiled with GHC 6.12. Krasimir

On 8 September 2010 13:58, Krasimir Angelov
2010/9/8 Brandon S Allbery KF8NH
: He means that text won't work with 6.10 *if* it's changed to use the conversion routines that only exist in 6.12+ as you seem to be demanding.
Exactly. But it is probably possible to make version of text which with 6.10 uses some copy of the routines and with 6.12 uses the routines in base.
I highly doubt that you can simply copy/paste the IO stuff from 6.12+ to use with 6.10, unless you were willing to copy a _lot_ of code. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Tue, Sep 7, 2010 at 8:58 PM, Krasimir Angelov
Exactly. But it is probably possible to make version of text which with 6.10 uses some copy of the routines and with 6.12 uses the routines in base.
It might be possible, but I am not going to do it :-) For now at least the API should be made compatible with base. I'm afraid not. The TextEncoding type ties encoding and decoding together, when in pure code you need just one or the other. The TextEncoding design is fine for read/write Handles, where you may need both, but it does not make sense for pure code, where the current API provided by text is more appropriate.

2010/9/8 Bryan O'Sullivan
I'm afraid not. The TextEncoding type ties encoding and decoding together, when in pure code you need just one or the other. The TextEncoding design is fine for read/write Handles, where you may need both, but it does not make sense for pure code, where the current API provided by text is more appropriate.
I don't see how this is related to purity. I can't believe that it is not possible to design single coherent API suitable for both purposes.

On 8 September 2010 05:53, Bryan O'Sullivan
On Tue, Sep 7, 2010 at 8:58 PM, Krasimir Angelov
wrote: Exactly. But it is probably possible to make version of text which with 6.10 uses some copy of the routines and with 6.12 uses the routines in base.
It might be possible, but I am not going to do it :-)
In the longer term I would also like to see these unified, but I don't think it has to be done immediately. It will require more changes in the TextEncoding stuff than in the text package. In particular the TextEncoding will need to be changed to be pure, e.g. using the ST monad rather than the IO monad as it uses currently. I hope that way, the same encoding stuff can be used for IO handles and for pure conversions and that it can perform well in both use cases.
For now at least the API should be made compatible with base.
I'm afraid not. The TextEncoding type ties encoding and decoding together, when in pure code you need just one or the other. The TextEncoding design is fine for read/write Handles, where you may need both, but it does not make sense for pure code, where the current API provided by text is more appropriate.
I have to say I don't understand this. It's easy to use just one direction of encode/decode. Are you saying there are encodings where it only makes sense to implement one direction? Or are you saying that writing decodeUtf8 :: ByteString -> Text is just that much nicer than writing decode utf8 :: ByteString -> Text ? Here is a possible solution: keep the current encodeFoo/decodeFoo in Data.Text.Encoding. Later when we get a sensible reusable TextEncoding abstraction (e.g. by pulling it out of GHC.IO.* and making it use ST so it can be pure) then we add to Data.Text.Encoding: encode :: TextEncoding -> Text -> ByteString decode :: TextEncoding -> ByteString -> Text decodeWith :: TextEncoding -> OnDecodeError -> ByteString -> Text and internally redefine: decodeUtf8 = decode utf8 -- or is it utf8_bom ? Duncan

On Wed, Sep 8, 2010 at 2:56 AM, Duncan Coutts
I'm afraid not. The TextEncoding type ties encoding and decoding together, when in pure code you need just one or the other.
I have to say I don't understand this.
I think I might have been paying insufficiently close attention when I wrote that. I recalled the type specifying both an encoder and a decoder (which it does), and erroneously extended that in my memory to the smart constructor requiring a specification of each (which it doesn't).
Are you saying there are encodings where
it only makes sense to implement one direction? No, the point I thought I was trying to make was that what you need out of an encoding is almost always asymmetric. In network apps, for instance, I need only decoders in one location in my code, and encoders either somewhere else entirely or not at all. But that's moot. But regardless, TextEncoding as it stands isn't quite up to snuff, and I don't want to redo that. and internally redefine: Right.
decodeUtf8 = decode utf8 -- or is it utf8_bom ?
I don't think Data.Text supports that particular encoding.

On Tue, Sep 07, 2010 at 08:26:36AM -0700, Donald Bruce Stewart wrote:
= Proposal: Add Data.Text to the Haskell Platform =
I feel silly saying this, but as this will probably serve as an example of the policy I'll say it anyway: I think this should be: Proposal: Add 'text' to the Haskell Platform
Proposal Author: Don Stewart Maintainer: Bryan O'Sullivan (submitted with his approval)
Credits Proposal author and package maintainer: Bryan O'Sullivan, originally by Tom Harper, based on ByteString? and Vector (fusion) packages.
The following individuals contributed to the review process: Don Stewart, Johan Tibell
These two sections appear to contradict each other.
Also, the hackage page says
Maintainer Bryan O'Sullivan
This is a proposal for the 'text' package
Should mention the version number, and link to the hackage page.
This package provides text processing capabilities that are optimized for performance critical use, both in terms of large data quantities and high speed.
Are there other uses it is less suitable for, or are you just saying that the code has been optimised? If performance is important for the proposal, do you have evidence that it performs well, or a way to check that performance has not regressed in future releases?
using several standard encodings
Just ASCII and UTF*, right? Incidentally, I've just noticed some broken haddock markup for: I/O libraries /do not support locale-sensitive I\O in http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-...
see the 'text-icu' package
Would be nice for this to link to the hackage page.
a much larger variety of encoding functions
Why not bundle these in the text package, or also put this package in the platform? hackage doesn't have the haddocks as I write this, but I assume they are text-specific.
Should link to the version-specific page. This item of "Proposal content" on AddingPackages doesn't seem to be covered: For library packages, an example of how the API is intended to be used should be given. This is really a comment on the process rather than your proposal, but After a proposal is accepted (or conditionally accepted) the proposal must remain on the wiki. and An explicit checklist of the package requirements below is not required. The proposal should state however that all the requirements are met seem incompatible to me, as your All package requirements are met. comment will become out of date as the requirement list evolves. On http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.... a number of haddocks say Subject to fusion. but I can't see an explanation for the new user of what this means or why they should care. Also, what it not be better to say Warning: Not subject to fusion. for the handful that aren't? Currently it's hard to notice. In http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-... I would expect lenientDecode etc to use the On{En,De}codeError type synonyms defined above. In http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-... the choice 'B' seems odd: import qualified Data.Text.Lazy as B I would have expected http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.... to mention the existence of .Lazy in its description, and an explanation of when I should use it. Are there cases when Data.Text is significantly faster than Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of Data.Text, but do we need the user to have a complete interface for both?) In http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.... isInfixOf's docs day: O(n+m) The isInfixOf function takes two Texts and returns True iff the first is contained, wholly and intact, anywhere within the second. In (unlikely) bad cases, this function's time complexity degrades towards O(n*m). I think the complexity at the start, in the same place as all the other complexities, ought to be O(n*m), with the common case given afterwards. And replace's docs just say O(m+n) Replace every occurrence of one substring with another. but should presumably be O(n*m). It's also not necessarily clear what m and n refer to.
length :: Text -> Int O(n) Returns the number of characters in a Text. Subject to fusion.
Did you consider keeping the number of characters in the Text directly? Is there a reason it couldn't be done?
prevent is general use
"prevent its general use"
a number of way:
"a number of ways:"
unicode-unaware case conversion (map toUpper is an unsafe case conversion)
Surely this is something that should be added to Data.Char, irrespective of whether text is added to the HP?
the data structure is element-level lazy, whereas a number of applications require either some level of additional strictness
This sentence looks like it has been mis-edited? And by "a number of applications" I think you mean "high performance applications"?
support whole-string case conversion (thus, type correct unicode transformations)
I don't really get what you mean by "type correct" here.
based on unboxed Word16 arrays
Why Word16?
As of Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most popular), in particular, in web programming.
I can't work out what you mean here. Ranked 27 by what metric? Why web programming in particular?
A large testsuite, with coverage data, is provided.
It would be nice if this was on the text package's page, rather than in ~dons.
RecordWildCards
I'm not a fan, but I fear I may be in the minority.
propposal
"proposal"
to expose only 5 modules
9, no?
The public modules expose none of these (?).
None of what? I compared the API of Data.Text and Data.ByteString.Char8 and found a number of differences: BS: break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString) Text: break :: Text -> Text -> (Text, Text) breakEnd :: Text -> Text -> (Text, Text) breakBy :: (Char -> Bool) -> Text -> (Text, Text) BS: count :: Char -> ByteString -> Int Text: count :: Text -> Text -> Int BS: find :: (Char -> Bool) -> ByteString -> Maybe Char Text: find :: Text -> Text -> [(Text, Text)] findBy :: (Char -> Bool) -> Text -> Maybe Char BS: replicate :: Int -> Char -> ByteString Text: replicate :: Int -> Text -> Text BS: split :: Char -> ByteString -> [ByteString] Text: split :: Text -> Text -> [Text] BS: span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text) BS: splitBy :: (Char -> Bool) -> Text -> [Text] Text: splitWith :: (Char -> Bool) -> ByteString -> [ByteString] BS: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a) Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text BS: zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a] Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text I think the two APIs ought to be brought into agreement. There are a number of other differences which probably want to be tidied up (mostly functions which are in one package but not the other, and ByteString has IO functions mixed in with the non-IO functions), but those seemed to be the most significant ones. Also, prefixed :: Text -> Text -> Maybe Text is analogous to stripPrefix :: Eq a => [a] -> [a] -> Maybe [a] in Data.List This also made me notice that Text haddocks tend to use 'b' as a type variable rather than 'a', e.g. foldl :: (b -> Char -> b) -> b -> Text -> b Thanks Ian

I'll answer a few of Ian's questions about the design of the text package:
On 7 September 2010 22:50, Ian Lynagh
see the 'text-icu' package
Would be nice for this to link to the hackage page.
a much larger variety of encoding functions
Why not bundle these in the text package, or also put this package in the platform? hackage doesn't have the haddocks as I write this, but I assume they are text-specific.
It would depend on the ICU C library. Similarly if we added a conversion lib based on iconv. The ones in the text package now are pure Haskell.
Are there cases when Data.Text is significantly faster than Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of Data.Text, but do we need the user to have a complete interface for both?)
Mm, this is a fair question. In the case of bytestring we need both because sometimes for dealing with foreign code or IO you need the representation to be a contigious block of memory. For text the representation is more abstract so that need does not arrise. One might argue that if it is simply to control strictness then one could use the lazy version and provide a deepseq instance. Here's an alternative argument: suppose we change the representation of strict text to be a tree of chunks (e.g. finger tree). We could achieve effecient concatenation. This representation would be impossible while preserving semantics of a lazy tail. A tree impl that has any kind of balance needs to know the overall length so cannot have a lazy tail.
Did you consider keeping the number of characters in the Text directly? Is there a reason it couldn't be done?
There's little point. Knowing the length does not usually help you save any other O(n) operations. It'd also only help for strict text, not lazy. Just like lists, asking for the length is usually not a good idea.
unicode-unaware case conversion (map toUpper is an unsafe case conversion)
Surely this is something that should be added to Data.Char, irrespective of whether text is added to the HP?
No, not to Data.Char. Case folding is not a per-Char operation, it's only works for [Char] / String / Text. It could be added to Data.String or something.
based on unboxed Word16 arrays
Why Word16?
It doesn't actually matter. It's an implementation detail. It was originally chosen based on benchmarks. It could be changed again based on new benchmarks without affecting the public API.
I compared the API of Data.Text and Data.ByteString.Char8 and found a number of differences:
Many of these are deliberate and sensible. The thing with text as opposed to lists/arrays is that almost all operations you want to do are substring based and not element based. A Unicode code point (a Char) is sadly only roughly related to the human concept of a character. In particular there are combining characters. So even if you want to search or split on a particular "character" that may mean searching for a short sequence of Chars / code points. So where the ByteString API followed the List api by being byte oriented, the Text API is substring oriented.
BS: break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString) Text: break :: Text -> Text -> (Text, Text) breakEnd :: Text -> Text -> (Text, Text) breakBy :: (Char -> Bool) -> Text -> (Text, Text)
BS: count :: Char -> ByteString -> Int Text: count :: Text -> Text -> Int
BS: find :: (Char -> Bool) -> ByteString -> Maybe Char Text: find :: Text -> Text -> [(Text, Text)] findBy :: (Char -> Bool) -> Text -> Maybe Char
BS: replicate :: Int -> Char -> ByteString Text: replicate :: Int -> Text -> Text
BS: split :: Char -> ByteString -> [ByteString] Text: split :: Text -> Text -> [Text]
BS: span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text)
BS: splitBy :: (Char -> Bool) -> Text -> [Text] Text: splitWith :: (Char -> Bool) -> ByteString -> [ByteString]
BS: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a) Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text
BS: zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a] Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text
I think the two APIs ought to be brought into agreement.
Perhaps. If so, then it is the ByteString.Char8 that ought to be brought into agreement with Text, not the other way around. I think Text is right in this area. On the other hand, perhaps it makes sense for ByteString.Char8 to remain like the ByteString byte interface which is byte oriented (and probably rightly so). I hope the significance and use of ByteString.Char8 will decrease as Text becomes more popular. ByteString.Char8 is really just for the cases where you're handling ASCII-like protocols.
There are a number of other differences which probably want to be tidied up (mostly functions which are in one package but not the other,
What are you thinking of specifically?
ByteString has IO functions mixed in with the non-IO functions,
Which I don't think was a good idea. I would prefer to split them up.
but those seemed to be the most significant ones. Also,
prefixed :: Text -> Text -> Maybe Text is analogous to stripPrefix :: Eq a => [a] -> [a] -> Maybe [a] in Data.List
Ah, that one probably does make sense to change to match Data.List. Duncan

On Wed, Sep 8, 2010 at 12:21 AM, Duncan Coutts wrote: On 7 September 2010 22:50, Ian Lynagh Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of
Data.Text, but do we need the user to have a complete interface for
both?) Mm, this is a fair question. In the case of bytestring we need both
because sometimes for dealing with foreign code or IO you need the
representation to be a contigious block of memory. For text the
representation is more abstract so that need does not arrise. One
might argue that if it is simply to control strictness then one could
use the lazy version and provide a deepseq instance. Here's an alternative argument: suppose we change the representation
of strict text to be a tree of chunks (e.g. finger tree). We could
achieve effecient concatenation. This representation would be
impossible while preserving semantics of a lazy tail. A tree impl that
has any kind of balance needs to know the overall length so cannot
have a lazy tail. The lazy version of Text uses one more word per value than the strict
version. This can be significant for small strings (e.g. ~8 characters)
where the overhead per character already is quite high. If I counted the
size of the BA# constructor correctly, a strict Text has a fixed overhead of
7 words and a lazy Text has an overhead of 8 words. This matters when you
e.g. want to use Texts as keys in a Map.
Btw, I see that the BA# constructor is not manually unpacked into the Array
data type. Is that done automatically since ByteArray# is unlifted or is
there some room for improvement here?
Cheers,
Johan

On 8 September 2010 10:56, Johan Tibell
The lazy version of Text uses one more word per value than the strict version. This can be significant for small strings (e.g. ~8 characters) where the overhead per character already is quite high. If I counted the size of the BA# constructor correctly, a strict Text has a fixed overhead of 7 words and a lazy Text has an overhead of 8 words. This matters when you e.g. want to use Texts as keys in a Map.
Ah, well if we're playing that game then I have a representation where lazy uses the same storage as strict. :-) The trick is to save a word by using smaller length and offset fields (e.g. 16bit). That can be done for lazy but not strict because with lazy you can always break long strings into multiple 2^16 sized chunks whereas for strict it's essential to be able to use 32/64 bit length/offsets.
Btw, I see that the BA# constructor is not manually unpacked into the Array data type. Is that done automatically since ByteArray# is unlifted or is there some room for improvement here?
I'm not sure what you're referring to here, the definition is: data UArray i e = UArray !i !i !Int ByteArray# The ByteArray# is an unlifted type (but its representation is a pointer to a heap object). Duncan

On Wed, Sep 8, 2010 at 12:05 PM, Duncan Coutts wrote: On 8 September 2010 10:56, Johan Tibell Btw, I see that the BA# constructor is not manually unpacked into the
Array
data type. Is that done automatically since ByteArray# is unlifted or is
there some room for improvement here? I'm not sure what you're referring to here, the definition is: data UArray i e = UArray !i !i !Int ByteArray# The ByteArray# is an unlifted type (but its representation is a
pointer to a heap object). The BA# constructor also includes a length field. My question is whether
that gets unpacked into the Array constructor (as in Data.Text.Array, not
UArray).
-- Johan

Hello Johan, Wednesday, September 8, 2010, 3:13:36 PM, you wrote:
data UArray i e = UArray !i !i !Int ByteArray#
The ByteArray# is an unlifted type (but its representation is a pointer to a heap object).
The BA# constructor also includes a length field.
it is the size of memory area allocated, i.e. it's rounded to 4(8) bytes -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Tue, Sep 07, 2010 at 11:21:19PM +0100, Duncan Coutts wrote:
On 7 September 2010 22:50, Ian Lynagh
wrote: I compared the API of Data.Text and Data.ByteString.Char8 and found a number of differences:
Many of these are deliberate and sensible.
Some at least seem just gratuitously different, e.g.: BS: break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString) Text: break :: Text -> Text -> (Text, Text) breakBy :: (Char -> Bool) -> Text -> (Text, Text)
The thing with text as opposed to lists/arrays is that almost all operations you want to do are substring based and not element based. A Unicode code point (a Char) is sadly only roughly related to the human concept of a character. In particular there are combining characters. So even if you want to search or split on a particular "character" that may mean searching for a short sequence of Chars / code points.
Hmm, wouldn't you want to be able to break on either <a-with-umlaut> or <a> <umlaut combining character> in that case? Also, even if the intention is that you break [<a>, <umlaut combining character>] people will still use it for other things, e.g. break "END FOO" and wonder why they are not able to do likewise with bytestring. Even if there is a case where you would want different behaviour in the two packages, I think it would be bettre if the function names weren't the same.
I think the two APIs ought to be brought into agreement.
Perhaps. If so, then it is the ByteString.Char8 that ought to be brought into agreement with Text, not the other way around.
I don't have an opinion on what the APIs should look like; I'd just like them to be consistent.
There are a number of other differences which probably want to be tidied up (mostly functions which are in one package but not the other,
What are you thinking of specifically?
There are a number of them: In Text only: center, chunksOf, dropAround, dropWhileEnd, justifyLeft, justifyRight, partitionBy, prefixed, replace, strip, stripEnd, stripStart, suffixed, compareLength, toCaseFold, toLower, toUpper In BS only: copy, elem, elemIndex, elemIndexEnd, elemIndices, findIndices, findSubstring, findSubstrings, foldr', foldr1', notElem, readInt, readInteger, sort, unzip
ByteString has IO functions mixed in with the non-IO functions,
Which I don't think was a good idea. I would prefer to split them up.
Agreed, but I would like us to move towards consistency. Thanks Ian

On Wed, Sep 8, 2010 at 7:18 AM, Ian Lynagh
Hmm, wouldn't you want to be able to break on either <a-with-umlaut> or <a> <umlaut combining character> in that case?
No. For cases like that, you'd normalize and perhaps case-fold the text and pattern first, then break on a specific string. (Normalization is handled via text-icu.)

On 9/8/10 10:18 AM, Ian Lynagh wrote:
On Tue, Sep 07, 2010 at 11:21:19PM +0100, Duncan Coutts wrote:
Many of these are deliberate and sensible.
Some at least seem just gratuitously different, e.g.:
BS: break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString) breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString) Text: break :: Text -> Text -> (Text, Text) breakBy :: (Char -> Bool) -> Text -> (Text, Text)
One consistency problem I see with this is that the ByteString versions permit breaking on a disjunctive pattern (e.g., \c -> c=='a' || c=='q') whereas the Text version would require multiple passes to perform these queries, since it takes a Text instead of a (Text->Bool). Since proper usage of Text.break requires being able to do various normalizations on the query, it's unclear whether this inconsistency can be remedied effectively. If it cannot, then it seems that the names of the functions should be adjusted in order to make it clear that there is this difference. Other than that, I do agree with the philosophy of the "deliberate and sensible" differences. Though, given the philosophy that these aren't Char-wise operations, why does Text.breakBy accept a (Char->Bool)? Is this just an optimization for common cases like breaking on Unicode-defined whitespace codepoints? -- Live well, ~wren

On Wed, Sep 8, 2010 at 5:21 PM, wren ng thornton wrote: Text: break :: Text -> Text -> (Text, Text)
breakBy :: (Char -> Bool) -> Text -> (Text, Text) One consistency problem I see with this is that the ByteString versions
permit breaking on a disjunctive pattern (e.g., \c -> c=='a' || c=='q')
whereas the Text version would require multiple passes to perform these
queries, since it takes a Text instead of a (Text->Bool). See breakBy in the email you quoted. Other than that, I do agree with the philosophy of the "deliberate and
sensible" differences. Though, given the philosophy that these aren't
Char-wise operations, why does Text.breakBy accept a (Char->Bool)? Is this
just an optimization for common cases like breaking on Unicode-defined
whitespace codepoints? I kept breakBy in there because it is actually useful. I changed its name
because it's by far less common than "I want to break on a string".

On Thu, Sep 09, 2010 at 11:36:30AM -0700, Bryan O'Sullivan wrote:
On Wed, Sep 8, 2010 at 5:21 PM, wren ng thornton
wrote:
Text: break :: Text -> Text -> (Text, Text) breakBy :: (Char -> Bool) -> Text -> (Text, Text)
Other than that, I do agree with the philosophy of the "deliberate and sensible" differences. Though, given the philosophy that these aren't Char-wise operations, why does Text.breakBy accept a (Char->Bool)? Is this just an optimization for common cases like breaking on Unicode-defined whitespace codepoints?
I kept breakBy in there because it is actually useful. I changed its name because it's by far less common than "I want to break on a string".
That makes sense if considering text in isolation, but as part of a system I would prefer consistent names between []/String, bytestring and text. (It shouldn't necessarily be text that changes, though). I also wonder if there should be a f :: (Text -> Bool) -> Text -> (Text, Text) where e.g. the current break is (\t -> f (t `isPrefixOf`)). Thanks Ian

On 9/10/10 4:10 PM, Ian Lynagh wrote:
On Thu, Sep 09, 2010 at 11:36:30AM -0700, Bryan O'Sullivan wrote:
On Wed, Sep 8, 2010 at 5:21 PM, wren ng thornton
wrote:
Text: break :: Text -> Text -> (Text, Text) breakBy :: (Char -> Bool) -> Text -> (Text, Text)
Other than that, I do agree with the philosophy of the "deliberate and sensible" differences. Though, given the philosophy that these aren't Char-wise operations, why does Text.breakBy accept a (Char->Bool)? Is this just an optimization for common cases like breaking on Unicode-defined whitespace codepoints?
I kept breakBy in there because it is actually useful. I changed its name because it's by far less common than "I want to break on a string".
That makes sense if considering text in isolation, but as part of a system I would prefer consistent names between []/String, bytestring and text. (It shouldn't necessarily be text that changes, though).
I also wonder if there should be a f :: (Text -> Bool) -> Text -> (Text, Text)
Yes, that was my point. I can see uses for (Text->...), ((Text->Bool)->...), and ((Char->Bool)->...) but the middle one ---which seems to be the closest analogue to String and ByteString--- is missing. The first one is posited as a replacement for the middle one, but it is insufficient since it cannot perform disjunctive searches. When I pointed that out, I was directed to the third one, which is insufficient because I may wish to search for "characters" which are not individual codepoints. Why do we not just have the middle ((Text->Bool)->...) option? The other two options are subsumed by it, so the only reason to define them is (a) because they can be implemented more efficiently, or (b) as mere shorthands for doing the necessary coercions to turn the first argument into the right type of predicate. -- Live well, ~wren

On Fri, Sep 10, 2010 at 2:03 PM, wren ng thornton < wren@community.haskell.org> wrote:
Yes, that was my point. I can see uses for (Text->...), ((Text->Bool)->...), and ((Char->Bool)->...) but the middle one ---which seems to be the closest analogue to String and ByteString--- is missing. The first one is posited as a replacement for the middle one, but it is insufficient since it cannot perform disjunctive searches.
I don't think anyone posited it as a replacement for the middle one? We could replace Char->Bool with Text->Bool, but it would be slower (and yes, that matters to me). I don't intend to add it myself, but you're welcome to put together a patch and a set of QuickCheck tests.
Why do we not just have the middle ((Text->Bool)->...) option?
Because you can't do a Boyer-Moore search off it.

On 9/10/10 5:18 PM, Bryan O'Sullivan wrote:
On Fri, Sep 10, 2010 at 2:03 PM, wren ng thornton< wren@community.haskell.org> wrote:
Yes, that was my point. I can see uses for (Text->...), ((Text->Bool)->...), and ((Char->Bool)->...) but the middle one ---which seems to be the closest analogue to String and ByteString--- is missing. The first one is posited as a replacement for the middle one, but it is insufficient since it cannot perform disjunctive searches.
I don't think anyone posited it as a replacement for the middle one?
You did, kinda :) More specifically, you proposed (Text->...) as the analogue for ((Char->Bool)->...) and ((Char8->Bool)->...)
We could replace Char->Bool with Text->Bool, but it would be slower (and yes, that matters to me). I don't intend to add it myself, but you're welcome to put together a patch and a set of QuickCheck tests.
Why do we not just have the middle ((Text->Bool)->...) option?
Because you can't do a Boyer-Moore search off it.
I'm fine with the performance argument, I'm just pointing out why I see the API as inconsistent with the String/ByteString APIs. Since the break function for String/ByteString is rather entrenched as being a method for breaking via a single character, a function that uses Boyer--Moore to break on a string (not just strings required by the mismatch between a "character" and a Char) doesn't seem like the analogous function. I think it's much closer to breakSubstring than it is to break. Whether break/breakSubstring or breakBy/break is the better set of names, that's a different bike shed. As for ((Text->Bool)->...) vs ((Char->Bool)->...), I pointed it out because you've mentioned the discrepancy between Char and "characters". In practice, I'd expect that the majority of characters that people wish to break on are indeed Chars, so performance wins out in API design. However, there's no mention of the discrepancy in the documentation, which I think is an oversight. -- Live well, ~wren

On Tue, Sep 7, 2010 at 6:21 PM, Duncan Coutts
Here's an alternative argument: suppose we change the representation of strict text to be a tree of chunks (e.g. finger tree). We could achieve effecient concatenation. This representation would be impossible while preserving semantics of a lazy tail. A tree impl that has any kind of balance needs to know the overall length so cannot have a lazy tail.
A minor addendum to this point, something based on a bootstrapped skew binomial list of bytestrings can give you a lazy tail and O(log n) indexing and drop, but it loses the cheap append which was the original motivation, so while I find it occasionally useful as a quickly indexable, potentially infinite list, with interesting asymptotics, I'd hesitate to propose it for any sort of standard library.
Did you consider keeping the number of characters in the Text directly? Is there a reason it couldn't be done?
There's little point. Knowing the length does not usually help you
save any other O(n) operations. It'd also only help for strict text, not lazy. Just like lists, asking for the length is usually not a good idea.
Actually it _can_ help quite a bit to know the number of characters (not just words) present in the input: It can tell you of the absence of surrogate pairs if the length (in characters) is the same as the number of words. This then lets you do indexing as an O(1) operation. I used this in a simple fingertree of utf-8 encoded bytestrings to enable faster indexing into leaves that didn't have utf-8 tailbytes present. Since a lot of UTF-8 text has huge swathes of ASCII content, this turned out to be a pretty big win for me. Since the vast majority of UTF-16 text probably contains all Plane 0 (UCS2) content, you can get similar wins. The cost is that you either have to have a sentinel 'don't know' value or pay to scan any arrays that you directly convert to Text. As an aside: I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) that could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself. With ByteString at least, if I must, I can go grab the Data.Bytestring.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community. But really the last nail in the coffin for me, is that often I want to be able to just mmap data in from disk and use it, and rarely is my data on disk stored in UTF-16. Those nits aside, overall, Text provides a clean API for what it does, and I'm completely on board with its inclusion in the platform (but I would really really appreciate having access via a scary deprecated please-don't-use-me Internal module to its guts!) -Edward Kmett

On 8 September 2010 19:43, Edward Kmett
Did you consider keeping the number of characters in the Text directly? Is there a reason it couldn't be done?
There's little point. Knowing the length does not usually help you save any other O(n) operations. It'd also only help for strict text, not lazy. Just like lists, asking for the length is usually not a good idea.
Actually it _can_ help quite a bit to know the number of characters (not just words) present in the input:
It can tell you of the absence of surrogate pairs if the length (in characters) is the same as the number of words. This then lets you do indexing as an O(1) operation. I used this in a simple fingertree of utf-8 encoded bytestrings to enable faster indexing into leaves that didn't have utf-8 tailbytes present. Since a lot of UTF-8 text has huge swathes of ASCII content, this turned out to be a pretty big win for me. Since the vast majority of UTF-16 text probably contains all Plane 0 (UCS2) content, you can get similar wins. The cost is that you either have to have a sentinel 'don't know' value or pay to scan any arrays that you directly convert to Text.
Text does not need to fit every use case. Our impression is that most applications do not need string indexing and for the few that do, a specialised structure is perfectly reasonable (eg using arrays of code points, or trees of arrays of code points). Text should be a replacement for String: a type used for general string manipulation a common type used in interfaces for passing strings between components. In the general rage of use cases it should offer reasonable memory use and performance. It does not need to be the perfect choice for the buffer of a text editor or other special internal use cases. It's an important point in the API design of Text that indexing is discouraged because, given a compact variable-length internal representation, you cannot make any useful promises about the performance of indexing. As you point out you can sometimes do better than O(n) but generally you cannot and it's not good to encourage a style that often performs well but occasionally is terrible.
As an aside:
I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) that could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself.
Can you be more specific. The Text API design is that substring is the "cursor" and you don't need any other index.
With ByteString at least, if I must, I can go grab the Data.Bytestring.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community.
But really the last nail in the coffin for me, is that often I want to be able to just mmap data in from disk and use it, and rarely is my data on disk stored in UTF-16.
It's pretty important that the internal representation not be fixed by being exposed in the API. In future if benchmarks demonstrate that some different encoding offers a better balance of performance and memory use then we need to be able to switch without breaking everyone's programs. Note that even if the internal encoding happened to match your external encoding, the whole lot would still need to be forced into memory to validate the encoding. So you would be able to share memory with the page cache but not avoid any IO. Duncan

On Wed, Sep 8, 2010 at 3:28 PM, Duncan Coutts
Text does not need to fit every use case. Our impression is that most applications do not need string indexing and for the few that do, a specialised structure is perfectly reasonable (eg using arrays of code points, or trees of arrays of code points).
I understand that viewpoint, and it is a perfectly reasonable scope to
choose. I was simply pointing out that it has kept me from being able to use Text for a number of applications. =)
Text should be a replacement for String: a type used for general string manipulation a common type used in interfaces for passing strings between components. In the general rage of use cases it should offer reasonable memory use and performance. It does not need to be the perfect choice for the buffer of a text editor or other special internal use cases.
It's an important point in the API design of Text that indexing is discouraged because, given a compact variable-length internal representation, you cannot make any useful promises about the performance of indexing. As you point out you can sometimes do better than O(n) but generally you cannot and it's not good to encourage a style that often performs well but occasionally is terrible.
As an aside:
I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) that could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself.
Can you be more specific. The Text API design is that substring is the "cursor" and you don't need any other index.
I understand that design decision, but it doesn't work for my use case. An example of something I do with fingertrees of bytestrings that I can't do with text can be found in the second set of slides here: http://comonad.com/reader/2009/iteratees-parsec-and-monoid/ When playing with iteratees when you need more input it glues together the buffer onto the end of the existing buffer. If your buffer is a bytestring (or text) that can be pretty painful due to the O(n) append, and potentially large amount of read ahead in certain classes of parsers. Even if you go to lazy bytestrings/text, its kind of a crappy solution, because you are consistently appending the buffers on the wrong end of the list, so the asymptotics suffer. To that end, I would up using a variant on an iteratee which instead of directly concatenating, stored the bytestring fragments in a fingertree (or skew binary random acccess list). This fixed the asymptotics on long look ahead, in exchange for bloating my memory footprint and killing the space usage guarantees provided by Oleg's iteratees. So now I have a fingertree of buffers (be them bytestrings or text) being held on to by my iteratee. I can choose to retain all of the buffers read (rather than just the unparsed 'right hand side') in the fingertree. This lets me backtrack. Now I can implement a parsec parser where the current 'input' is just a cursor location -- an offset into the fingertree of buffers. This means that parsec's built-in try functionality works, and that I can run a parsec parser 'online'. The most common operation is to read the next character from _some_ index/input, but it doesn't always just drop a character as it goes. Futhermore, I have these cursors as meaningful references into the input buffer. I can say a lot about them. I can slice between two cursors from the same set of buffers and I can even know if they came from the same bytestring/text fragment based on the index and where each fragment starts and ends, and if so, no copying is involved at all. Since this is happening inside of a monad (Parsec), there could be a bunch of these cursors alive at any point. My memory usage dropped off _dramatically_ when I made this improvement. There doesn't exist a good equivalent under the API that I have available to me in Text, and I don't have the ability to build one over the exposed API. I can implement the common 'parsec' input pattern, but not invert control of it using the iteratee machinery to let me run it on-line, not without walking backwards and forwards over the same bits of Text repeatedly and doing all sorts of manual tracking that blows up my asymptotics. Do I believe that the Text API should expose this kind of functionality? I'd hazard not. In the above, each of my cursor positions is well formed, it always comes between two characters. Any slice made between any two valid cursors for the buffer set succeeds. To provide those guarantees you need rank 2 types and general scariness that does not belong in a Platform library. However, my first attempt at implementing this WAS on top of the Text API, until I ran into a situation where the design goal of encapsulation forced me to a full stop, and required me to go back and re-engineer on top of ByteString. ByteString has a very similar API, but is at least currently willing to expose an Internal module as a dodge. I realize there is a desire from a library encapsulation perspective to hide the internals, to make them easier to change, and to make it so the public facade isn't so brittle. I simply wish to point out that there is a cost for such abstraction. Namely that people who would otherwise happily be using a library and singing its praises can be completely locked out of it.
With ByteString at
least, if I must, I can go grab the Data.Bytestring.Internal module and have at it. I'm all for a clean API, but a clean API that hurts asymptotics invites a lot of re-invention on the fringes of the community.
But really the last nail in the coffin for me, is that often I want to be able to just mmap data in from disk and use it, and rarely is my data on disk stored in UTF-16.
It's pretty important that the internal representation not be fixed by being exposed in the API. In future if benchmarks demonstrate that some different encoding offers a better balance of performance and memory use then we need to be able to switch without breaking everyone's programs.
I understand that. However, for me as an end user, the immediate consequence of your long term intentions are that I am stuck re-inventing the wheel and I get to derive no benefit from an otherwise wonderful library. This consideration does not come without a cost. Note that even if the internal encoding happened to match your
external encoding, the whole lot would still need to be forced into memory to validate the encoding. So you would be able to share memory with the page cache but not avoid any IO.
No arguments there. Again, there is a wholehearted +1 from me for the adoption of 'text' as part of the platform. I simply felt the need to qualify that +1 with why I find myself, sadly, unable to derive any benefit from its existence. -Edward Kmett

On 9/8/10 4:34 PM, Edward Kmett wrote:
Do I believe that the Text API should expose this kind of functionality? I'd hazard not. In the above, each of my cursor positions is well formed, it always comes between two characters. Any slice made between any two valid cursors for the buffer set succeeds. To provide those guarantees you need rank 2 types and general scariness that does not belong in a Platform library.
Well, as per the proposal, text already requires Rank2Types, ScopedTypeVariables, and ExistentialQuantification. That should be enough to get started >;) -- Live well, ~wren

On Wed, Sep 8, 2010 at 11:43 AM, Edward Kmett
I want to like and use Data.Text, but I haven't been able to. Too many of the operations devolve to O(n) that could be O(1) if I had some way to slice based on cursors into the data type, and the current API locks me out of the internals, so I can't write those functions myself.
What kinds of functions are you thinking about? The Text type *is* the cursor into the underlying array. What specific operations do you want to perform that you'd prefer to be O(1)? But really the last nail in the coffin for me, is that often I want to be
able to just mmap data in from disk and use it, and rarely is my data on disk stored in UTF-16.
Yep, the text library just isn't intended for that right now.

Thanks for your comments, Ian. I appreciate your time and care in looking this over!
Incidentally, I've just noticed some broken haddock markup for: I/O libraries /do not support locale-sensitive I\O in
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-...
Thanks for spotting that. It appears to be due to a Haddock bug, unfortunately.
a much larger variety of encoding functions
Why not bundle these in the text package, or also put this package in the platform?
Either one would induce a dependency on text-icu, which is not as mature as text, and which would imply a dependency on the rather large ICU library. I do believe that text-icu should be submitted, but not until it's ready. On
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.... a number of haddocks say Subject to fusion. but I can't see an explanation for the new user of what this means or why they should care.
That's not quite true: it's actually the very first thing documented: http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.... However, that description is skimpy, and I've replaced it: -- Most of the functions in this module are subject to /fusion/, -- meaning that a pipeline of such functions will usually allocate at -- most one 'Text' value. -- -- As an example, consider the following pipeline: -- -- > import Data.Text as T -- > import Data.Text.Encoding as E -- > -- > countChars :: ByteString -> Int -- > countChars = T.length . T.toUpper . E.decodeUtf8 -- -- From the type signatures involved, this looks like it should -- allocate one 'ByteString' value, and two 'Text' values. However, -- when a module is compiled with optimisation enabled under GHC, the -- two intermediate 'Text' values will be optimised away, and the -- function will be compiled down to a single loop over the source -- 'ByteString'. -- -- Functions that can be fused by the compiler are marked with the -- phrase \"Subject to fusion\".
In
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-... I would expect lenientDecode etc to use the On{En,De}codeError type synonyms defined above.
Good point. I've fixed that up.
In
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-... the choice 'B' seems odd: import qualified Data.Text.Lazy as B
Yep. Fixed :-)
I would have expected
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.... to mention the existence of .Lazy in its description, and an explanation of when I should use it.
I've expanded that discussion. Are there cases when Data.Text is significantly faster than
Data.Text.Lazy?
It's often about twice as fast, but that depends on the nature of the code and data involved.
In
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.... isInfixOf's docs day: O(n+m) The isInfixOf function takes two Texts and returns True iff the first is contained, wholly and intact, anywhere within the second. In (unlikely) bad cases, this function's time complexity degrades towards O(n*m). I think the complexity at the start, in the same place as all the other complexities, ought to be O(n*m), with the common case given afterwards.
I'd prefer to keep this as is.
And replace's docs just say O(m+n) Replace every occurrence of one substring with another. but should presumably be O(n*m). It's also not necessarily clear what m and n refer to.
The two parameters to the function?
unicode-unaware case conversion (map toUpper is an unsafe case
conversion)
Surely this is something that should be added to Data.Char, irrespective of whether text is added to the HP?
Yes, but that's a not-this-problem problem.
A large testsuite, with coverage data, is provided.
It would be nice if this was on the text package's page, rather than in ~dons.
I don't know how to do that.
RecordWildCards
I'm not a fan, but I fear I may be in the minority.
It's just used internally, so why do you mind? There are a number of other differences which probably want to be tidied
up (mostly functions which are in one package but not the other, and ByteString has IO functions mixed in with the non-IO functions), but those seemed to be the most significant ones. Also, prefixed :: Text -> Text -> Maybe Text is analogous to stripPrefix :: Eq a => [a] -> [a] -> Maybe [a] in Data.List
I hadn't seen that. Hmm. For use with view patterns, I prefer the name I'm using right now. This also made me notice that Text haddocks tend to use 'b' as a type
variable rather than 'a', e.g. foldl :: (b -> Char -> b) -> b -> Text -> b
Historical artifact :-)

On Tue, Sep 07, 2010 at 07:10:27PM -0700, Bryan O'Sullivan wrote:
Thanks for your comments, Ian. I appreciate your time and care in looking this over!
Actually, it's interesting you say that, because I don't think I looked at the package carefully. I didn't look at the source at all, I briefly skimmed the haddocks (mostly just to check that it looks like they all existed, as that's one of the criteria), and I didn't check that the package API looks sensible and consistent. In fact, the only reason I looked at the API at all was that I had something to diff it against. As a comment on the process, perhaps we should require that there are 2 or 3 people who can say that they have used the API (perhaps with hpc results to see /how much/ they use it), and that it seems sensible (i.e. they weren't having to work around missing or broken functionality). Actually, I've just taken a quick look at a random bit of code, and with Data.Text.Foreign.fromPtr and init :: Text -> Text init (Text arr off len) | len <= 0 = emptyError "init" | n >= 0xDC00 && n <= 0xDFFF = textP arr off (len-2) | otherwise = textP arr off (len-1) where n = A.unsafeIndex arr (off+len-1) it looks like I can create a Text with length -1 by doing (init (fromPtr [0xDC00] 1)), which makes me nervous. I wonder if fromPtr should be renamed unsafeFromPtr. init would still make me nervous, though. By the way, fromPtr asserts (len > 0), but from the haddock docs I'd assume that (fromPtr p 0) is fine.
Incidentally, I've just noticed some broken haddock markup for: I/O libraries /do not support locale-sensitive I\O in
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-...
Thanks for spotting that. It appears to be due to a Haddock bug, unfortunately.
Looking at the source, I'd guess you can work around it by moving the linebreaks. And it actually looks like 2 haddock bugs: /.../ can't span lines, and \/ isn't recognised inside /.../. Would be good to get haddock tickets filed.
Subject to fusion. but I can't see an explanation for the new user of what this means or why they should care.
That's not quite true: it's actually the very first thing documented:
Sorry, my fault! I read the "Description" at the top, and erroneously assumed that only function-specific docs would follow the "Synopsis".
And replace's docs just say O(m+n) Replace every occurrence of one substring with another. but should presumably be O(n*m). It's also not necessarily clear what m and n refer to.
The two parameters to the function?
But replace takes 3 arguments! The complexity must be at least the second and third multiplied together, as replace "x" (replicate y 'y') (replicate z 'x') makes y*z words in the heap.
unicode-unaware case conversion (map toUpper is an unsafe case
conversion)
Surely this is something that should be added to Data.Char, irrespective of whether text is added to the HP?
Yes, but that's a not-this-problem problem.
Oh, I didn't mean to suggest that you should fix it. I just don't think it motivates adding the text package to the HP, and thus doesn't belong in the proposal.
RecordWildCards
I'm not a fan, but I fear I may be in the minority.
It's just used internally, so why do you mind?
I'm sure I'll need to look at the code at some point.
There are a number of other differences which probably want to be tidied
up (mostly functions which are in one package but not the other, and ByteString has IO functions mixed in with the non-IO functions), but those seemed to be the most significant ones. Also, prefixed :: Text -> Text -> Maybe Text is analogous to stripPrefix :: Eq a => [a] -> [a] -> Maybe [a] in Data.List
I hadn't seen that. Hmm. For use with view patterns, I prefer the name I'm using right now.
I'd like us to proceed in a way that means we haven't still got Data.List.stripPrefix and Data.Text.prefixed in the HP in 3 years time. Thanks Ian

On 8 September 2010 14:43, Ian Lynagh
There are a number of other differences which probably want to be tidied
up (mostly functions which are in one package but not the other, and ByteString has IO functions mixed in with the non-IO functions), but those seemed to be the most significant ones. Also, prefixed :: Text -> Text -> Maybe Text is analogous to stripPrefix :: Eq a => [a] -> [a] -> Maybe [a] in Data.List
I hadn't seen that. Hmm. For use with view patterns, I prefer the name I'm using right now.
I'd like us to proceed in a way that means we haven't still got Data.List.stripPrefix and Data.Text.prefixed in the HP in 3 years time.
Yes, they should be the same, one way or the other. Duncan

On Wed, Sep 8, 2010 at 6:55 AM, Duncan Coutts
I'd like us to proceed in a way that means we haven't still got Data.List.stripPrefix and Data.Text.prefixed in the HP in 3 years time.
Yes, they should be the same, one way or the other.
Well since there's an existing precedent, I changed the names in Data.Text to strip{Pre,Suf}fix.

A few more bits: copyI describes its last parameter as First offset in source /not/ to copy (i.e. /not/ length) but I think it's actually the first offset in /dest/ not to /write to/. In Data.Text.append there's a (len1+len2) that could be just len. This program segfaults (on 64bit; try 2^28 if on 32bit): import Data.Text as T main :: IO () main = do let s = T.replicate (2^60) (T.pack "0123456789abcdef") print $ T.last s print $ T.head s I'm not sure how hard we're expecting ourselves to be looking for bugs in proposed packages. More code review would be a great thing, but we'd have to find the time to do it. Thanks Ian

On Wed, Sep 8, 2010 at 6:12 PM, Ian Lynagh
copyI describes its last parameter as First offset in source /not/ to copy (i.e. /not/ length) but I think it's actually the first offset in /dest/ not to /write to/.
That's correct.
In Data.Text.append there's a (len1+len2) that could be just len.
Fixed :)
This program segfaults (on 64bit; try 2^28 if on 32bit):
Also fixed, and a regression test added. Thanks!

On Thu, Sep 09, 2010 at 08:56:29PM -0700, Bryan O'Sullivan wrote:
On Wed, Sep 8, 2010 at 6:12 PM, Ian Lynagh
wrote: This program segfaults (on 64bit; try 2^28 if on 32bit):
Also fixed, and a regression test added.
Still segfaults for me in 0.9, and I can't see the test for it. Thanks Ian

On Sun, Oct 03, 2010 at 11:05:49PM -0400, Bryan O'Sullivan wrote:
On Tue, Sep 21, 2010 at 10:53 AM, Ian Lynagh
wrote: Still segfaults for me in 0.9, and I can't see the test for it.
Turns out there were two bugs, not just one. It's definitely fixed in 0.9.0.1 :-)
Yup, that fixes that case, but this one still segfaults (on a 32bit platform): import qualified Data.Text as T main :: IO () main = do let s = T.replicate (5^11) (T.pack (replicate 89 'a')) print $ T.last s print $ T.head s Thanks Ian

Hi all, I've updated the list of open issues. I should have done so while each issue was being discussed, but I didn't do so. My apologies. If you think there is a significant open issue that I haven't listed, please drop me an email and I'll add it. Note that the list of open issues is not necessarily the same as as a list of changes to make to the package. They are there so we can track our decision on each particular issue (which might be: do nothing). Here's the current list: 1. The text-icu package is not part of this proposal, as adding it would make the platform depend on the ICU C library. Both the text package and the base package provide Unicode encoding/decoding functionality. Perhaps some of this functionality could be merged. 2. Some functions have similar names to functions in the bytestring package but have different types (other than ByteString vs Text.) 3. Some functions have the same type but different names. 4. Do we need both a strict and lazy version of Text? The strict version needs one less indirection, can be unpacked in function arguments and takes less space when stored in data types. Please be mindful of Bryan's time and try to balance the amount of work needed to the value of a particular API change. Cheers, Johan

On Mon, Oct 04, 2010 at 01:59:55PM -0400, Johan Tibell wrote:
I've updated the list of open issues.
Thanks!
1. The text-icu package is not part of this proposal, as adding it would make the platform depend on the ICU C library.
FWIW, I think it was me who raised that, and the answer satisfied me.
Please be mindful of Bryan's time and try to balance the amount of work needed to the value of a particular API change.
As a general process point, surely we should just work out what API changes we think should be made, and if the maintainer/developers can't make them in time for the next HP release then the package can go into the following one? Thanks Ian

igloo:
On Mon, Oct 04, 2010 at 01:59:55PM -0400, Johan Tibell wrote:
I've updated the list of open issues.
Thanks!
1. The text-icu package is not part of this proposal, as adding it would make the platform depend on the ICU C library.
FWIW, I think it was me who raised that, and the answer satisfied me.
Please be mindful of Bryan's time and try to balance the amount of work needed to the value of a particular API change.
As a general process point, surely we should just work out what API changes we think should be made, and if the maintainer/developers can't make them in time for the next HP release then the package can go into the following one?
Or: we put it in, and the next major rev includes those changes. The key thing is the changes are documented. Until all of the HP is at the same high level, we may need to make some tradefoffs to move faster.

On Mon, Oct 04, 2010 at 02:35:43PM -0700, Donald Bruce Stewart wrote:
igloo:
As a general process point, surely we should just work out what API changes we think should be made, and if the maintainer/developers can't make them in time for the next HP release then the package can go into the following one?
Or: we put it in, and the next major rev includes those changes.
If there are many changes, then I don't see the advantage of putting it in before they are made. We would be telling people to write code against the old API, and then breaking their code 6 months later. If there are few changes, and the developers still don't have time to fix them, that would suggest the package doesn't meet the spirit of the "All packages in the platform must have a maintainer" requirement. Note also that accepting the package as-is removes some of the motivation for the developers to make the changes.
Until all of the HP is at the same high level, we may need to make some tradefoffs to move faster.
I think the bar should be higher for new packages, or it will get harder and harder to raise the standard as the platform grows. Thanks Ian

On Mon, 2010-10-04 at 22:34 +0100, Ian Lynagh wrote:
As a general process point, surely we should just work out what API changes we think should be made, and if the maintainer/developers can't make them in time for the next HP release then the package can go into the following one?
That is correct. The package proposal process allows for exactly that. A package is considered as conditionally accepted if, by the discussion deadline, the libraries mailing list reaches consensus to accept it on condition that further minor changes are made. The agreed changes must be made before the package is included in any release. If these changes are made in time for the normal package freeze dates (as set by the release team) then the package is considered as accepted. If the changes cannot be made in time for the immediate major release but are made in time for the subsequent major release then the package is considered as accepted for that subsequent major release and does not need to be re-reviewed. http://trac.haskell.org/haskell-platform/wiki/AddingPackages#Acceptance Duncan

On Mon, Oct 4, 2010 at 10:59 AM, Johan Tibell
I've updated the list of open issues.
Thanks, Johan. I've updated those issues. With the exception of #3 ("Some functions have the same type but different names"), I don't think any of them is particularly significant, or worth addressing in the short term. I believe that #3 is actually resolved, but I haven't deleted it pending confirmation from Ian or others.

On Mon, Oct 04, 2010 at 04:21:58PM -0700, Bryan O'Sullivan wrote:
With the exception of #3 ("Some functions have the same type but different names"), I don't think any of them is particularly significant, or worth addressing in the short term. I believe that #3 is actually resolved, but I haven't deleted it pending confirmation from Ian or others.
My opinion is still that the List, bytestring and text APIs should match where they overlap (overlap either in name or in type). This might mean changing List/bytestring, rather than text, in some cases. But if the consensus is otherwise, then no problem. I don't remember how many opinions were expressed, or what they were. Thanks Ian

Ian Lynagh wrote:
On Mon, Oct 04, 2010 at 04:21:58PM -0700, Bryan O'Sullivan wrote:
With the exception of #3 ("Some functions have the same type but different names"), I don't think any of them is particularly significant, or worth addressing in the short term. I believe that #3 is actually resolved, but I haven't deleted it pending confirmation from Ian or others.
My opinion is still that the List, bytestring and text APIs should match where they overlap (overlap either in name or in type). This might mean changing List/bytestring, rather than text, in some cases.
But if the consensus is otherwise, then no problem. I don't remember how many opinions were expressed, or what they were.
I agree with Ian's view. Inconsistencies in core libraries will cause people a lot of confusion. On the other hand, if noone has the time to actually go through the exercise of figuring out what needs to change, then having text in the platform with inconsistencies would be better than not having it. Ganesh =============================================================================== Please access the attached hyperlink for an important electronic communications disclaimer: http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html ===============================================================================

On 5 October 2010 11:56, Ian Lynagh
On Mon, Oct 04, 2010 at 04:21:58PM -0700, Bryan O'Sullivan wrote:
With the exception of #3 ("Some functions have the same type but different names"), I don't think any of them is particularly significant, or worth addressing in the short term. I believe that #3 is actually resolved, but I haven't deleted it pending confirmation from Ian or others.
My opinion is still that the List, bytestring and text APIs should match where they overlap (overlap either in name or in type). This might mean changing List/bytestring, rather than text, in some cases.
But if the consensus is otherwise, then no problem. I don't remember how many opinions were expressed, or what they were.
FWIW I'm also strongly in favour of having name consistency before including text in the Platform. Ideally the HP should feel like it was written by one person. Introducing several names for the same concept just increases accidental complexity in the system. Cheers, Max

On Tue, Oct 5, 2010 at 11:07 AM, Max Bolingbroke wrote: FWIW I'm also strongly in favour of having name consistency before
including text in the Platform. Well, if that's the consensus, then we're at an impasse, because the
libraries process is a trackless mire as has been fairly clearly
demonstrated over the past few days, and I'm not going to change the names
or types of the text functions. I've got better things to do than slog
through such an ungratifying and demoralising morass.

On Oct 5, 2010, at 2:18 PM, Bryan O'Sullivan wrote:
On Tue, Oct 5, 2010 at 11:07 AM, Max Bolingbroke
wrote: FWIW I'm also strongly in favour of having name consistency before including text in the Platform.
Well, if that's the consensus, then we're at an impasse, because the libraries process is a trackless mire as has been fairly clearly demonstrated over the past few days, and I'm not going to change the names or types of the text functions. I've got better things to do than slog through such an ungratifying and demoralising morass.
Luckily there's an alternative that involves neither the libraries process or changing the Text API. Instead, we could, as Bryan had suggested prior, simply change the Bytestring API to match that of Text. As has been pointed out, bytestring is not under the direction of the libraries list, but is actively maintained by dons and dcoutts. So if the bytestring maintainers look favorably on matching the Text API, things can be moved along very quickly. That said, while most of the changes in Text make perfect sense for a character based rather than text based API, and most make sense to move to bytestring, there are the few ones which are pure taste. Those are, particularly, break vs. breakBy vs. breakSubstring, and find vs. findBy, and span vs. spanBy. In all those cases, there's an argument that Text makes the "common case" the short one, and this is good. However, break, span, and find are all part of my standard haskell vocabulary. I know their types from Data.List, use them frequently, and think with them. At this point, to change what they mean would lead to a fair amount of frustration and irritation. Ideally, text could be changed to match bytestring for those functions, bytestring to match text otherwise, and then we could move along to getting text into the platform, which is what we all ultimately want :-) Cheers, Sterl.

On Tue, Oct 05, 2010 at 02:50:14PM -0400, Sterling Clover wrote:
That said, while most of the changes in Text make perfect sense for a character based rather than text based API, and most make sense to move to bytestring, there are the few ones which are pure taste.
Those are, particularly, break vs. breakBy vs. breakSubstring, and find vs. findBy, and span vs. spanBy.
In all those cases, there's an argument that Text makes the "common case" the short one, and this is good. However, break, span, and find are all part of my standard haskell vocabulary. I know their types from Data.List, use them frequently, and think with them. At this point, to change what they mean would lead to a fair amount of frustration and irritation.
On a scan through Data.Text, it seems the only incompatibilities with Data.List are breakBy vs break, findBy vs find, partitionBy vs partition and spanBy vs span. In each of these cases bytestring agrees with Data.List, and I'd agree that text should too.

On 5 Oct 2010, at 19:18, Bryan O'Sullivan wrote:
Well, if that's the consensus, then we're at an impasse, because the libraries process is a trackless mire as has been fairly clearly demonstrated over the past few days, and I'm not going to change the names or types of the text functions. I've got better things to do than slog through such an ungratifying and demoralising morass.
But if someone else were to track down all the name inconsistencies, fix them, and submit a patch to you, would you accept it? Regards, Malcolm P.S. I agree with you somewhat about the demoralising nature of the Haskell Platform addition process. I have been sitting on the sidelines thinking "thank goodness it wasn't one of my packages that was proposed". I've also been thinking, how come new packages in the HP are held to higher standards than the existing ones? AFAICT, many of the current packages are in the Platform simply because a ghc hacker once decided to use them (and hence they became widely distributed, regardless of quality). (Yes, I'm looking at you, parsec and containers.) Sadly, I don't have any better suggestions for a submissions process.

On Tue, 2010-10-05 at 20:29 +0100, Malcolm Wallace wrote:
On 5 Oct 2010, at 19:18, Bryan O'Sullivan wrote:
Well, if that's the consensus, then we're at an impasse, because the libraries process is a trackless mire as has been fairly clearly demonstrated over the past few days, and I'm not going to change the names or types of the text functions. I've got better things to do than slog through such an ungratifying and demoralising morass.
But if someone else were to track down all the name inconsistencies, fix them, and submit a patch to you, would you accept it?
I've not followed the entire thread but perhaps it has not been discussed sufficiently clearly that it is not a question of accidental inconsistencies. There was (and imho correct) choice to make the primary mode of operation on Text be substring rather than element based. Thus the substring options get the primary names and predicate versions that operate on elements get secondary names. It's also not at all clear to me (as someone with both an interest in bytestring and text) that it makes sense for bytestring to change to be substring based. It's plausible but it seems to me to make more sense to keep bytestring's operations element (ie byte) based rather than substring based. If everyone else thinks that it's vital that the same name in text be element based then we do have a tricky question with naming. We do want the primary operations to be substring, not element, so we would have to come up with some naming convention that lets us have sensible names for the primary operations while still having (much less useful) element based ones.
I've also been thinking, how come new packages in the HP are held to higher standards than the existing ones? AFAICT, many of the current packages are in the Platform simply because a ghc hacker once decided to use them (and hence they became widely distributed, regardless of quality). (Yes, I'm looking at you, parsec and containers.) Sadly, I don't have any better suggestions for a submissions process.
We were aware of the problem of the grandfathered packages when we started. As you say there's not an obvious solution. We don't want to just say that things keep getting in without enough review. There is also the opportunity to improve things that are there already, e.g. see the current proposal to improve mtl. Also, hopefully the higher quality of the newer packages will embarrass us enough into improving the existing ones. As for a demoralising process, perhaps we can make better use of proposers (as distinct from maintainers). There is also the consensus protocol [1] described in the procedure. We should also ask the steering committee [2] to help keep things moving along. Duncan [1]: http://trac.haskell.org/haskell-platform/wiki/AddingPackages#Consensus [2]: http://trac.haskell.org/haskell-platform/wiki/Members#SteeringCommittee

On Fri, Oct 08, 2010 at 01:23:28AM +0100, Duncan Coutts wrote:
If everyone else thinks that it's vital that the same name in text be element based then we do have a tricky question with naming. We do want the primary operations to be substring, not element, so we would have to come up with some naming convention that lets us have sensible names for the primary operations while still having (much less useful) element based ones.
More seriously, 5 of the names conflict with corresponding standard list functions. As far as I can see, the naming incompatibilities between the 3 packages are the following: text base bytestring type in text (or equivalent if absent) --------------------------------------------------------------------------- break - breakSubstring Text -> Text -> (Text, Text) breakBy break break (Char -> Bool) -> Text -> (Text, Text) breakEnd - - Text -> Text -> (Text, Text) - - breakEnd (Char -> Bool) -> Text -> (Text, Text) count - - Text -> Text -> Int - - count Char -> Text -> Int find - - Text -> Text -> [(Text, Text)] findBy find find (Char -> Bool) -> Text -> Maybe Char partitionBy partition - (Char -> Bool) -> Text -> (Text, Text) replicate - - Int -> Text -> Text - replicate replicate Int -> Char -> Text spanBy span span (Char -> Bool) -> Text -> (Text, Text) split - - Text -> Text -> [Text] - - split Char -> Text -> [Text] splitBy - splitWith (Char -> Bool) -> Text -> [Text] unfoldrN - - Int -> (a -> Maybe (Char, a)) -> a -> Text - - unfoldrN Int -> (a -> Maybe (Char, a)) -> a -> (Text, Maybe a) zipWith zipWith - (Char -> Char -> Char) -> Text -> Text -> Text - zipWith zipWith (Char -> Char -> a) -> Text -> Text -> [a] * The -By suffix has been used for predicate versions in 5 cases here, but not for filter and findIndex. * The find function has no connection with findBy. It ought to have a name that is the plural of the name of the break function.

On Fri, Oct 8, 2010 at 11:00 AM, Ross Paterson
As far as I can see, the naming incompatibilities between the 3 packages are the following:
text base bytestring type in text (or equivalent if absent) --------------------------------------------------------------------------- break - breakSubstring Text -> Text -> (Text, Text) breakBy break break (Char -> Bool) -> Text -> (Text, Text) breakEnd - - Text -> Text -> (Text, Text) - - breakEnd (Char -> Bool) -> Text -> (Text, Text) count - - Text -> Text -> Int - - count Char -> Text -> Int find - - Text -> Text -> [(Text, Text)] findBy find find (Char -> Bool) -> Text -> Maybe Char partitionBy partition - (Char -> Bool) -> Text -> (Text, Text) replicate - - Int -> Text -> Text - replicate replicate Int -> Char -> Text spanBy span span (Char -> Bool) -> Text -> (Text, Text) split - - Text -> Text -> [Text] - - split Char -> Text -> [Text] splitBy - splitWith (Char -> Bool) -> Text -> [Text] unfoldrN - - Int -> (a -> Maybe (Char, a)) -> a -> Text - - unfoldrN Int -> (a -> Maybe (Char, a)) -> a -> (Text, Maybe a) zipWith zipWith - (Char -> Char -> Char) -> Text -> Text -> Text - zipWith zipWith (Char -> Char -> a) -> Text -> Text -> [a]
* The -By suffix has been used for predicate versions in 5 cases here, but not for filter and findIndex. * The find function has no connection with findBy. It ought to have a name that is the plural of the name of the break function.
Given text's focus on subsequences rather than single elements these differences make sense to me after a quick first scan.

On Friday 08 October 2010 17:16:20, Johan Tibell wrote:
On Fri, Oct 8, 2010 at 11:00 AM, Ross Paterson
wrote: As far as I can see, the naming incompatibilities between the 3 packages are the following:
text base bytestring type in text (or equivalent if absent) ---------------------------------------------------------------------- ----- break - breakSubstring Text -> Text -> (Text, Text) breakBy break break (Char -> Bool) -> Text -> (Text, Text) breakEnd - - Text -> Text -> (Text, Text) - - breakEnd (Char -> Bool) -> Text -> (Text, Text) count - - Text -> Text -> Int - - count Char -> Text -> Int find - - Text -> Text -> [(Text, Text)] findBy find find (Char -> Bool) -> Text -> Maybe Char partitionBy partition - (Char -> Bool) -> Text -> (Text, Text) replicate - - Int -> Text -> Text - replicate replicate Int -> Char -> Text spanBy span span (Char -> Bool) -> Text -> (Text, Text) split - - Text -> Text -> [Text] - - split Char -> Text -> [Text] splitBy - splitWith (Char -> Bool) -> Text -> [Text] unfoldrN - - Int -> (a -> Maybe (Char, a)) -> a -> Text - - unfoldrN Int -> (a -> Maybe (Char, a)) -> a -> (Text, Maybe a) zipWith zipWith - (Char -> Char -> Char) -> Text -> Text -> Text - zipWith zipWith (Char -> Char -> a) -> Text -> Text -> [a]
* The -By suffix has been used for predicate versions in 5 cases here, but not for filter and findIndex. * The find function has no connection with findBy. It ought to have a name that is the plural of the name of the break function.
Given text's focus on subsequences rather than single elements these differences make sense to me after a quick first scan.
Sure. But at least with break and replicate there's the problem that the functions on lists are used a lot, so text's naming violates the principle of least surprise. (Not sure whether find is used a lot.) So what's worse, having the good names for substring oriented things in text and surprise unsuspecting users when "break (== '\n') text" doesn't compile or having more cumbersome names for the substring oriented functions? I think having break and replicate corresponding to the Prelude types is preferable, although within the text package the current names are darned nice.

I'm not sure I agree with the principle of least surprise here. Dealing
correctly with text is hard and the text package presents a nice API for it.
Surprise is good, here, I think: many developers out there don't even
realize ASCII isn't good enough for text, and even the unicode-aware might
not realize that indexing into text or case-insensitive matching are
non-trivial operations. Forcing them to fix their code (through compilation
errors) seems like a good idea, to me.
A new module, Data.Text.ListCrutch, could contain an exact duplicate of the
Data.List functions with warnings attached to them warning (through
deprecated pragmas?) users of the library that text is not simply a list of
characters, and they should stop treating it as such.
On Fri, Oct 8, 2010 at 5:42 PM, Daniel Fischer
On Friday 08 October 2010 17:16:20, Johan Tibell wrote:
On Fri, Oct 8, 2010 at 11:00 AM, Ross Paterson
wrote: As far as I can see, the naming incompatibilities between the 3 packages are the following:
text base bytestring type in text (or equivalent if absent) ---------------------------------------------------------------------- ----- break - breakSubstring Text -> Text -> (Text, Text) breakBy break break (Char -> Bool) -> Text -> (Text, Text) breakEnd - - Text -> Text -> (Text, Text) - - breakEnd (Char -> Bool) -> Text -> (Text, Text) count - - Text -> Text -> Int - - count Char -> Text -> Int find - - Text -> Text -> [(Text, Text)] findBy find find (Char -> Bool) -> Text -> Maybe Char partitionBy partition - (Char -> Bool) -> Text -> (Text, Text) replicate - - Int -> Text -> Text - replicate replicate Int -> Char -> Text spanBy span span (Char -> Bool) -> Text -> (Text, Text) split - - Text -> Text -> [Text] - - split Char -> Text -> [Text] splitBy - splitWith (Char -> Bool) -> Text -> [Text] unfoldrN - - Int -> (a -> Maybe (Char, a)) -> a -> Text - - unfoldrN Int -> (a -> Maybe (Char, a)) -> a -> (Text, Maybe a) zipWith zipWith - (Char -> Char -> Char) -> Text -> Text -> Text - zipWith zipWith (Char -> Char -> a) -> Text -> Text -> [a]
* The -By suffix has been used for predicate versions in 5 cases here, but not for filter and findIndex. * The find function has no connection with findBy. It ought to have a name that is the plural of the name of the break function.
Given text's focus on subsequences rather than single elements these differences make sense to me after a quick first scan.
Sure. But at least with break and replicate there's the problem that the functions on lists are used a lot, so text's naming violates the principle of least surprise. (Not sure whether find is used a lot.)
So what's worse, having the good names for substring oriented things in text and surprise unsuspecting users when "break (== '\n') text" doesn't compile or having more cumbersome names for the substring oriented functions?
I think having break and replicate corresponding to the Prelude types is preferable, although within the text package the current names are darned nice.
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

On Fri, Oct 08, 2010 at 07:33:54PM +0200, Daniel Peebles wrote:
I'm not sure I agree with the principle of least surprise here. Dealing correctly with text is hard and the text package presents a nice API for it. Surprise is good, here, I think: many developers out there don't even realize ASCII isn't good enough for text, and even the unicode-aware might not realize that indexing into text or case-insensitive matching are non-trivial operations. Forcing them to fix their code (through compilation errors) seems like a good idea, to me.
A new module, Data.Text.ListCrutch, could contain an exact duplicate of the Data.List functions with warnings attached to them warning (through deprecated pragmas?) users of the library that text is not simply a list of characters, and they should stop treating it as such.
I can see that as an argument for toLower and toUpper, but not the functions listed above, which all present a conceptual model as a sequence of Chars, just like all the functions that do have types that match the lists ones. It's not clear why it's useful to have filter with the same signature as lists and partitionBy different.

On Fri, Oct 8, 2010 at 8:00 AM, Ross Paterson
text base bytestring type in text (or equivalent if absent) --------------------------------------------------------------------------- break - breakSubstring Text -> Text -> (Text, Text) breakBy break break (Char -> Bool) -> Text -> (Text, Text) breakEnd - - Text -> Text -> (Text, Text) - - breakEnd (Char -> Bool) -> Text -> (Text, Text) count - - Text -> Text -> Int - - count Char -> Text -> Int find - - Text -> Text -> [(Text, Text)] findBy find find (Char -> Bool) -> Text -> Maybe Char partitionBy partition - (Char -> Bool) -> Text -> (Text, Text) replicate - - Int -> Text -> Text - replicate replicate Int -> Char -> Text spanBy span span (Char -> Bool) -> Text -> (Text, Text) split - - Text -> Text -> [Text] - - split Char -> Text -> [Text] splitBy - splitWith (Char -> Bool) -> Text -> [Text] unfoldrN - - Int -> (a -> Maybe (Char, a)) -> a -> Text - - unfoldrN Int -> (a -> Maybe (Char, a)) -> a -> (Text, Maybe a) zipWith zipWith - (Char -> Char -> Char) -> Text -> Text -> Text - zipWith zipWith (Char -> Char -> a) -> Text -> Text -> [a]
* The -By suffix has been used for predicate versions in 5 cases here, but not for filter and findIndex.
In the case of filter, that's because a filter function that didn't accept a predicate wouldn't be useful. I added findIndex for completeness back when that was the tack I was taking, but it arguably shouldn't be present at all, since it's subsumed by find. * The find function has no connection with findBy. It ought to have a
name that is the plural of the name of the break function.
Again, this is present for completeness, and makes little obvious sense to retain.

On Sat, Oct 9, 2010 at 9:13 PM, Bryan O'Sullivan
Again, this is present for completeness, and makes little obvious sense to retain.
Oh, and I do realise that a response involving deviating further from the list API by eliminating peculiar functions probably wasn't what you were hoping for, Ross :-)

I have myself in the past implemented a Text-like library as a replacement for the standard list-of-char representation. (Before you ask, it is not publically releasable.) The basic decision I took there was that the Char type does not exist - the only thing available is Text. Another way of saying this, is that Char is simply the subset of all the Texts of size 1. I am not suggesting that the Prelude or ByteString should take this view, but if you proceed to look at the type signatures of the Data.Text package on the basis that Char/Text are the "same" thing, then it may become clearer how to resolve the apparent name/type clashes below. Executive summary: most of the names/signatures turn out to be equivalent, with only a couple highlighted as significantly odd or worth changing. import Prelude hiding (Char) type Char = Text -- At least we can pretend for a while. > text base bytestring type in text (or equivalent if > absent) > --------------------------------------------------------------------------- > break - breakSubstring Text -> Text -> (Text, Text) > breakBy break break (Char -> Bool) -> Text -> > (Text, Text) break break break (Text -> Bool) -> Text -> (Text, Text) The breakSubstring functionality is semantically: breakSubstring x = break (==x) although there may be a more efficient implementation. Proposal: rename Text.break to Text.breakSubstring, and Text.breakBy to Text.break. > breakEnd - - Text -> Text -> (Text, Text) > - - breakEnd (Char -> Bool) -> Text -> > (Text, Text) breakEnd - breakEnd (Text -> Bool) -> Text -> (Text, Text) Proposal: slightly generalise the type of Text.breakEnd. > count - - Text -> Text -> Int > - - count Char -> Text -> Int count - count Text -> Text -> Int Proposal: these are equivalent, no action. > find - - Text -> Text -> [(Text, Text)] I'm afraid that, from the signature alone, I cannot guess what this function does. > findBy find find (Char -> Bool) -> Text -> Maybe > Char find find find (Text -> Bool) -> Text -> Maybe Text Proposal: rename Text.findBy to Text.find. > partitionBy partition - (Char -> Bool) -> Text -> > (Text, Text) partition partition - (Text -> Bool) -> Text -> (Text, Text) Proposal: these are equivalent, no action. > replicate - - Int -> Text -> Text > - replicate replicate Int -> Char -> Text replicate replicate replicate Int -> Text -> Text Proposal: these are equivalent, no action. > spanBy span span (Char -> Bool) -> Text -> > (Text, Text) span span span (Text -> Bool) -> Text -> (Text, Text) Proposal: rename Text.spanBy to Text.span. > split - - Text -> Text -> [Text] > - - split Char -> Text -> [Text] split - split Text -> Text -> [Text] Proposal: these are equivalent, no action, > splitBy - splitWith (Char -> Bool) -> Text -> [Text] splitWith - splitWith (Text -> Bool) -> Text -> [Text] Proposal: these are equivalent, no action. > unfoldrN - - Int -> (a -> Maybe (Char, a)) - > > a -> Text > - - unfoldrN Int -> (a -> Maybe (Char, a)) - > > a -> (Text, Maybe a) unfoldrN - unfoldrN Int -> (a -> Maybe (Text, a)) - > a -> (Text, Maybe a) Proposal: slightly generalise the return type of Text.unfoldrN. > zipWith zipWith - (Char -> Char -> Char) -> Text - > > Text -> Text > - zipWith zipWith (Char -> Char -> a) -> Text -> > Text -> [a] zipWith zipWith zipWith (Text -> Text -> Text) -> Text -> Text -> Text Proposal: This is just a specialised version of the standard zipWith. No action. The only extra function required is to lift the ordinary Char type to become a Text: Proposal: add char :: Prelude.Char -> Text if it does not already exist. I hope this is a useful contribution, if only to spark other ideas for how to resolve the impasse. Regards, Malcolm

On Mon, Oct 11, 2010 at 09:12:33PM +0100, Malcolm Wallace wrote:
of the Data.Text package on the basis that Char/Text are the "same" thing, then it may become clearer how to resolve the apparent name/type clashes below.
But this doesn't allow the multi-Char functions to be added to byetstring/List with the same name. For example:
split - - Text -> Text -> [Text] - - split Char -> Text -> [Text]
split - split Text -> Text -> [Text]
Proposal: these are equivalent, no action,
I can do split "\r\n" or split ", " etc with a text, but not with a bytestring, and if we agree to add that functionality to bytestring then we'll have to call it splitString or splitMany or some other, different, name. Thanks Ian

On October 11, 2010 16:12:33 Malcolm Wallace wrote:
The basic decision I took there was that the Char type does not exist - the only thing available> is Text. Another way of saying this, is that Char is simply the subset of all the Texts of size 1.
I like this. I don't know what issues it gives, but it seems quite clean. Cheers! -Tyson

On Mon, Oct 11, 2010 at 1:12 PM, Malcolm Wallace
I have myself in the past implemented a Text-like library as a replacement for the standard list-of-char representation. (Before you ask, it is not publically releasable.) The basic decision I took there was that the Char type does not exist - the only thing available is Text. Another way of saying this, is that Char is simply the subset of all the Texts of size 1. I am not suggesting that the Prelude or ByteString should take this view, but if you proceed to look at the type signatures of the Data.Text package on the basis that Char/Text are the "same" thing, then it may become clearer how to resolve the apparent name/type clashes below.
Thanks, Malcolm. I do like this principle, and I'm comfortable with its implied further divergence from list/bytestring. The breakSubstring functionality is semantically:
breakSubstring x = break (==x) although there may be a more efficient implementation. Proposal: rename Text.break to Text.breakSubstring, and Text.breakBy to Text.break.
So far, I've been proceeding on the basis that I'd like naming to be consistent and descriptive, and to have more commonly used functions get shorter names than their less commonly used (but possibly more general) cousins. For instance, breakSubstring is descriptive, and it's consistent with bytestring, but it's much longer than break, even though breaking on a fixed string is more common. In this case, length and frequency of use trump the other considerations in my mind.
breakEnd - - Text -> Text -> (Text, Text)
- - breakEnd (Char -> Bool) -> Text -> (Text, Text)
breakEnd - breakEnd (Text -> Bool) -> Text -> (Text, Text)
Proposal: slightly generalise the type of Text.breakEnd.
I think that these generalisations are good ideas. Can someone else please weigh in, preferably the original authors of the H-P inclusion draft (Don, Johan)? findBy find find (Char -> Bool) -> Text -> Maybe Char
find find find (Text -> Bool) -> Text -> Maybe Text
Proposal: rename Text.findBy to Text.find.
I agree with the change of type, not so much with the naming. That's roughly my stance in the other cases, too.
I hope this is a useful contribution, if only to spark other ideas for how to resolve the impasse.
Definitely, thank you.

On 17/10/2010, at 21:25, Bryan O'Sullivan wrote:
On Mon, Oct 11, 2010 at 1:12 PM, Malcolm Wallace
wrote: The breakSubstring functionality is semantically: breakSubstring x = break (==x) although there may be a more efficient implementation. Proposal: rename Text.break to Text.breakSubstring, and Text.breakBy to Text.break.
So far, I've been proceeding on the basis that I'd like naming to be consistent and descriptive, and to have more commonly used functions get shorter names than their less commonly used (but possibly more general) cousins. For instance, breakSubstring is descriptive, and it's consistent with bytestring, but it's much longer than break, even though breaking on a fixed string is more common. In this case, length and frequency of use trump the other considerations in my mind.
FWIW, I take almost exactly the opposite approach with vector. I try to follow the list/array interface as closely as possible even in the presence of more frequently used but subtly different operations. My rationale is that typing a few extra characters is vastly preferable to having to search through the docs to find out what this particular library calls this particular function. I also think breakSubstring would be an unfortunate name because (a) it's too long and (b) other collection-oriented libraries can't use this name for this operation if they don't operate on strings. How about breakSub or breakOn or something like that. Roman

On Sunday 17 October 2010 22:56:04, Roman Leshchinskiy wrote:
On 17/10/2010, at 21:25, Bryan O'Sullivan wrote:
On Mon, Oct 11, 2010 at 1:12 PM, Malcolm Wallace
wrote: The breakSubstring functionality is semantically: breakSubstring x = break (==x) although there may be a more efficient implementation. Proposal: rename Text.break to Text.breakSubstring, and Text.breakBy to Text.break.
So far, I've been proceeding on the basis that I'd like naming to be consistent and descriptive, and to have more commonly used functions get shorter names than their less commonly used (but possibly more general) cousins. For instance, breakSubstring is descriptive, and it's consistent with bytestring, but it's much longer than break, even though breaking on a fixed string is more common. In this case, length and frequency of use trump the other considerations in my mind.
FWIW, I take almost exactly the opposite approach with vector. I try to follow the list/array interface as closely as possible even in the presence of more frequently used but subtly different operations. My rationale is that typing a few extra characters is vastly preferable to having to search through the docs to find out what this particular library calls this particular function.
Good reasons for both approaches. Altogether, I find the reasons for following existing examples stronger, but not so much as to try hard to make text pursue that approach.
I also think breakSubstring would be an unfortunate name because (a) it's too long and (b) other collection-oriented libraries can't use this name for this operation if they don't operate on strings. How about breakSub or breakOn or something like that.
I called it breakOn in stringsearch.
Roman

On Sun, Oct 17, 2010 at 10:25 PM, Bryan O'Sullivan
On Mon, Oct 11, 2010 at 1:12 PM, Malcolm Wallace
wrote: Proposal: slightly generalise the type of Text.breakEnd.
I think that these generalisations are good ideas. Can someone else please weigh in, preferably the original authors of the H-P inclusion draft (Don, Johan)?
I don't have a strong opinion on the matter. If people agree this is a good idea I'm for it. Johan

On Sun, Oct 17, 2010 at 01:25:52PM -0700, Bryan O'Sullivan wrote:
cousins. For instance, breakSubstring is descriptive, and it's consistent with bytestring, but it's much longer than break, even though breaking on a fixed string is more common. In this case, length and frequency of use trump the other considerations in my mind.
I grepped the reverse deps of text looking for uses of text's break, to see what it was being used for, but didn't find any uses. It's possible I missed some in the noise, though (e.g. there are quite a lot of calls to Prelude's break; we really need tibbe's tool for this). Thanks Ian

On Sat, Nov 06, 2010 at 03:34:16PM +0000, Ian Lynagh wrote:
On Sun, Oct 17, 2010 at 01:25:52PM -0700, Bryan O'Sullivan wrote:
cousins. For instance, breakSubstring is descriptive, and it's consistent with bytestring, but it's much longer than break, even though breaking on a fixed string is more common. In this case, length and frequency of use trump the other considerations in my mind.
I grepped the reverse deps of text looking for uses of text's break, to see what it was being used for, but didn't find any uses. It's possible I missed some in the noise, though (e.g. there are quite a lot of calls to Prelude's break; we really need tibbe's tool for this).
Duncan asked me about split and find. Again, human error could well be a factor here, but I found 2 uses, both of which could equally well use the Char version: estimators-0.1.4/NLP/Probability/Example/Trigram.hs: where words = ["*S1*", "*S2*"] ++ (T.split " " sentence) ++ ["*E1*", "*E2*"] estimators-0.1.4/NLP/Probability/Example/Trigram.hs: mconcat $ map makeTrigrams $ T.split "." $ T.pack sentences Also interesting is this, from MissingH-1.1.0.3/src/Data/List/Utils.hs: > split "," "foo,bar,,baz," -> ["foo", "bar", "", "baz", ""] > split "ba" ",foo,bar,,baz," -> [",foo,","r,,","z,"] split :: Eq a => [a] -> [a] -> [[a]] I found no uses of find. Thanks Ian

On Mon, Oct 11, 2010 at 1:12 PM, Malcolm Wallace
I have myself in the past implemented a Text-like library as a replacement for the standard list-of-char representation. (Before you ask, it is not publically releasable.) The basic decision I took there was that the Char type does not exist - the only thing available is Text. Another way of saying this, is that Char is simply the subset of all the Texts of size 1. I am not suggesting that the Prelude or ByteString should take this view, but if you proceed to look at the type signatures of the Data.Text package on the basis that Char/Text are the "same" thing, then it may become clearer how to resolve the apparent name/type clashes below.
Executive summary: most of the names/signatures turn out to be equivalent, with only a couple highlighted as significantly odd or worth changing.
import Prelude hiding (Char) type Char = Text -- At least we can pretend for a while.
text base bytestring type in text (or equivalent if absent)
--------------------------------------------------------------------------- break - breakSubstring Text -> Text -> (Text, Text) breakBy break break (Char -> Bool) -> Text -> (Text, Text)
break break break (Text -> Bool) -> Text -> (Text, Text)
The breakSubstring functionality is semantically: breakSubstring x = break (==x) although there may be a more efficient implementation. Proposal: rename Text.break to Text.breakSubstring, and Text.breakBy to Text.break.
breakEnd - - Text -> Text -> (Text, Text) - - breakEnd (Char -> Bool) -> Text -> (Text, Text)
breakEnd - breakEnd (Text -> Bool) -> Text -> (Text, Text)
Proposal: slightly generalise the type of Text.breakEnd.
count - - Text -> Text -> Int - - count Char -> Text -> Int
count - count Text -> Text -> Int
Proposal: these are equivalent, no action.
find - - Text -> Text -> [(Text, Text)]
I'm afraid that, from the signature alone, I cannot guess what this function does.
findBy find find (Char -> Bool) -> Text -> Maybe Char
find find find (Text -> Bool) -> Text -> Maybe Text
Proposal: rename Text.findBy to Text.find.
partitionBy partition - (Char -> Bool) -> Text -> (Text, Text)
partition partition - (Text -> Bool) -> Text -> (Text, Text)
Proposal: these are equivalent, no action.
replicate - - Int -> Text -> Text - replicate replicate Int -> Char -> Text
replicate replicate replicate Int -> Text -> Text
Proposal: these are equivalent, no action.
spanBy span span (Char -> Bool) -> Text -> (Text, Text)
span span span (Text -> Bool) -> Text -> (Text, Text)
Proposal: rename Text.spanBy to Text.span.
split - - Text -> Text -> [Text] - - split Char -> Text -> [Text]
split - split Text -> Text -> [Text]
Proposal: these are equivalent, no action,
splitBy - splitWith (Char -> Bool) -> Text -> [Text]
splitWith - splitWith (Text -> Bool) -> Text -> [Text]
Proposal: these are equivalent, no action.
unfoldrN - - Int -> (a -> Maybe (Char, a)) -> a -> Text - - unfoldrN Int -> (a -> Maybe (Char, a)) -> a -> (Text, Maybe a)
unfoldrN - unfoldrN Int -> (a -> Maybe (Text, a)) -> a -> (Text, Maybe a)
Proposal: slightly generalise the return type of Text.unfoldrN.
zipWith zipWith - (Char -> Char -> Char) -> Text -> Text -> Text - zipWith zipWith (Char -> Char -> a) -> Text -> Text -> [a]
zipWith zipWith zipWith (Text -> Text -> Text) -> Text -> Text -> Text
Proposal: This is just a specialised version of the standard zipWith. No action.
The only extra function required is to lift the ordinary Char type to become a Text:
Proposal: add char :: Prelude.Char -> Text if it does not already exist.
I hope this is a useful contribution, if only to spark other ideas for how to resolve the impasse.
Regards, Malcolm _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
I've found that the "split" library (by Brent Yorgey) offers a very nice API for selecting between all of the different types of splitting that may be necessary. Could that API perhaps be adapted for use with ByteString and Text, giving a unified splitting interface for all of the sequences in the Haskell Platform? Alex

On Mon, Oct 4, 2010 at 10:04 AM, Ian Lynagh
Yup, that fixes that case, but this one still segfaults (on a 32bit platform):
Ah yes. I was doing the wrong kind of overflow check, so I got lucky with the first fix on your initial test case. There were a few other Int overflow candidates that I also fixed along the way. What needs a deeper look is the handling of arithmetic in the internal module Data.Text.Internal.Size. On a 32-bit machine, it should be fairly easy to overflow those size calculations, but there are cases where you can overflow and at least in theory still have a safe result. For instance: take 3 (replicate 10 (replicate (maxBound `div` 2) "a")) I'd be fine with this either throwing an error or returning "aaa", but right now its behaviour is not likely to be so friendly. The extra checks needed for overflow detection shouldn't have a noticeable performance impact: there's an expensive div-test-and-branch required, but the calculation should only occur once per fused loop (not per iteration).

I see that the HP proposal to add the 'text' package is listed as a 'defect': http://trac.haskell.org/haskell-platform/ticket/145 I think it's better to set its type to 'enhancement'. Regards, Bas

Do we have a collective opinion about revision control or bug tracking for Platform libraries? I'm about done using darcshttp://www.serpentine.com/blog/2010/10/10/why-i-dont-use-darcs-any-more-much...for anything where I don't need to. I've been thinking about publishing the repo, and hosting bugs, on github instead. Any reason not to?

On 10 October 2010 16:24, Bryan O'Sullivan
Do we have a collective opinion about revision control or bug tracking for Platform libraries? I'm about done using darcs for anything where I don't need to. I've been thinking about publishing the repo, and hosting bugs, on github instead. Any reason not to?
I personally don't like sites such as github, gitorious, patch-tag, etc. for one overriding reason: projects are instantly oriented around the lead developer. This is all fine and good if you are the only developer. However, what happens if (shock, horror!) you decide you want to stop using Haskell? The location of the main source repository is now changed completely (this also happens when someone switches VCS or location of their main repo, and can get annoying when you don't realise this and start writing patches against a year old snapshot when HEAD is completely different). Obviously for some projects people already have them in personal hosting site. However, the nice thing about using code.haskell.org for your Haskell projects is that the main focus of the project is that it's for _Haskell_ and not on the main developer. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On 10/10/10 01:52, Ivan Lazar Miljenovic wrote:
On 10 October 2010 16:24, Bryan O'Sullivan
wrote: Do we have a collective opinion about revision control or bug tracking for Platform libraries? I'm about done using darcs for anything where I don't need to. I've been thinking about publishing the repo, and hosting bugs, on github instead. Any reason not to?
I personally don't like sites such as github, gitorious, patch-tag, etc. for one overriding reason: projects are instantly oriented around the lead developer.
I have been told that while GitHub operates as you describe, with URLs based around a person, that Gitorious deliberately does not and instead has URLs for a *project*.

On 11 October 2010 08:07, Isaac Dupree
On 10/10/10 01:52, Ivan Lazar Miljenovic wrote:
On 10 October 2010 16:24, Bryan O'Sullivan
wrote: Do we have a collective opinion about revision control or bug tracking for Platform libraries? I'm about done using darcs for anything where I don't need to. I've been thinking about publishing the repo, and hosting bugs, on github instead. Any reason not to?
I personally don't like sites such as github, gitorious, patch-tag, etc. for one overriding reason: projects are instantly oriented around the lead developer.
I have been told that while GitHub operates as you describe, with URLs based around a person, that Gitorious deliberately does not and instead has URLs for a *project*.
It doesn't? Cool... (So if I ever decide to switch to git Gitorious is now looking more and more attractive to me than github.) -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Sun, Oct 10, 2010 at 04:52:38PM +1100, Ivan Lazar Miljenovic wrote:
On 10 October 2010 16:24, Bryan O'Sullivan
wrote: Do we have a collective opinion about revision control or bug tracking for Platform libraries? I'm about done using darcs for anything where I don't need to. I've been thinking about publishing the repo, and hosting bugs, on github instead. Any reason not to?
I personally don't like sites such as github, gitorious, patch-tag, etc. for one overriding reason: projects are instantly oriented around the lead developer.
You can create organisations on github nowadays. for example, you could create a haskell-platform organisation, with some users having admin rights over the whole organisation. Then you have the per repository level rights, where you can set who are maintainers. Personally, i don't see the lead developer oriented view as a problem; it's not that hard to update couple of links (cabal homepage/repository link on hackage for example) to point to a new lead developer, an organisation, a project page, or another site altogether. -- Vincent

On 11 October 2010 09:06, Vincent Hanquez
Personally, i don't see the lead developer oriented view as a problem; it's not that hard to update couple of links (cabal homepage/repository link on hackage for example) to point to a new lead developer, an organisation, a project page, or another site altogether.
... except we can't edit .cabal files that are already there with links to the wrong master repository. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Mon, Oct 11, 2010 at 09:31:37AM +1100, Ivan Lazar Miljenovic wrote:
On 11 October 2010 09:06, Vincent Hanquez
wrote: Personally, i don't see the lead developer oriented view as a problem; it's not that hard to update couple of links (cabal homepage/repository link on hackage for example) to point to a new lead developer, an organisation, a project page, or another site altogether.
... except we can't edit .cabal files that are already there with links to the wrong master repository.
Isn't that exactly the same when you have a new maintainer. surely this involve uploading a new cabal file with the new maintainer stated in there. Also you can't edit already uploaded hackage cabal files, you still can (should ?) always make a point release to advertise to the world that you're the new maintainer and the fact the repo url has changed. But again, if there's strong feeling about the personal view vs the project view, again, there's the way to solve that problem on github too. -- Vincent

-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 10/10/10 18:06 , Vincent Hanquez wrote:
Personally, i don't see the lead developer oriented view as a problem; it's not that hard to update couple of links (cabal homepage/repository link on hackage for example) to point to a new lead developer, an organisation, a project page, or another site altogether.
How about people's existing checkouts? - -- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.10 (Darwin) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ iEYEARECAAYFAkyyRgkACgkQIn7hlCsL25VD4ACdGwpcZnsnALmK/0ZgM3KaT5pE emAAn2suBoI4dBhZrFb22/n1H/J0NYjS =R2UM -----END PGP SIGNATURE-----

On Sun, Oct 10, 2010 at 07:02:33PM -0400, Brandon S Allbery KF8NH wrote:
How about people's existing checkouts?
I believe that's the same if a maintainer or a project change the place where it host his source repo. To my mind, It's only affecting developers and if they are involved they should probably know that they need to update their source URL; I expect that normal users are probably using cabal to get their stuff and would be unaffected by any of these. -- Vincent

Ivan Lazar Miljenovic
On 10 October 2010 16:24, Bryan O'Sullivan
wrote: Do we have a collective opinion about revision control or bug tracking for Platform libraries? I'm about done using darcs for anything where I don't need to. I've been thinking about publishing the repo, and hosting bugs, on github instead. Any reason not to?
I personally don't like sites such as github, gitorious, patch-tag, etc. for one overriding reason: projects are instantly oriented around the lead developer.
A counterexample: http://github.com/snapframework/snap-server
And as someone else pointed out, github has "organization" accounts now.
G
--
Gregory Collins

On Mon, Oct 11, 2010 at 4:38 PM, Gregory Collins
Ivan Lazar Miljenovic
writes: I personally don't like sites such as github, gitorious, patch-tag, etc. for one overriding reason: projects are instantly oriented around the lead developer.
A counterexample: http://github.com/snapframework/snap-server
And as someone else pointed out, github has "organization" accounts now.
Here's another example. The Clojure programming language is hosted on GitHub as an organization. http://github.com/clojure I'd like to merge network-bytestring into network, but since network is hosted in darcs that would mean having to switch to darcs in development flow which I prefer not to. I could just move network to GitHub, but I'd feel a bit bad about doing that since it's a community maintained library. If we had a Haskell organization on GitHub the package wouldn't have to be hosted under my name and other people could have commit access as well. P.S. I saw that text recently (with the latest release) moved off code.haskell.org to GitHub/BitBucket. Johan

I just created a haskell organization on github. Just email me with your
github username and I'll add you to it.
On Fri, Oct 22, 2010 at 4:52 AM, Johan Tibell
On Mon, Oct 11, 2010 at 4:38 PM, Gregory Collins
wrote: Ivan Lazar Miljenovic
writes: I personally don't like sites such as github, gitorious, patch-tag, etc. for one overriding reason: projects are instantly oriented around the lead developer.
A counterexample: http://github.com/snapframework/snap-server
And as someone else pointed out, github has "organization" accounts now.
Here's another example. The Clojure programming language is hosted on GitHub as an organization.
I'd like to merge network-bytestring into network, but since network is hosted in darcs that would mean having to switch to darcs in development flow which I prefer not to.
I could just move network to GitHub, but I'd feel a bit bad about doing that since it's a community maintained library. If we had a Haskell organization on GitHub the package wouldn't have to be hosted under my name and other people could have commit access as well.
P.S. I saw that text recently (with the latest release) moved off code.haskell.org to GitHub/BitBucket.
Johan _______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

bos:
Do we have a collective opinion about revision control or bug tracking for Platform libraries? I'm about done using darcs for anything where I don't need to. I've been thinking about publishing the repo, and hosting bugs, on github instead. Any reason not to?
People can use whatever they want, in my view. -- Don
participants (25)
-
Alexander Dunlap
-
Bas van Dijk
-
Brandon S Allbery KF8NH
-
Bryan O'Sullivan
-
Bulat Ziganshin
-
Daniel Fischer
-
Daniel Peebles
-
Don Stewart
-
Duncan Coutts
-
Edward Kmett
-
Gregory Collins
-
Ian Lynagh
-
Isaac Dupree
-
Ivan Lazar Miljenovic
-
Johan Tibell
-
Krasimir Angelov
-
Malcolm Wallace
-
Max Bolingbroke
-
Roman Leshchinskiy
-
Ross Paterson
-
Sittampalam, Ganesh
-
Sterling Clover
-
Tyson Whitehead
-
Vincent Hanquez
-
wren ng thornton