Re: [Haskell-cafe] Re: String vs ByteString

I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
In the first iteration of the Text package, UTF-16 was chosen because it had a nice balance of arithmetic overhead and space. The arithmetic for UTF-8 started to have serious performance impacts in situations where the entire document was outside ASCII (e.g. a Russian or Arabic document), but UTF-16 was still relatively compact, compared to both the UTF-32 and String alternatives. This, however, obviously does not represent your use case. I don't know if your use case is the more common one (though it seems likely).

The underlying principles of Text should work fine with UTF-8. It has changed a lot since its original writing (thanks to some excellent tuning and maintenance by bos), including some more efficient binary arithmetic. The situation may have changed with respect to the performance limitations of UTF-8, or there may be room for both it and a UTF-16 version. Any takers for implementing a UTF-8 version and comparing the two?
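The space side of the tradeoff is easy to measure directly. A minimal sketch using the standard text and bytestring packages (encodedSizes is an ad-hoc helper for this example, not a library function; the exact byte counts naturally depend on the input):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as E

-- Byte counts for the same text under UTF-8 and UTF-16.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t = (B.length (E.encodeUtf8 t), B.length (E.encodeUtf16LE t))

main :: IO ()
main = do
  print (encodedSizes (T.pack "hello world"))  -- ASCII: (11,22), UTF-8 half the size
  print (encodedSizes (T.pack "привет мир"))   -- Cyrillic: (19,20), nearly even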
A large fraction - probably most - of textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields).
For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, with "real" text making up only a few percent of this. The combined (all languages) Wikipedia is 2G words, probably less than 20GB.
Being agnostic about string encoding - viz. treating it as bytes - works okay, but it would be nice to allow Unicode in the bits that actually are text, like string fields and labels and such.
Is your point that ASCII characters take up the same amount of space (i.e. 16 bits) as higher code points? Do you have any comparisons that quantify how much this affects your ability to process text in real terms? Does it make it too slow? Infeasible memory-wise?

Hello Tom, Tuesday, August 17, 2010, 2:09:09 PM, you wrote:
In the first iteration of the Text package, UTF-16 was chosen because it had a nice balance of arithmetic overhead and space. The arithmetic for UTF-8 started to have serious performance impacts in situations where the entire document was outside ASCII (e.g. a Russian or Arabic document), but UTF-16 was still relatively compact
I don't understand what you mean. Do you support all 2^20 codepoints in the Data.Text package?
--
Best regards,
Bulat
mailto:Bulat.Ziganshin@gmail.com

2010/8/17 Bulat Ziganshin
Hello Tom,
<snip>
I don't understand what you mean. Do you support all 2^20 codepoints in the Data.Text package?
Bulat,

Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode codepoint.
--
Tom

Tom Harper writes:
<snip>
Yes, its internal representation is UTF-16, which is capable of encoding *any* valid Unicode codepoint.
Just like Char is capable of encoding any valid Unicode codepoint.
--
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

Ivan Lazar Miljenovic wrote:
<snip>
Just like Char is capable of encoding any valid Unicode codepoint.
Char is not an encoding, right?

Miguel Mitrofanov wrote:
<snip>
Char is not an encoding, right?
No, but in GHC at least it corresponds to a Unicode codepoint.
--
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

"Ivan" == Ivan Lazar Miljenovic
writes:
Char is not an encoding, right?
Ivan> No, but in GHC at least it corresponds to a Unicode codepoint.

I don't think this is right, or shouldn't be right, anyway. Surely it stands for a character. Unicode codepoints include non-characters, such as the surrogate codepoints used by UTF-16 to map non-BMP codepoints to pairs of 16-bit code units. I don't think you ought to be able to see a surrogate codepoint as a Char.
--
Colin Adams
Preston Lancashire
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments

Colin Paul Adams writes:
Char is not an encoding, right?
Ivan> No, but in GHC at least it corresponds to a Unicode codepoint.
I don't think this is right, or shouldn't be right, anyway.. Surely it stands for a character. Unicode codepoints include non-characters such as the surrogate codepoints used by UTF-16 to map non-BMP codepoints to pairs of 16-bit codepoints.
Prelude> (toEnum 0xD800) :: Char
'\55296'
I don't think you ought to be able to see a surrogate codepoint as a Char.
This is a bit confusing. From the Unicode glossary:

- Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]

- Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 0x10FFFF. (See definition D10 in Section 3.4, Characters and Encoding.) (2) A value, or position, for a character, in any coded character set.
From Wikipedia on UTF-16:
Unicode and ISO/IEC 10646 do not, and will never, assign characters to any of the code points in the U+D800–U+DFFF range, so an individual code unit from a surrogate pair does not ever represent a character.

So: a Char holds a code point, that is, a value from 0 to 0x10FFFF. Some of these values do not correspond to Unicode characters. As far as I can tell, a surrogate pair in UTF-16 is both two (surrogate) code points of two bytes each, as well as a single code point encoded as four bytes. Implementations seem to differ about what the length of a string containing surrogate pairs is.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
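The surrogate-pair arithmetic itself is simple enough to show directly. A minimal sketch of the UTF-16 encoding of a single code point, following the algorithm in the Unicode standard (utf16Units is an ad-hoc name for this example, not a library function):

import Data.Char (ord)
import Data.Word (Word16)

-- One code unit for BMP code points, a surrogate pair above U+FFFF.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (n' `div` 0x400))
                  , fromIntegral (0xDC00 + (n' `mod` 0x400)) ]
  where
    n  = ord c
    n' = n - 0x10000

For example, utf16Units '\x1D11E' (MUSICAL SYMBOL G CLEF) gives [0xD834, 0xDD1E]: one code point, two code units, which is exactly where the length disagreements come from.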

On Tue, Aug 17, 2010 at 12:54, Ivan Lazar Miljenovic <ivan.miljenovic@gmail.com> wrote:
<snip>
Just like Char is capable of encoding any valid Unicode codepoint.
Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it can NOT encode all Unicode points.

-Tako

On Tue, Aug 17, 2010 at 1:05 PM, Bulat Ziganshin wrote:
Hello Tako,
Tuesday, August 17, 2010, 3:03:20 PM, you wrote:
Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it can NOT encode all Unicode points.
it's 32 bits
Like Bulat said, it's 32 bits. It's *defined* as being the Unicode code point number. It has no relation to e.g. char in C.
--
Johan
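This is easy to verify in GHCi:

Prelude> maxBound :: Char
'\1114111'
Prelude> fromEnum (maxBound :: Char) == 0x10FFFF
True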

Tako Schotanus writes:
<snip>
Unless a Char in Haskell is 32 bits (or at least more than 16 bits) it can NOT encode all Unicode points.
http://www.haskell.org/onlinereport/lexemes.html
--
Ivan Lazar Miljenovic
Ivan.Miljenovic@gmail.com
IvanMiljenovic.wordpress.com

On Tue, Aug 17, 2010 at 12:39 PM, Bulat Ziganshin wrote:
Hello Tom,
Tuesday, August 17, 2010, 2:09:09 PM, you wrote:
<snip>
I don't understand what you mean. Do you support all 2^20 codepoints in the Data.Text package?

Yes, UTF-16 can represent all Unicode code points, using surrogate pairs.
-- Johan
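A quick GHCi session (assuming the text package is installed) shows that the surrogate machinery stays below the API: Data.Text measures length in code points, not UTF-16 code units:

Prelude> import qualified Data.Text as T
Prelude T> T.length (T.singleton '\x1D11E')
1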
participants (8)
- Bulat Ziganshin
- Colin Paul Adams
- Ivan Lazar Miljenovic
- Johan Tibell
- Ketil Malde
- Miguel Mitrofanov
- Tako Schotanus
- Tom Harper