
Hi all,

I'm using Tagsoup to strip data out of some rather large XML files. Since the files are large I'm using ByteString, but that leads me to wonder what is the best way to handle clashes between Prelude functions like putStrLn and the ByteString versions?

Anyone have any suggestions for doing this as neatly as possible?

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

Hi Erik,
On Fri, Aug 13, 2010 at 1:32 PM, Erik de Castro Lopo
wrote:
Since the files are large I'm using ByteString, but that leads me to wonder what is the best way to handle clashes between Prelude functions like putStrLn and the ByteString versions?
Anyone have any suggestions for doing this as neatly as possible?
Use qualified imports, like so:

import qualified Data.ByteString as B

main = B.putStrLn $ B.pack "test"

Cheers,
Johan

On Fri, Aug 13, 2010 at 2:42 PM, Johan Tibell wrote:
Hi Erik,
On Fri, Aug 13, 2010 at 1:32 PM, Erik de Castro Lopo
wrote:
Since the files are large I'm using ByteString, but that leads me to wonder what is the best way to handle clashes between Prelude functions like putStrLn and the ByteString versions?
Anyone have any suggestions for doing this as neatly as possible?
Use qualified imports, like so:
import qualified Data.ByteString as B
main = B.putStrLn $ B.pack "test"
If you want to pack a String into a ByteString, you'll need to import
Data.ByteString.Char8 instead. Michael
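[A minimal sketch of the corrected example, assuming the bytestring package. Note that the Char8 variant truncates each Char to 8 bits, so it is only safe for ASCII/latin1 text:]

```haskell
import qualified Data.ByteString.Char8 as B8

-- Char8.pack :: String -> ByteString, truncating each Char to one byte,
-- so this compiles where Data.ByteString's pack (which wants [Word8]) won't.
main :: IO ()
main = B8.putStrLn (B8.pack "test")
```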

On Fri, Aug 13, 2010 at 1:47 PM, Michael Snoyman wrote:
Use qualified imports, like so:
import qualified Data.ByteString as B
main = B.putStrLn $ B.pack "test"
If you want to pack a String into a ByteString, you'll need to import
Data.ByteString.Char8 instead.
Very true. That's what I get for using a random example without testing it first.

-- Johan

Just import the ByteString module qualified. In other words:
import qualified Data.ByteString as S
or for lazy bytestrings:
import qualified Data.ByteString.Lazy as L
Cheers,
Michael
On Fri, Aug 13, 2010 at 2:32 PM, Erik de Castro Lopo
wrote:
Hi all,
I'm using Tagsoup to strip data out of some rather large XML files.
Since the files are large I'm using ByteString, but that leads me to wonder what is the best way to handle clashes between Prelude functions like putStrLn and the ByteString versions?
Anyone have any suggestions for doing this as neatly as possible?
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

Hi,

Why don't you use the Data.Rope library? The asymptotic complexities are way better than those of the ByteString functions.

PE

On 13/08/2010, at 07:32, Erik de Castro Lopo wrote:
Hi all,
I'm using Tagsoup to strip data out of some rather large XML files.
Since the files are large I'm using ByteString, but that leads me to wonder what is the best way to handle clashes between Prelude functions like putStrLn and the ByteString versions?
Anyone have any suggestions for doing this as neatly as possible?
Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

On Fri, Aug 13, 2010 at 4:03 PM, Pierre-Etienne Meunier <pierreetienne.meunier@gmail.com> wrote:
Hi,
Why don't you use the Data.Rope library ? The asymptotic complexities are way better than those of the ByteString functions.
PE
For some operations. I'd expect it to be a constant factor slower on average though. -- Johan

I'm interested to see this kind of open debate on performance,
especially about libraries that provide widely used data structures
such as strings.
One of the more puzzling aspects of Haskell for newbies is the large
number of libraries that appear to provide similar/duplicate
functionality.
The Haskell Platform deals with this to some extent, but it seems to
me that if there are new libraries that appear to provide performance
boosts over more widely used libraries, it would be best if the new
code gets incorporated into the existing more widely used libraries
rather than creating more code to maintain / choose from.
I think that open debate about performance trade-offs could help
consolidate the libraries.
Kevin
On Aug 13, 4:08 pm, Johan Tibell
On Fri, Aug 13, 2010 at 4:03 PM, Pierre-Etienne Meunier <
pierreetienne.meun...@gmail.com> wrote:
Hi,
Why don't you use the Data.Rope library ? The asymptotic complexities are way better than those of the ByteString functions.
PE
For some operations. I'd expect it to be a constant factor slower on average though.
-- Johan

On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine
I'm interested to see this kind of open debate on performance, especially about libraries that provide widely used data structures such as strings.
One of the more puzzling aspects of Haskell for newbies is the large number of libraries that appear to provide similar/duplicate functionality.
The Haskell Platform deals with this to some extent, but it seems to me that if there are new libraries that appear to provide performance boosts over more widely used libraries, it would be best if the new code gets incorporated into the existing more widely used libraries rather than creating more code to maintain / choose from.
I think that open debate about performance trade-offs could help consolidate the libraries.
Kevin
I agree.

Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskellers and should be the fastest and most memory-compact in most cases. There are still a few cases where String beats Text, but they are being worked on as we speak.

Cheers,
Johan

On Fri, Aug 13, 2010 at 4:43 PM, Johan Tibell
On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine
wrote: I'm interested to see this kind of open debate on performance, especially about libraries that provide widely used data structures such as strings.
One of the more puzzling aspects of Haskell for newbies is the large number of libraries that appear to provide similar/duplicate functionality.
The Haskell Platform deals with this to some extent, but it seems to me that if there are new libraries that appear to provide performance boosts over more widely used libraries, it would be best if the new code gets incorporated into the existing more widely used libraries rather than creating more code to maintain / choose from.
I think that open debate about performance trade-offs could help consolidate the libraries.
Kevin
I agree.
Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskellers and should be the fastest and most memory-compact in most cases. There are still a few cases where String beats Text, but they are being worked on as we speak.
How about the case for text which is guaranteed to be in ascii/latin1? ByteString again?
Cheers, Johan
-- Work is punishment for failing to procrastinate effectively.

On Friday 13 August 2010 17:25:58, Gábor Lehel wrote:
How about the case for text which is guaranteed to be in ascii/latin1? ByteString again?
If you can be sure that that won't change anytime soon, definitely. Bonus points if you can write the code so that later changing to e.g. Data.Text requires only a change of imports.
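[One way to earn those bonus points is to funnel every string operation through a single qualified alias, so that switching later mostly means editing the import list. A sketch, assuming the bytestring package; the commented-out Text import is a hypothetical later swap, and any explicit type signatures mentioning S.ByteString would need to change to S.Text at the same time:]

```haskell
-- Swap the string type by editing (mostly) just this import:
import qualified Data.ByteString.Char8 as S
-- import qualified Data.Text as S  -- later, for full Unicode

-- Only functions common to both APIs are used below.
stripCR :: S.ByteString -> S.ByteString
stripCR = S.filter (/= '\r')

main :: IO ()
main = S.putStrLn (stripCR (S.pack "hello\r"))
```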

2010/8/13 Gábor Lehel
How about the case for text which is guaranteed to be in ascii/latin1? ByteString again?
If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons.

1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 "ß" (eszett, sharp S), you'll get the two-character sequence "SS", which is correct. Using Char8.map Char.toUpper here gives the wrong answer.
2. In many cases, the API is easier to use, because it's oriented towards using text data, instead of being a port of the list API.
3. Some commonly used functions, such as substring searching, are *way* faster than their ByteString counterparts.
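[The toUpper difference is easy to see in a small program; a sketch, relying on the text package's full Unicode case mapping versus Data.Char's one-to-one mapping:]

```haskell
import qualified Data.ByteString.Char8 as B8
import qualified Data.Text as T
import Data.Char (toUpper)

main :: IO ()
main = do
  -- Full case mapping: one Char can uppercase to two.
  print (T.toUpper (T.pack "straße"))          -- "STRASSE"
  -- Char-by-Char mapping cannot do that: Data.Char.toUpper 'ß' == 'ß',
  -- so the eszett is silently left unchanged.
  print (B8.map toUpper (B8.pack "stra\223e"))
```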

2010/8/13 Bryan O'Sullivan
2010/8/13 Gábor Lehel
How about the case for text which is guaranteed to be in ascii/latin1?
ByteString again?
If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons.
1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 "ß" (eszett, sharp S), you'll get the two-character sequence "SS", which is correct. Using Char8.map Char.toUpper here gives the wrong answer.
2. In many cases, the API is easier to use, because it's oriented towards using text data, instead of being a port of the list API.
3. Some commonly used functions, such as substring searching, are *way* faster than their ByteString counterparts.
These are all good reasons. An even more important reason is type safety:
A function that receives a Text argument has the guarantee that the input is valid Unicode. A function that receives a ByteString doesn't have that guarantee, and if validity is important the function must perform a validity check before operating on the data. If the function does not validate the input, it might crash or, even worse, write invalid data to disk or some other data store, corrupting the application data. This is a bit of a subtle point that you really only see once systems get large. Even though you might pay for the conversion from ByteString to Text, you might make up for that by avoiding several validity checks down the road.

Cheers,
Johan
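[In code, the "validate once at the boundary" idea might look like this sketch. decodeUtf8' is from Data.Text.Encoding in later versions of the text package and returns an Either instead of throwing; parseName is a made-up example function:]

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- Validate raw bytes once, at the boundary. Everything downstream
-- that takes a Text can rely on it being well-formed Unicode.
parseName :: B.ByteString -> Either String T.Text
parseName bytes = case decodeUtf8' bytes of
  Left err -> Left ("invalid UTF-8: " ++ show err)
  Right t  -> Right (T.strip t)

main :: IO ()
main = print (parseName (B.pack "  Erik  "))
```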

On Friday 13 August 2010 17:57:36, Bryan O'Sullivan wrote:
3. Some commonly used functions, such as substring searching, are *way*faster than their ByteString counterparts.
That's an unfortunate example. Using the stringsearch package, substring searching in ByteStrings was considerably faster than in Data.Text in my tests. Replacing substrings blew Data.Text to pieces even, with a factor of 10-65 between ByteString and Text (and much smaller memory footprint).

stringsearch (Data.ByteString.Lazy.Search):

$ ./bmLazy +RTS -s -RTS ../../bigfile Gutenberg Hutzenzwerg > /dev/null
./bmLazy ../../bigfile Gutenberg Hutzenzwerg +RTS -s
      92,045,816 bytes allocated in the heap
          31,908 bytes copied during GC
         103,368 bytes maximum residency (1 sample(s))
          39,992 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   158 collections,     0 parallel,  0.01s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.07s  (  0.17s elapsed)
  GC    time    0.01s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.08s  (  0.17s elapsed)

  %GC time      10.5%  (2.1% elapsed)

  Alloc rate    1,353,535,321 bytes per MUT second

  Productivity  89.5% of total user, 40.1% of total elapsed

Data.Text.Lazy:

$ ./textLazy +RTS -s -RTS ../../bigfile Gutenberg Hutzenzwerg > /dev/null
./textLazy ../../bigfile Gutenberg Hutzenzwerg +RTS -s
   4,916,133,652 bytes allocated in the heap
       6,721,496 bytes copied during GC
      12,961,776 bytes maximum residency (58 sample(s))
      12,788,968 bytes maximum slop
              39 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:  8774 collections,     0 parallel,  0.70s,  0.73s elapsed
  Generation 1:    58 collections,     0 parallel,  0.03s,  0.03s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    9.87s  ( 10.23s elapsed)
  GC    time    0.73s  (  0.75s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.60s  ( 10.99s elapsed)

  %GC time       6.9%  (6.9% elapsed)

  Alloc rate    497,956,181 bytes per MUT second

bigfile is a ~75M file.

The point of the more adequate API for text manipulation stands, of course.

Cheers,
Daniel

This back and forth on performance is great!
I often see ByteString used where Text is theoretically more
appropriate (eg. the Snap web framework) and it would be good to get
these performance issues ironed out so people feel more comfortable
using the right tool for the job based upon API rather than
performance.
Many other languages have two major formats for strings (binary and
text) and it would be great if performance improvements for ByteString
and Text allowed the same kind of convergence for Haskell.
Kevin
On Aug 13, 7:53 pm, "Bryan O'Sullivan"
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer
wrote: That's an unfortunate example. Using the stringsearch package, substring searching in ByteStrings was considerably faster than in Data.Text in my tests.
Interesting. Got a test case so I can repro and fix? :-)

On Friday 13 August 2010 19:53:37, Bryan O'Sullivan wrote:
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer
wrote: That's an unfortunate example. Using the stringsearch package, substring searching in ByteStrings was considerably faster than in Data.Text in my tests.
Interesting. Got a test case so I can repro and fix? :-)
Sure, use http://norvig.com/big.txt (~6.2M), cat it together a few times to test on larger files.

ByteString code (bmLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = C.pack pat
        work = indices spat
    L.readFile file >>= print . length . work
----------------------------------------------------------------

Data.Text.Lazy (textLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import Data.Text.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = T.pack pat
        work = indices spat
    TIO.readFile file >>= print . length . work
----------------------------------------------------------------

(Data.Text.Lazy.Search is of course not exposed by default ;), I use text-0.7.2.1)

Some local timings:

1. real words in a real text file:

$ time ./textLazy big.txt the
92805
0.59user 0.00system 0:00.61elapsed 97%CPU
$ time ./bmLazy big.txt the
92805
0.02user 0.01system 0:00.04elapsed 104%CPU
$ time ./textLazy big.txt and
43587
0.56user 0.01system 0:00.58elapsed 100%CPU
$ time ./bmLazy big.txt and
43587
0.02user 0.01system 0:00.03elapsed 88%CPU
$ time ./textLazy big.txt mother
317
0.44user 0.01system 0:00.46elapsed 99%CPU
$ time ./bmLazy big.txt mother
317
0.00user 0.01system 0:00.02elapsed 69%CPU
$ time ./textLazy big.txt deteriorate
2
0.37user 0.00system 0:00.38elapsed 98%CPU
$ time ./bmLazy big.txt deteriorate
2
0.01user 0.01system 0:00.02elapsed 114%CPU
$ time ./textLazy big.txt "Project Gutenberg"
177
0.37user 0.00system 0:00.38elapsed 97%CPU
$ time ./bmLazy big.txt "Project Gutenberg"
177
0.00user 0.01system 0:00.01elapsed 100%CPU

2. periodic pattern in a file of 33.4M of aaaaa:

$ time ./bmLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
34999942
1.22user 0.04system 0:01.30elapsed 97%CPU
$ time ./textLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
3.07user 0.03system 0:03.14elapsed 98%CPU

Oh, that's closer, but text doesn't find overlapping matches, well, we can do that too (replace indices with nonOverlappingIndices):

$ time ./noBMLazy ../AAA aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
0.18user 0.04system 0:00.23elapsed 97%CPU

Yeah, that's more like it :D

On Friday 13 August 2010 19:53:37 you wrote:
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer
wrote: That's an unfortunate example. Using the stringsearch package, substring searching in ByteStrings was considerably faster than in Data.Text in my tests.
Interesting. Got a test case so I can repro and fix? :-)
Just occurred to me, a lot of the difference is due to the fact that text has to convert a ByteString to Text on reading the file, so I timed that by reading the file and counting the chunks, that took text 0.21s for big.txt vs. Data.ByteString.Lazy's 0.01s. So for searching in-memory strings, subtract about 0.032s/MB from the difference - it's still large.

Surely a lot of real world text processing programs are IO intensive?
So if there is no native Text IO and everything needs to be read in /
written out as ByteString data converted to/from Text this strikes me
as a major performance sink.
Or is there native Text IO but just not in your example?
Kevin
On Aug 13, 8:57 pm, Daniel Fischer
Just occurred to me, a lot of the difference is due to the fact that text has to convert a ByteString to Text on reading the file, so I timed that by reading the file and counting the chunks, that took text 0.21s for big.txt vs. Data.ByteString.Lazy's 0.01s. So for searching in-memory strings, subtract about 0.032s/MB from the difference - it's still large.

On Friday 13 August 2010 21:32:12, Kevin Jardine wrote:
Surely a lot of real world text processing programs are IO intensive? So if there is no native Text IO and everything needs to be read in / written out as ByteString data converted to/from Text this strikes me as a major performance sink.
Or is there native Text IO but just not in your example?
Outdated information, sorry. Up to ghc-6.10, text's IO was via ByteString, it's no longer so. However, the native Text IO is (of course) much slower than ByteString IO due to the need of en/decoding.
Kevin
On Aug 13, 8:57 pm, Daniel Fischer
wrote: Just occurred to me, a lot of the difference is due to the fact that text has to convert a ByteString to Text on reading the file, so I timed that by reading the file and counting the chunks, that took text 0.21s for big.txt vs. Data.ByteString.Lazy's 0.01s. So for searching in-memory strings, subtract about 0.032s/MB from the difference - it's still large.

On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer
That's an unfortunate example. Using the stringsearch package, substring searching in ByteStrings was considerably faster than in Data.Text in my tests.
Daniel, thanks again for bringing up this example! It turned out that quite a lot of the difference in performance was due to an inadvertent space leak in the text search code. With a single added bang pattern, the execution time and space usage both improved markedly. There is of course still lots of room for improvement, but having test cases like this helps immensely.

* Bryan O'Sullivan:
If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons.
1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 "ß" (eszett, sharp S), you'll get the two-character sequence "SS", which is correct. Using Char8.map Char.toUpper here gives the wrong answer.
Data.Text is still incorrect for some scripts:

$ LANG=tr_TR.UTF-8 ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> import Data.Text
Prelude Data.Text> toUpper $ pack "i"
Loading package array-0.3.0.0 ... linking ... done.
Loading package containers-0.3.0.0 ... linking ... done.
Loading package deepseq-1.1.0.0 ... linking ... done.
Loading package bytestring-0.9.1.5 ... linking ... done.
Loading package text-0.7.2.1 ... linking ... done.
"I"
Prelude Data.Text>

(In a Turkish locale, uppercasing "i" should give "İ", the dotted capital I.)

On Sat, Aug 14, 2010 at 12:15 PM, Florian Weimer
* Bryan O'Sullivan:
If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons.
1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 "ß" (eszett, sharp S), you'll get the two-character sequence "SS", which is correct. Using Char8.map Char.toUpper here gives the wrong answer.
Data.Text ist still incorrect for some scripts:
$ LANG=tr_TR.UTF-8 ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> import Data.Text
Prelude Data.Text> toUpper $ pack "i"
Loading package array-0.3.0.0 ... linking ... done.
Loading package containers-0.3.0.0 ... linking ... done.
Loading package deepseq-1.1.0.0 ... linking ... done.
Loading package bytestring-0.9.1.5 ... linking ... done.
Loading package text-0.7.2.1 ... linking ... done.
"I"
Prelude Data.Text>
Yes. We need locale support for that one. I think Bryan is planning to add it. -- Johan

Johan Tibell wrote:
Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskelleres and should be the fastest and most memory compact in most cases. There are still a few cases where String beats Text but they are being worked on as we speak.
Which one do you use for strings in HTML or XML, in which UTF-8 has become the commonly accepted standard encoding? It's text, not binary, so I should choose Data.Text. But isn't there a performance penalty for translating from Data.Text's internal 16-bit encoding to UTF-8?

http://tools.ietf.org/html/rfc3629
http://www.utf8.com/

Regards,
Sean

On Friday 13 August 2010 17:27:32, Sean Leather wrote:
Which one do you use for strings in HTML or XML in which UTF-8 has become the commonly accepted standard encoding? It's text, not binary, so I should choose Data.Text. But isn't there a performance penalty for translating from Data.Text's internal 16-bit encoding to UTF-8?
Yes there is. Whether using String, Data.Text or Data.ByteString + Data.ByteString.UTF8 is the best choice depends on what you do. Test and then decide.

Quoth Sean Leather
Which one do you use for strings in HTML or XML in which UTF-8 has become the commonly accepted standard encoding? It's text, not binary, so I should choose Data.Text. But isn't there a performance penalty for translating from Data.Text's internal 16-bit encoding to UTF-8?
Use both? I am not familiar with Text, but UTF-8 is pretty awkward, and I will surely look into Text before wasting any time trying to fine-tune my ByteString handling for UTF-8. But in practice only a fraction of my data input will be manipulated in an encoding-sensitive context. I'm thinking _all_ data is binary, and accordingly all inputs are ByteString; conversion to Text will happen as needed for ... uh, wait, is there a conversion from ByteString to Text? Well, if not, no doubt that's coming.

Donn Cave, donn@avvanta.com
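[There is indeed such a conversion; a sketch using Data.Text.Encoding from the text package. Note that decodeUtf8 throws an exception on invalid input, so code handling untrusted bytes might prefer an error-reporting variant:]

```haskell
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

main :: IO ()
main = do
  let bytes = encodeUtf8 (T.pack "héllo")  -- Text -> UTF-8 ByteString
      text  = decodeUtf8 bytes             -- UTF-8 ByteString -> Text
  print (T.toUpper text)                   -- "HÉLLO"
```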

Sean Leather wrote:
Which one do you use for strings in HTML or XML in which UTF-8 has become the commonly accepted standard encoding?
UTF-8 is only becoming the standard for non-CJK languages. We are told by members of our community in CJK countries that UTF-8 is not widely adopted there, and there is no sign that it ever will be. And one should be aware that the proportion of CJK in global Internet traffic is growing quickly. But of course, that is still a legitimate question for some situations in which full internationalization will not be needed. Regards, Yitz

Yitzchak Gale wrote:
Sean Leather wrote:
Which one do you use for strings in HTML or XML in which UTF-8 has become the commonly accepted standard encoding?
UTF-8 is only becoming the standard for non-CJK languages. We are told by members of our community in CJK countries that UTF-8 is not widely adopted there, and there is no sign that it ever will be. And one should be aware that the proportion of CJK in global Internet traffic is growing quickly.
So then, what is the standard? Being not familiar with this area, I googled a bit, and I don't see a consensus. But I also noticeably don't see UTF-16. So, if this is the case, then a similar question still arises for CJK text: What format/library to use for it (assuming one doesn't want a performance penalty for translating between Data.Text's internal format and the target format)? It appears that there are no ideal answers to such questions. Regards, Sean

Sean Leather wrote:
So then, what is the standard? ...I also noticeably don't see UTF-16.
Right. There are a handful of language-specific 16-bit encodings that are popular, from what I understand.
So, if this is the case, then a similar question still arises for CJK text: What format/library to use for it (assuming one doesn't want a performance penalty for translating between Data.Text's internal format and the target format)? It appears that there are no ideal answers to such questions.
Right. If you know you'll be in a specific encoding - whether UTF-8, Latin1, one of the CJK encodings, or whatever, it might sometimes make sense to skip Data.Text and do the IO as raw bytes using ByteString and then encode/decode manually only when needed. Otherwise, Data.Text is probably the way to go.

On Sat, Aug 14, 2010 at 3:46 PM, Sean Leather
So then, what is the standard?
There isn't one. There are many national standards:

- China: GB-2312, GBK and GB18030
- Taiwan: Big5
- Japan: JIS and Shift-JIS (0208 and 0213 variants) and EUC-JP
- Korea: KS-X-2001, EUC-KR, and ISO-2022-KR

In general, Unicode uptake is increasing rapidly:
http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

Being not familiar with this area, I googled a bit, and I don't see a consensus. But I also noticeably don't see UTF-16. So, if this is the case, then a similar question still arises for CJK text: What format/library to use for it (assuming one doesn't want a performance penalty for translating between Data.Text's internal format and the target format)?
In my opinion, this "performance penalty" hand-wringing is mostly silly. We're talking a pretty small factor of performance difference in most of these cases. Even the biggest difference, between ByteString and String, is usually much less than a factor of 100. Your absolute first concern should be correctness, for which you should (a) use text and (b) assume that any performance issues are being actively worked on, especially if you report concrete problems and how to reproduce them. In the unlikely event that you need to support non-Unicode encodings, they are readily available via text-icu. The only significant change to the text API that lies ahead is an introduction of locale support in a few critical places, so that we can do the right thing for languages like Turkish.

On Sat, Aug 14, 2010 at 16:38, Bryan O'Sullivan
In my opinion, this "performance penalty" hand-wringing is mostly silly. We're talking a pretty small factor of performance difference in most of these cases. Even the biggest difference, between ByteString and String, is usually much less than a factor of 100.
This attitude towards performance, that it doesn't really matter as long as something happens *eventually*, is what pushed me away from Python and towards more performant languages like Haskell in the first place. Sure, you might not notice a few extra seconds when parsing some file on your quad-core developer desktop, but those seconds turn into 20 minutes of lost battery power when running on smaller systems. Having to convert the internal data structure between [Char], (Ptr Word16), and (Ptr Word8) can quickly cause user-visible problems. Libraries which will (by their nature) see heavy use, such as "bytestring" and "text", ought to have much attention paid to their performance characteristics. A factor of 2-3x might be the difference between being able to use a library, and having to rewrite its functionality to be more efficient.
In the unlikely event that you need to support non-Unicode encodings, they are readily available via text-icu.
Unfortunately, text-icu is hardcoded to use libicu 4.0, which was released well over a year ago and is no longer available in many distributions. I sent you a patch to support newer versions a few months ago, but never received a response. Meanwhile, libicu is up to 4.4 by now.

On Sat, Aug 14, 2010 at 5:11 PM, John Millikin
This attitude towards performance, that it doesn't really matter as long as something happens *eventually*, is what pushed me away from Python and towards more performant languages like Haskell in the first place.
But wait, wait - I'm not at all contending that performance doesn't matter! In fact, I spent a couple of months working on criterion precisely because I want to base my own performance work on extremely solid data, and to afford the same opportunity to other people.

So far in this thread, there's been exactly one performance number posted, by Daniel. Not only have I already thanked him for it, I immediately used (and continue to use) it to improve the performance of the text library in that instance.

More broadly, what I am recommending is simple:

- Use a good library.
- Expect good performance out of it.
- Measure the performance you get out of your application.
- If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case.

In the case of the text library, it is often (but not always) competitive with bytestring, and I improve it when I can, especially when given test cases. My goal is for it to be the obvious choice on several fronts:

- Cleanliness of API, where it's already better, but could still improve
- Performance, which is not quite where I want it (target: parity with, or better than, bytestring)
- Quality, where text has slightly more test coverage than bytestring

However, just text alone is a big project, and I could get a lot more done if I was both coding and integrating patches than if coding alone :-) So patches are very welcome.
In the unlikely event that you need to support non-Unicode encodings,
they are readily available via text-icu.
Unfortunately, text-icu is hardcoded to use libicu 4.0, which was released well over a year ago and is no longer available in many distributions. I sent you a patch to support newer versions a few months ago, but never received a response.
Yes, that's quite embarrassing, and I am quite apologetic about it, especially since I just asked for help in the preceding paragraph. If it's any help, there's a story behind my apparent sloth: I overenthusiastically accepted a patch from another contributor a few months before yours, and his changes left the text-icu darcs repo in a mess from which I have yet to rescue it. I do still have your patch, and I'll probably abandon my attempts to clean up the other one, as that was more work than I cared for.

Quoth "Bryan O'Sullivan"
In the case of the text library, it is often (but not always) competitive with bytestring, and I improve it when I can, especially when given test cases. My goal is for it to be the obvious choice on several fronts:
- Cleanliness of API, where it's already better, but could still improve
- Performance, which is not quite where I want it (target: parity with, or better than, bytestring)
- Quality, where text has slightly more test coverage than bytestring
That sounds great, and I'm looking forward to using Text in my application - at least, where I think it would help with respect to correctness. I can't imagine I would unpack all my data right off the socket, or disk, and use Text throughout my application, because I'm skeptical that unpacking megabytes of data from 8 to 16 bits can be done without noticeable impact on resources. I wouldn't imagine I would be filing a bug report on that, because it's a given - if I have a big data load, obviously I should be using ByteString. Am I confused about this?

It's why I can't see Text ever being simply the obvious choice. [Char] will continue to be the obvious choice if you want a functional data type that supports pattern matching etc. ByteString will continue to be the obvious choice for big data loads. We'll have a three-way choice between programming elegance, correctness and efficiency. If Haskell were more than just a research language, this might be its most prominent open sore, don't you think?

Donn Cave, donn@avvanta.com

On Sat, Aug 14, 2010 at 22:07, Donn Cave
Am I confused about this? It's why I can't see Text ever being simply the obvious choice. [Char] will continue to be the obvious choice if you want a functional data type that supports pattern matching etc. ByteString will continue to be the obvious choice for big data loads. We'll have a three way choice between programming elegance, correctness and efficiency. If Haskell were more than just a research language, this might be its most prominent open sore, don't you think?
I don't see why [Char] is "obvious" -- you'd never use [Word8] for storing binary data, right? [Char] is popular because it's the default type for string literals, and due to simple inertia, but when there's a type based on packed arrays there's no reason to use the list representation. Also, despite the name, ByteString and Text are for separate purposes. ByteString is an efficient [Word8], Text is an efficient [Char] -- use ByteString for binary data, and Text for...text. Most mature languages have both types, though the choice of UTF-16 for Text is unusual.
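The division of labour John describes - ByteString at the binary edges, Text in the middle - can be sketched with the text package's encoding functions (a minimal illustration; `shout` and `process` are made-up names):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- The actual text processing happens on Text...
shout :: T.Text -> T.Text
shout = T.toUpper

-- ...while the wire format stays ByteString: decode once at the
-- boundary, encode once on the way out. decodeUtf8 throws on
-- malformed input; decodeUtf8' returns an Either instead.
process :: B.ByteString -> B.ByteString
process = encodeUtf8 . shout . decodeUtf8

main :: IO ()
main = B.putStr (process (encodeUtf8 "hello, text\n"))
```

The point is that the Text/ByteString choice isn't either/or: bytes and text each get the representation built for them, and conversion happens exactly once at each boundary.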

Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
Also, despite the name, ByteString and Text are for separate purposes. ByteString is an efficient [Word8], Text is an efficient [Char] -- use ByteString for binary data, and Text for...text. Most mature languages have both types, though the choice of UTF-16 for Text is unusual.
Given that Python, .NET, Java, and Windows all use UTF-16 for their Unicode text representations, I cannot really agree with "unusual". :-) Cheers, Edward

On Sun, Aug 15, 2010 at 8:39 AM, Edward Z. Yang
Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
Also, despite the name, ByteString and Text are for separate purposes. ByteString is an efficient [Word8], Text is an efficient [Char] -- use ByteString for binary data, and Text for...text. Most mature languages have both types, though the choice of UTF-16 for Text is unusual.
Given that Python, .NET, Java, and Windows all use UTF-16 for their Unicode text representations, I cannot really agree with "unusual". :-)
When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.
Remember, Python, .NET and Java are all imperative languages without referential transparency. I doubt saying they do something some way will influence most Haskell coders much ;). Michael

On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman
When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.
Bear in mind that much of the data you're working with can't be readily trusted. UTF-8 coming from the filesystem, the network, and often the database may not be valid. The cost of validating it isn't all that different from the cost of converting it to UTF-16.

And of course the internals of Data.Text are all fusion-based, so much of the time you're not going to be allocating UTF-16 arrays at all, but instead creating a pipeline of characters that are manipulated in a tight loop. This eliminates a lot of the additional copying that bytestring has to do, for instance.

To give you an idea of how competitive Data.Text can be compared to C code, this is the system's wc command counting UTF-8 characters in a modestly large file:

$ time wc -m huge.txt
32443330
real 0.728s

This is Data.Text performing the same task:

$ time ./FileRead text huge.txt
32443330
real 0.697s
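Bryan didn't post the source of his FileRead benchmark, but a wc -m style character count in the same spirit takes only a few lines with lazy text I/O (an assumed reconstruction, not his actual program):

```haskell
import Data.Int (Int64)
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO
import System.Environment (getArgs)

-- Count Unicode code points, like wc -m. Lazy Text is read in
-- chunks, so memory use stays roughly constant for large files.
countChars :: TL.Text -> Int64
countChars = TL.length

main :: IO ()
main = do
  [path] <- getArgs
  contents <- TLIO.readFile path
  print (countChars contents)
```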

"Bryan" == Bryan O'Sullivan
writes:
Bryan> On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman

Hi Colin,
On Sun, Aug 15, 2010 at 9:34 AM, Colin Paul Adams
But UTF-16 (apart from being an abomination for creating a hole in the codepoint space and making it impossible to ever extend it) is slow to process compared with UTF-32 - you can't get the nth character in constant time, so it seems an odd choice to me.
Aside: Getting the nth character isn't very useful when working with Unicode text:

* Most text processing is linear.
* What we consider a character and what Unicode considers a character differ a bit, e.g. because Unicode uses combining characters.

Cheers, Johan
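Johan's combining-character point in miniature: 'é' can be a single precomposed code point, or an 'e' followed by a combining acute accent - one user-perceived character, but two code points, so code-point indexing doesn't line up with what a reader sees (a small illustration using text):

```haskell
import qualified Data.Text as T

-- U+00E9, the precomposed LATIN SMALL LETTER E WITH ACUTE
precomposed :: T.Text
precomposed = T.pack "\x00E9"

-- 'e' followed by U+0301 COMBINING ACUTE ACCENT: renders the same
-- glyph, but is two code points
decomposed :: T.Text
decomposed = T.pack "e\x0301"

main :: IO ()
main = do
  print (T.length precomposed)  -- 1
  print (T.length decomposed)   -- 2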
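Johan's combining-character point in miniature: 'é' can be a single precomposed code point, or an 'e' followed by a combining acute accent - one user-perceived character, but two code points, so code-point indexing doesn't line up with what a reader sees (a small illustration using text):

```haskell
import qualified Data.Text as T

-- U+00E9, the precomposed LATIN SMALL LETTER E WITH ACUTE
precomposed :: T.Text
precomposed = T.pack "\x00E9"

-- 'e' followed by U+0301 COMBINING ACUTE ACCENT: renders the same
-- glyph, but is two code points
decomposed :: T.Text
decomposed = T.pack "e\x0301"

main :: IO ()
main = do
  print (T.length precomposed)  -- 1
  print (T.length decomposed)   -- 2
```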

On Sat, Aug 14, 2010 at 22:39, Edward Z. Yang
Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
Also, despite the name, ByteString and Text are for separate purposes. ByteString is an efficient [Word8], Text is an efficient [Char] -- use ByteString for binary data, and Text for...text. Most mature languages have both types, though the choice of UTF-16 for Text is unusual.
Given that both Python, .NET, Java and Windows use UTF-16 for their Unicode text representations, I cannot really agree with "unusual". :-)
Python doesn't use UTF-16; on UNIX systems it uses UCS-4, and on Windows it uses UCS-2. The difference is important because:

Python: len("\U0001dd1e") == 2
Haskell: length (pack "\x0001dd1e") == 1

Java, .NET, Windows, JavaScript, and some other languages use UTF-16 because when Unicode support was added to these systems, the astral characters had not been invented yet, and 16 bits was enough for the entire Unicode character set. They originally used UCS-2, but then moved to UTF-16 to minimize incompatibilities. Anything based on UNIX generally uses UTF-8, because Unicode support was added later, after the problems of UCS-2/UTF-16 had been discovered. C libraries written by UNIX users use UTF-8 almost exclusively -- this includes most language bindings available on Hackage.

I don't mean that UTF-16 is itself unusual, but it's a legacy encoding -- there's no reason to use it in new projects. If "text" had been started 15 years ago, I could understand, but since it's still in active development, the use of UTF-16 simply adds baggage.
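John's comparison can be checked concretely: text's length counts code points, so an astral character - here U+1D11E, a different character from his example but in the same plane - counts as one, even though its UTF-16 form is a surrogate pair (sketch):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16LE)

-- U+1D11E MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane
clef :: T.Text
clef = T.pack "\x1D11E"

main :: IO ()
main = do
  print (T.length clef)                  -- 1: length counts code points
  print (B.length (encodeUtf16LE clef))  -- 4 bytes: a UTF-16 surrogate pair
```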

Quoth John Millikin
I don't see why [Char] is "obvious" -- you'd never use [Word8] for storing binary data, right? [Char] is popular because it's the default type for string literals, and due to simple inertia, but when there's a type based on packed arrays there's no reason to use the list representation.
Well, yes, string literals - and pattern matching support, maybe that's the same thing. And I think it's fair to say that [Char] is a natural, elegant match for the language, I mean it leverages your basic Haskell skills if, for example, you want to parse something fairly simple. So even if ByteString weren't the monumental hassle it is today for simple stuff, String would have at least a little appeal. And if packed arrays really always mattered, [Char] would be long gone. They don't; you can do a lot of stuff with [Char] before it turns into a problem.
Also, despite the name, ByteString and Text are for separate purposes. ByteString is an efficient [Word8], Text is an efficient [Char] -- use ByteString for binary data, and Text for...text. Most mature languages have both types, though the choice of UTF-16 for Text is unusual.
Maybe most mature languages have one or more extra string types hacked on to support wide characters. I don't think it's necessarily a virtue. ByteString vs. ByteString.Char8, where you can choose more or less indiscriminately to treat the data as Char or Word8, seems to me like a more useful way to approach the problem. (Of course, ByteString.Char8 isn't a good way to deal with wide characters correctly, I'm just saying that's where I'd like to find the answer, not in some internal character encoding into which all "text" data must be converted.) Donn Cave, donn@avvanta.com

On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave
Am I confused about this? It's why I can't see Text ever being
simply the obvious choice. [Char] will continue to be the obvious
choice if you want a functional data type that supports pattern matching etc.
Actually, with view patterns, Text is pretty nice to pattern match against:

foo (uncons -> Just (c,cs)) = "whee"
despam (prefixed "spam" -> Just suffix) = "whee" `mappend` suffix

ByteString will continue to be the obvious choice for big data loads.

Don't confuse "I have big data" with "I need bytes". If you are working with bytes, use bytestring. If you are working with text, outside of a few narrow domains you should use text.

We'll have a three way choice between programming elegance, correctness and efficiency. If Haskell were more than just a research language, this might be its most prominent open sore, don't you think?

No, that's just FUD.
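Note that `prefixed` in Bryan's view-pattern sketch is not an actual function in the text API; the library's counterpart is `stripPrefix`. A compilable version of his sketch, with that substitution and made-up function names:

```haskell
{-# LANGUAGE ViewPatterns, OverloadedStrings #-}
import Data.Text (Text, uncons, stripPrefix)
import qualified Data.Text as T

-- Peel the first character off a Text, list-style.
firstChar :: Text -> String
firstChar (uncons -> Just (c, _)) = "starts with " ++ [c]
firstChar _                       = "empty"

-- Match on a known prefix, binding the rest.
despam :: Text -> Text
despam (stripPrefix "spam" -> Just suffix) = "whee" `T.append` suffix
despam t                                   = t

main :: IO ()
main = do
  putStrLn (firstChar "hello")
  putStrLn (T.unpack (despam "spam and eggs"))
```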

On 8/15/10 03:01, Bryan O'Sullivan wrote:
On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave wrote:

We'll have a three way choice between programming elegance, correctness and efficiency. If Haskell were more than just a research language, this might be its most prominent open sore, don't you think?

No, that's just FUD.
More to the point, there's nothing elegant about [Char] --- its sole "advantage" is requiring no thought.

-- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions. On Sun, Aug 15, 2010 at 11:17 AM, Brandon S Allbery KF8NH <allbery@ece.cmu.edu> wrote:
More to the point, there's nothing elegant about [Char] --- its sole "advantage" is requiring no thought.

Quoth Bill Atkins
No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions.
Yes, they're great - and a terrible mistake for a practical programming language - but if you fail to recognize the attraction, you miss some of the historical lesson on emphasizing elegance and correctness over practical performance. Donn Cave, donn@avvanta.com

Donn Cave wrote:
Quoth Bill Atkins
, No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions.
Yes, they're great - and a terrible mistake for a practical programming language - but if you fail to recognize the attraction, you miss some of the historical lesson on emphasizing elegance and correctness over practical performance.
And if you fail to recognise what a grave mistake placing performance before correctness is, you end up with things like buffer overflow exploits, SQL injection attacks, the Y2K bug, programs that can't handle files larger than 2GB or that don't understand Unicode, and so forth. All things that could have been almost trivially avoided if everybody wasn't so hung up on absolute performance at any cost. Sure, performance is a priority. But it should never be the top priority. ;-)

On 8/15/10 13:53, Andrew Coppin wrote:
injection attacks, the Y2K bug, programs that can't handle files larger than 2GB or that don't understand Unicode, and so forth. All things that could have been almost trivially avoided if everybody wasn't so hung up on absolute performance at any cost.
Now that's a bit unfair; nobody imagined back when lseek() was enshrined in the Unix API that it would still be in use when a (long) wasn't big enough :) (Remember that Unix is itself a practical example of a research platform "avoiding success at any cost" gone horribly wrong.)

On 8/15/10 14:34, Andrew Coppin wrote:
Brandon S Allbery KF8NH wrote:
(Remember that Unix is itself a practical example of a research platform "avoiding success at any cost" gone horribly wrong.)
I haven't used Erlang myself, but I've heard it described in a similar way. (I don't know how true that actually is...)
Similar case, actually: internal research project with internal practical uses, then got discovered and "productized" by a different internal group.

Quoth Andrew Coppin
And if you fail to recognise what a grave mistake placing performance before correctness is, you end up with things like buffer overflow exploits, SQL injection attacks, the Y2K bug, programs that can't handle files larger than 2GB or that don't understand Unicode, and so forth. All things that could have been almost trivially avoided if everybody wasn't so hung up on absolute performance at any cost.
Sure, performance is a priority. But it should never be the top priority. ;-)
You should never have to choose. Not to belabor the point, but to dismiss all that as the work of morons who weren't as wise as we are is the same mistake from the other side of the wall - performance counts. If you solve the problem by assigning a priority to one or the other, you aren't solving the problem. Donn Cave, donn@avvanta.com

On 8/15/10 11:25, Bill Atkins wrote:
No, not really. Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions.
On Sun, Aug 15, 2010 at 11:17 AM, Brandon S Allbery KF8NH wrote:

More to the point, there's nothing elegant about [Char] --- its sole "advantage" is requiring no thought.
Except that it seems to me that a number of functions in Data.List are really functions on Strings and not especially useful on generic lists. There is overlap but it's not as large as might be thought.

Quoth "Bryan O'Sullivan"
On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave
wrote: ... ByteString will continue to be the obvious choice for big data loads.
Don't confuse "I have big data" with "I need bytes". If you are working with bytes, use bytestring. If you are working with text, outside of a few narrow domains you should use text.
I wonder how many ByteString users are `working with bytes', in the sense you apparently mean where the bytes are not text characters. My impression is that in practice, there is a sizeable contingent out here using ByteString.Char8 and relatively few applications for the Word8 type. Some of it should no doubt move to Text, but the ability to work with native packed data - minimal processing and space requirements, interoperability with foreign code, mmap, etc. - is attractive enough that the choice can be less than obvious. Donn Cave, donn@avvanta.com

On Sun, Aug 15, 2010 at 12:50 PM, Donn Cave
I wonder how many ByteString users are `working with bytes', in the sense you apparently mean where the bytes are not text characters. My impression is that in practice, there is a sizeable contingent out here using ByteString.Char8 and relatively few applications for the Word8 type. Some of it should no doubt move to Text, but the ability to work with native packed data - minimal processing and space requirements, interoperability with foreign code, mmap, etc. - is attractive enough that the choice can be less than obvious.
Using ByteString.Char8 doesn't mean your data isn't a stream of bytes, it means that it is a stream of bytes but for convenience you prefer using Char8 functions. For example, a DNA sequence (AATCGATACATG...) is a stream of bytes, but it is better to write 'A' than 65. But yes, many users of ByteStrings should be using Text. =) Cheers! -- Felipe.
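Felipe's DNA example in code - the data is one byte per base, with no text encoding involved, but Char8 lets you write 'A' rather than 65 (a small sketch):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C

-- One byte per base; this is byte data, not "text" in any encoding sense.
dna :: C.ByteString
dna = "AATCGATACATG"

main :: IO ()
main = do
  print (C.count 'A' dna)  -- count adenine bases by Char...
  print (B.count 65 dna)   -- ...or by the raw byte value, same answer
```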

Donn Cave wrote:
I wonder how many ByteString users are `working with bytes', in the sense you apparently mean where the bytes are not text characters. My impression is that in practice, there is a sizeable contingent out here using ByteString.Char8 and relatively few applications for the Word8 type. Some of it should no doubt move to Text, but the ability to work with native packed data - minimal processing and space requirements, interoperability with foreign code, mmap, etc. - is attractive enough that the choice can be less than obvious.
I use ByteString for various binary-processing stuff. I also use it for string-processing, but that's mainly because I didn't know anything else existed. I'm sure lots of other people are using stuff like Data.Binary to serialise raw binary data using ByteString too.

On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan
- If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case.
As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?

GNU wc -m:
- en_US.UTF-8: 0.701s

text 0.7.1.0:
- lazy text: 1.959s
- strict text: 3.527s

darcs HEAD:
- lazy text: 0.749s
- strict text: 0.927s

On Sunday 15 August 2010 20:04:01, Bryan O'Sullivan wrote:
On Sat, Aug 14, 2010 at 6:05 PM, Bryan O'Sullivan
wrote: - If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case.
As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?
GNU wc -m:
- en_US.UTF-8: 0.701s
text 0.7.1.0:
- lazy text: 1.959s - strict text: 3.527s
darcs HEAD:
- lazy text: 0.749s - strict text: 0.927s
That's great. If that performance difference is a show stopper, one shouldn't go higher-level than C anyway :) (doesn't mean one should stop thinking about further speed-up, though) Out of curiosity, what kind of speed-up did your Friday fix bring to the searching/replacing functions?

On Sunday 15 August 2010 20:53:32, Bryan O'Sullivan wrote:
On Sun, Aug 15, 2010 at 11:39 AM, Daniel Fischer
wrote: Out of curiosity, what kind of speed-up did your Friday fix bring to the searching/replacing functions?
Quite a bit!
text 0.7.1.0 and 0.7.2.1:
- 1.056s
darcs HEAD:
- 0.158s
Awesome :D

Hello Daniel, Sunday, August 15, 2010, 10:39:24 PM, you wrote:
That's great. If that performance difference is a show stopper, one shouldn't go higher-level than C anyway :)
*all* speed measurements that find Haskell is as fast as C, was broken. Let's see:

D:\testing>read MsOffice.arc
MsOffice.arc 317mb -- Done
Time 0.407021 seconds (timer accuracy 0.000000 seconds)
Speed 779.505632 mbytes/sec

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi Bulat, On Monday 16 August 2010 07:35:44, Bulat Ziganshin wrote:
Hello Daniel,
Sunday, August 15, 2010, 10:39:24 PM, you wrote:
That's great. If that performance difference is a show stopper, one shouldn't go higher-level than C anyway :)
*all* speed measurements that find Haskell is as fast as C, was broken.
That's a pretty bold claim, considering that you probably don't know all such measurements ;) But let's get serious. Bryan posted measurements showing the text (HEAD) package's performance within a reasonable factor of wc's. (Okay, he didn't give a complete description of his test, so we can only assume that all participants did the same job. I'm bold enough to assume that.) Lazy text was 7% slower than wc, strict 30%. If you are claiming that his test was flawed (and since the numbers clearly showed Haskell slower than C, just not much, I suspect you do, otherwise I don't see the point of your post), could you please elaborate why you think it's flawed?
Let's see:
D:\testing>read MsOffice.arc
MsOffice.arc 317mb -- Done
Time 0.407021 seconds (timer accuracy 0.000000 seconds)
Speed 779.505632 mbytes/sec
I see nothing here, not knowing what `read' is. None of read (n), read (2), read (1p), read (3p) makes sense here, so it must be something else. Since it outputs a size in bytes, I doubt that it actually counts characters, like wc -m and, presumably, the text programmes Bryan benchmarked do. Just counting bytes is something wc and Data.ByteString[.Lazy] can do much faster than counting characters, too.
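The distinction Daniel draws - byte counts versus character counts - shows up as soon as the input leaves ASCII (sketch):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8)

sample :: T.Text
sample = T.pack "na\xEFve"  -- five characters, one of them non-ASCII

main :: IO ()
main = do
  print (T.length sample)               -- 5 characters
  print (B.length (encodeUtf8 sample))  -- 6 bytes: U+00EF takes two in UTF-8
```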

On 16.08.10 14:44, Daniel Fischer wrote:

Hi Bulat, On Monday 16 August 2010 07:35:44, Bulat Ziganshin wrote:

Hello Daniel,

Sunday, August 15, 2010, 10:39:24 PM, you wrote:

That's great. If that performance difference is a show stopper, one shouldn't go higher-level than C anyway :)

*all* speed measurements that find Haskell is as fast as C, was broken.

That's a pretty bold claim, considering that you probably don't know all such measurements ;)

[...] If you are claiming that his test was flawed (and since the numbers clearly showed Haskell slower than C, just not much, I suspect you do, otherwise I don't see the point of your post), could you please elaborate why you think it's flawed?

Hi Daniel, you are right, the throughput of 'cat' (as proposed by Bulat) is not a fair comparison, and 'all speed measurements favoring haskell are broken' is hardly a reasonable argument. However, 'wc -m' is indeed a rather slow way to count the number of UTF-8 characters. Python, for example, is quite a bit faster (1.60s vs 0.93s for 70M) on my machine[1,2]. Despite of all this, I think the performance of the text package is very promising, and hope it will improve further!

cheers, benedikt

[1] A special purpose C implementation (as the one presented here: http://canonical.org/~kragen/strlen-utf8.html) is even faster (0.50), but that's not a fair comparison either.
[2] I do not know Python, so maybe there is an even faster way than print len(sys.stdin.readline().decode('utf-8'))

Benedikt Huber
Despite of all this, I think the performance of the text package is very promising, and hope it will improve further!
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes. A large fraction - probably most - textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields).

For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, "real" text only making up a few percent of this. The combined (all languages) Wikipedia is 2G words, probably less than 20GB.

Being agnostic about string encoding - viz. treating it as bytes - works okay, but it would be nice to allow Unicode in the bits that actually are text, like string fields and labels and such. Due to the sizes involved, I think that in order to efficiently process text-formatted data, UTF-8 is the no-brainer choice for encoding -- certainly in storage, but also for in-memory processing. Unfortunately, there is no clear Data.Text-like effort here. There's (at least):

utf8-string - provides UTF-8 encoded lazy and strict bytestrings as well as some other data types (and a common class) and System.Environment functionality.
utf8-light - provides encoding/decoding to/from (strict?) bytestrings
regex-tdfa-utf8 - regular expressions on UTF-8 encoded lazy bytestrings
utf8-env - provides a UTF-8 aware System.Environment
uhexdump - hex dumps for UTF-8 (?)
compact-string - support for many different string encodings
compact-string-fix - indicates that the above is unmaintained
From a quick glance, it appears that utf8-string is the most complete and well maintained of the crowd, but I could be wrong. It'd be nice if a similar effort as Data.Text has seen could be applied to e.g. utf8-string, to produce a similarly efficient and effective library and allow the deprecation of the others. IMO, this could in time replace .Char8 as the default ByteString string representation. Hackathon, anyone?
-k -- If I haven't seen further, it is by standing in the footprints of giants

On Tue, Aug 17, 2010 at 10:08 AM, Ketil Malde
Benedikt Huber
writes: Despite of all this, I think the performance of the text package is very promising, and hope it will improve further!
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
[..]
From a quick glance, it appears that utf8-string is the most complete and well maintained of the crowd, but I could be wrong. It'd be nice if a similar effort as Data.Text has seen could be applied to e.g. utf8-string, to produce a similarly efficient and effective library and allow the deprecation of the others. IMO, this could in time replace .Char8 as the default ByteString string representation. Hackathon, anyone?
Let me ask the question a different way: what are the motivations for having the text package use UTF-16 internally? I know that some system APIs in Windows use it (at least, I think they do), and perhaps it's more efficient for certain types of processing, but overall do those benefits outweigh all of the reasons for UTF-8 pointed out in this thread?
Michael

On Tue, Aug 17, 2010 at 9:08 AM, Ketil Malde
Benedikt Huber
writes: Despite of all this, I think the performance of the text package is very promising, and hope it will improve further!
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower. If we could get conclusive evidence that using UTF-16 hurts performance, we could look into changing the internal representation (a major undertaking). What Bryan and I need is benchmarks showing where Data.Text is performing poorly, compared to String or ByteString, so we can investigate the cause(s).

Hypotheses are a good starting point for performance improvements, but they're not enough. We need benchmarks and people looking at profiling and compiler output to really understand what's going on. For example, how many know that the Handle implementation copies the input first into a mutable buffer and then into a Text value, for reads less than the buffer size (8k if I remember correctly)? One of these copies could be avoided. How do we know that it's using UTF-16 that's our current performance bottleneck and not this extra copy? We need to benchmark, change the code, and then benchmark again.

Perhaps the outcome of all the benchmarking and investigation is indeed that UTF-16 is a problem; then we can change the internal encoding. But there are other possibilities, like poorly laid out branches in the generated code. We need to understand what's going on if we are to make progress.

A large fraction - probably most - textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields).
For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, "real" text only making up a few percent of this. The combined (all languages) Wikipedia is 2G words, probably less than 20GB.
I think this is an important observation. Cheers, Johan

Hello Johan, Tuesday, August 17, 2010, 12:20:37 PM, you wrote:
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower.
not slower but require 2x more memory. speed is the same since Unicode contains 2^20 codepoints -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Tue, Aug 17, 2010 at 10:34, Bulat Ziganshin
Hello Johan,
Tuesday, August 17, 2010, 12:20:37 PM, you wrote:
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower.
not slower but require 2x more memory. speed is the same since Unicode contains 2^20 codepoints
This is not entirely correct, because it all depends on your data. For western languages it normally holds true that UTF-16 occupies twice the memory of UTF-8, but for other languages code points might take up to 3 bytes (I thought even 4, but the wikipedia page only mentions 3: http://en.wikipedia.org/wiki/UTF-8). That wikipedia page is a nice read anyway; it mentions some of the advantages and disadvantages of the different encodings. (The complexity of the code that determines the length of a UTF string depends on the encoding, for example.) Cheers, -Tako
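The variable widths Tako mentions are easy to tabulate with text's encoders: UTF-8 spends one to four bytes per code point (three for most CJK, four for characters outside the BMP, which settles the "3 or 4" question), while UTF-16 spends two or four (a sketch; the encoded sizes shown are standard UTF-8/UTF-16 facts):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

utf8Bytes, utf16Bytes :: String -> Int
utf8Bytes  = B.length . encodeUtf8    . T.pack
utf16Bytes = B.length . encodeUtf16LE . T.pack

main :: IO ()
main = mapM_ report ["a", "\xE9", "\x3042", "\x1D11E"]
  where
    report s = putStrLn (s ++ ": UTF-8 " ++ show (utf8Bytes s)
                           ++ " bytes, UTF-16 " ++ show (utf16Bytes s) ++ " bytes")

-- a (ASCII):          UTF-8 1, UTF-16 2
-- U+00E9 (Latin-1):   UTF-8 2, UTF-16 2
-- U+3042 (hiragana):  UTF-8 3, UTF-16 2
-- U+1D11E (astral):   UTF-8 4, UTF-16 4
```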

Hello Tako, Tuesday, August 17, 2010, 12:46:35 PM, you wrote:
not slower but require 2x more memory. speed is the same since Unicode contains 2^20 codepoints
This is not entirely correct because it all depends on your data.
of course i mean ascii chars -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi Bulat,

On Tue, Aug 17, 2010 at 10:34 AM, Bulat Ziganshin wrote:

It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower.

not slower but require 2x more memory. speed is the same since Unicode contains 2^20 codepoints

Yes, in theory a program could use as much as 2x the memory. That being said, most programs don't hold that much text data in memory at any given point, so that might be 2x of a small number. One experiment [1] found it difficult to show any difference in memory usage at all in Trac when switching Python's internal representation from UCS2 to UCS4.

So it's not clear to me that using UTF-16 makes the program noticeably slower or use more memory on a real program.

1. http://www.cmlenz.net/archives/2008/07/the-truth-about-unicode-in-python

Cheers,
Johan

Hello Johan, Tuesday, August 17, 2010, 1:06:30 PM, you wrote:
So it's not clear to me that using UTF-16 makes the program noticeably slower or use more memory on a real program.
it's clear misunderstanding. of course, not every program holds much text data in memory. but some does, and here you will double memory usage -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Johan wrote:
So it's not clear to me that using UTF-16 makes the program noticeably slower or use more memory on a real program.
it's clear misunderstanding. of course, not every program holds much text data in memory. but some does, and here you will double memory usage
I write programs that hold onto quite a good deal of natural language text; a few million words at least. Getting efficient Unicode for that is a high priority. However, all of that text is in Japanese, Chinese, Arabic, Hindi, Urdu,... That's the reason I want Unicode. I'm pretty sure UTF-16 isn't going to be causing any special problems here.

For NLP work, any language with a vaguely ASCII format isn't a problem. We've been shoving English and western European languages into a subset of ASCII for years (heck, we don't even allow real parentheses!). For the mostly English files on my harddrive, UTF-8 is a clear win. But when it comes to programming, I'm not so sure. I'd like to see some good benchmarks and a clear explanation of where the costs are. Relying on intuitions is notoriously bad for these kinds of encoding issues.

-- Live well, ~wren

Johan Tibell
It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower.
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8. Many applications will get away with streaming over data, retaining only a small part, but some won't. In other cases (e.g. processing CJK text, and perhap also non-Latin1 text), I'm sure it'll be faster - but my (still unsubstantiated) guess is that the difference will be much smaller, and it'll be a case of winning some and losing some - and I'd also conjecture that having 3Gb "real" text (i.e. natural language, as opposed to text-formatted data) is rare. I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8. Alternatively, we can have different libraries with different representations for different purposes, where you'll get another few percent of juice by switching to the most appropriate. Currently the latter approach looks to be in favor, so if we can't have one single library, let us at least aim for a set of libraries with consistent interfaces and optimal performance. Data.Text is great for UTF-16, and I'd like to have something similar for UTF-8. Is all I'm trying to say. -k -- If I haven't seen further, it is by standing in the footprints of giants

Ketil Malde
Johan Tibell
writes: It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower.
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8. Many applications will get away with streaming over data, retaining only a small part, but some won't.
Seeing as how the genome just uses 4 base "letters", wouldn't it be better to not treat it as text but use something else? Or do you just mean storage-wise to be able to be read in a text editor, etc. as well (in case someone is trying to do their mad genetic manipulation by hand)? -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Ivan Lazar Miljenovic
Seeing as how the genome just uses 4 base "letters",
Yes, the bulk of the data is not really "text" at all, but each sequence (it's fragmented due to the molecular division into chromosomes, and due to incompleteness) also has a textual header. Generally, the Fasta format looks like this:
>sequence-id some arbitrary metadata blah blah
ACGATATACGCGCATGCGAT...
..lines and lines of letters...
(As an aside, although there are only four nucleotides (ACGT), there are occasional wildcard characters, the most common being N for aNy nucleotide, but there are defined wildcards for all subsets of the alphabet.)
wouldn't it be better to not treat it as text but use something else?
I generally use ByteStrings, with the .Char8 interface if/when appropriate. This is actually a pretty good choice; even if people use Unicode in the headers, I don't particularly want to care - as long as it is transparent. In some cases, I'd like to, say, search headers for some specific string - in these cases, a nice, tidy, rich, and optimized Data.ByteString(.Lazy).UTF8 would be nice. (But obviously not terribly essential at the moment, since I haven't bothered to test the available options. I guess for my stuff, the (human consumable) text bits are neither very performance intensive, nor large, so I could probably and fairly cheaply wrap relevant operations or fields with Data.Text's {de,en}codeUtf8. And in practice - partly due to lacking software support, I'm sure - it's all ASCII anyway. :-) It'd be nice to have efficient substring searches and regular expression, etc for the sequence data, but often this will be better addressed by more specific algorithms, and in any case, a .Char8 implementation is likely to be more efficient than any gratuitous Unicode encoding.
(in case someone is trying to do their mad genetic manipulation by hand)?
You'd be surprised what a determined biologist can achieve, armed only with Word, Excel, and a reckless disregard for surmountability. -k -- If I haven't seen further, it is by standing in the footprints of giants
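Ketil's suggestion of cheaply wrapping header fields with Data.Text's {de,en}codeUtf8 might look something like this sketch (headerContains is a hypothetical helper; it assumes the header bytes are valid UTF-8, of which ASCII is a subset):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: search a FASTA header (stored as a ByteString)
-- in a Unicode-aware fashion by decoding just that field to Text.
headerContains :: T.Text -> B.ByteString -> Bool
headerContains needle = T.isInfixOf needle . TE.decodeUtf8
```

Since decoding happens only on the (small) header field, the bulk sequence data stays as raw bytes.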

Hello, Ketil Malde!
On Tue, Aug 17, 2010 at 8:02 AM, Ketil Malde
Ivan Lazar Miljenovic
writes: Seeing as how the genome just uses 4 base "letters",
Yes, the bulk of the data is not really "text" at all, but each sequence (it's fragmented due to the molecular division into chromosomes, and due to incompleteness) also has a textual header. Generally, the Fasta format looks like this:
>sequence-id some arbitrary metadata blah blah ACGATATACGCGCATGCGAT... ..lines and lines of letters...
(As an aside, although there are only four nucleotides (ACGT), there are occasional wildcard characters, the most common being N for aNy nucleotide, but there are defined wildcards for all subsets of the alphabet.)
As someone who knows and uses your bio package, I'm almost certain that Text really isn't the right data type for representing everything. Certainly *not* for the genomic data itself. In fact, a representation using 4 bits per base (4 nucleotides plus 12 other characters, such as gaps and aNy) is easy to represent using ByteStrings with two bases per byte and should halve the space requirements.

However, the header of each sequence is text, in the sense of human language text, and ideally should be represented using Text. In other words, the sequence data type[1] currently is defined as:

type SeqData = Data.ByteString.Lazy.ByteString
type QualData = Data.ByteString.Lazy.ByteString
data Sequence t = Seq !SeqData !SeqData !(Maybe QualData)

[1] http://hackage.haskell.org/packages/archive/bio/0.4.6/doc/html/Bio-Sequence-...

where the meaning is that in 'Seq header seqdata qualdata', 'header' would be something like "sequence-id some arbitrary metadata blah blah" and 'seqdata' would be "ACGATATACGCGCATGCGAT". But perhaps we should really have:

type SeqData = Data.ByteString.Lazy.ByteString
type QualData = Data.ByteString.Lazy.ByteString
type HeaderData = Data.Text.Text -- strict is prolly a good choice here
data Sequence t = Seq !HeaderData !SeqData !(Maybe QualData)

Semantically, this is the right choice, putting Text where there is text. We can read everything with ByteStrings and then use[2]

decodeUtf8 :: ByteString -> Text

[2] http://hackage.haskell.org/packages/archive/text/0.7.2.1/doc/html/Data-Text-...

only for the header bits. There is only one problem in this approach: UTF-8 for the input FASTA file would be hardcoded. Considering that probably nobody will be using UTF-16 or UTF-32 for the whole FASTA file, there remains only UTF-8 (of which ASCII is just a special case) and other 8-bit encodings (such as ISO8859-1, Shift-JIS, etc.). 
I haven't seen a FASTA file with characters outside the ASCII range yet, but I guess the choice of UTF-8 shouldn't be a big problem.
wouldn't it be better to not treat it as text but use something else?
I generally use ByteStrings, with the .Char8 interface if/when appropriate. This is actually a pretty good choice; even if people use Unicode in the headers, I don't particularly want to care - as long as it is transparent. In some cases, I'd like to, say, search headers for some specific string - in these cases, a nice, tidy, rich, and optimized Data.ByteString(.Lazy).UTF8 would be nice. (But obviously not terribly essential at the moment, since I haven't bothered to test the available options. I guess for my stuff, the (human consumable) text bits are neither very performance intensive, nor large, so I could probably and fairly cheaply wrap relevant operations or fields with Data.Text's {de,en}codeUtf8. And in practice - partly due to lacking software support, I'm sure - it's all ASCII anyway. :-)
Oh, so I didn't read this paragraph closely enough :). In this e-mail I'm basically agreeing with your thoughts here =). And what do you think about creating a real SeqData data type with two bases per byte? In terms of processing speed I guess there will be a small penalty, but if you need to have large quantities of base pairs in memory this would double your capacity =). Cheers, -- Felipe.
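The two-bases-per-byte packing Felipe asks about could be sketched as below. The 4-bit alphabet ordering here is an arbitrary assumption for illustration (it is not the 2bit format or anything the bio package defines), and this naive scheme discards lower-case masking:

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.|.), (.&.))
import Data.Char (toUpper)
import Data.List (elemIndex)
import Data.Maybe (fromMaybe)
import Data.Word (Word8)

-- Hypothetical 4-bit alphabet: 4 nucleotides plus IUPAC wildcards.
alphabet :: String
alphabet = "ACGTURYSWKMBDHVN"

-- Map a base to its 4-bit code; unknown characters map to N.
code :: Char -> Word8
code c = fromIntegral (fromMaybe 15 (elemIndex (toUpper c) alphabet))

-- Pack two bases per byte, halving the space of a Char8 ByteString.
packSeq :: String -> B.ByteString
packSeq = B.pack . go
  where
    go (a:b:rest) = (code a `shiftL` 4 .|. code b) : go rest
    go [a]        = [code a `shiftL` 4]
    go []         = []

-- Unpack, given the original length (needed when it is odd).
unpackSeq :: Int -> B.ByteString -> String
unpackSeq n = take n . concatMap expand . B.unpack
  where
    expand w = [ alphabet !! fromIntegral (w `shiftR` 4)
               , alphabet !! fromIntegral (w .&. 0x0F) ]
```

For n bases this needs ceil(n/2) bytes, half of a Char8 ByteString, at the cost of a little bit-twiddling on every access.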

Felipe Lessa
And what do you think about creating a real SeqData data type with two bases per byte? In terms of processing speed I guess there will be a small penalty, but if you need to have large quantities of base pairs in memory this would double your capacity =).
Yes, this is interesting in some cases. Obvious downsides would be a separate data type for protein sequences (20 characters, plus some wildcards), and more complicated string comparison (when a match is off by one). Oh, and lower case is sometimes used to signify less "important" regions, like repeats. Another choice is the 2bit format (used by BLAT, and supported in Bio for input/output, but not internally), which stores the alphabet proper directly in 2-bit quantities, and uses separate lists for gaps, lower-case masking, and Ns (and is obviously extensible to wildcards). Too much extending, and you're likely to lose any benefit, though. Basically, it boils down to a set of different tradeoffs, and I think ByteString is a fairly good choice in *most* cases, and it deals - if not particularly elegantly, then at least fairly effectively - with various conventions, like lower-casing or wild cards. -k -- If I haven't seen further, it is by standing in the footprints of giants


Ketil Malde wrote:
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language.
I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8.
Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias. In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future.
Alternatively, we can have different libraries with different representations for different purposes, where you'll get another few percent of juice by switching to the most appropriate.
Currently the latter approach looks to be in favor, so if we can't have one single library, let us at least aim for a set of libraries with consistent interfaces and optimal performance. Data.Text is great for UTF-16, and I'd like to have something similar for UTF-8. Is all I'm trying to say.
I agree. Thanks, Yitz

On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale wrote:

Ketil Malde wrote:
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language.
I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8.
Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias.
In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future.

I think you are conflating two points here, and ignoring some important data. Regarding the data: you haven't actually quoted any statistics about

Michael Snoyman
I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
*ahem* http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and... -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Michael Snoyman
writes: I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere,
*ahem* http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and...
On Tue, Aug 17, 2010 at 2:20 PM, Ivan Lazar Miljenovic <ivan.miljenovic@gmail.com> wrote:

I was talking about the contents of the files, not the file names or how the system calls work. I know at least on Windows, Linux and FreeBSD, if you open up the default text editor, type in a few letters and hit save, the file will not be in UTF-16. Michael

Michael Snoyman wrote:
On Tue, Aug 17, 2010 at 2:20 PM, Ivan Lazar Miljenovic < ivan.miljenovic@gmail.com> wrote:
Michael Snoyman
writes: I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, *ahem* http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and...
I was talking about the contents of the files, not the file names or how the system calls work. I know at least on Windows, Linux and FreeBSD, if you open up the default text editor, type in a few letters and hit save, the file will not be in UTF-16.
OSX, TextEdit, plain text mode is UTF-16 and cannot be altered. Also, if you load a UTF-8 plain text file in TextEdit it will be garbled because it assumes UTF-16. For html files you can choose the encoding, which defaults to UTF-8. But for plain text, it's always UTF-16. OSX is also fond of UTF-16 in Cocoa... -- Live well, ~wren

On Tue, Aug 17, 2010 at 13:00, Michael Snoyman
On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale wrote:
Ketil Malde wrote:
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language.
As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
Regardless of the outcome of that investigation (which in itself is interesting) I have to agree with Yitzchak that the human genome (or any other ASCII-based data that is not necessarily a representation of written human language) is not a good fit for the Text package. A package like this should IMHO be good at handling human language, as many of them as possible, and support the common operations as efficiently as possible: sorting, upper/lowercase (where those exist), finding word boundaries, whatever.

Parsing some kind of file containing the human genome and the like I think would be much better served by a package focusing on handling large streams of bytes. No encodings to worry about, no parsing of the stream to determine code points, no calculations to determine string lengths. If you need to convert things to upper/lower case or do sorting you can just fall back on simple ASCII processing, no need to depend on a package dedicated to human text processing.

I do think that in-memory processing of Unicode is better served with UTF-16 than UTF-8 because except in very rare circumstances you can just treat the text as an array of Char. You can't do that for UTF-8, so the efficiency of the algorithms would suffer. I also think that the memory problem is much easier worked around (for example by dividing the problem in smaller parts) than sub-optimal string processing because of increased complexity. -Tako

Michael Snoyman
As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16.
With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format): I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms. Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. Thus, the information density of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to 15/16 vs 6/16 for UTF-16. In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Tue, Aug 17, 2010 at 13:40, Ketil Malde
Michael Snoyman
writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16.
With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format):
I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages.
Thing is that here you're only talking about size optimizations, for somebody having to handle a lot of international texts (and I'm not necessarily talking about Chinese or Japanese here) it would be important that this is handled in the most efficient way possible, because in the end storing and retrieving you only do once each while maybe doing a lot of processing in between. And the on-disk storage or the over-the-wire format might very well be different than the in-memory format. Each can be selected for what it's best at. I'll repeat here that in my opinion a Text package should be good at handling text, human text, from whatever country. If I need to handle large streams of ASCII I'll use something else. :) Cheers, -Tako

On Aug 17, 1:55 pm, Tako Schotanus
I'll repeat here that in my opinion a Text package should be good at handling text, human text, from whatever country. If I need to handle large streams of ASCII I'll use something else.
I would mostly agree. However, a key use case for Text in my view is web applications. A typical web page is composed of mostly ASCII tags interspersed with (often) non-ASCII content. I would argue that it is crucial that Text work well in this case. "Well" in this context means "efficient enough to be used as the standard format for most web applications". I do not know enough about text algorithms to recommend a more detailed meaning for "efficient enough". Kevin

Ketil Malde wrote:
I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages.
Quite true.
[...speculative calculation from which we conclude that] a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16.
Could be. We really need data on that. If it's practical to maintain different backends with identical public APIs and different internal encodings, that would be the best. After a few years of widespread usage, we would know a lot more. Regards, Yitz

Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
UTF-16 "segments" in its list of strict text elements :) Then big chunks of
western text will be encoded efficiently, and same with CJK! Not sure what
to do about strict Data.Text though :)
On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde
Michael Snoyman
writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16.
With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format):
I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms.
Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. Thus, the information density of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to 15/16 vs 6/16 for UTF-16. In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16.
-k -- If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Someone mentioned earlier that IHHO all of this messing around with
encodings and conversions should be handled transparently, and I guess
you could do something like have the internal representation be along
the lines of Either UTF8 UTF16 (or perhaps even more encodings), and
then implement every function in the API equivalently for each
representation (with only the performance characteristics differing),
with input/output functions being specialized for each encoding, and
then only do a conversion when necessary or explicitly requested. But
I assume that would have other problems (like the implicit conversions
causing hard-to-track-down performance bugs when they're triggered
unintentionally).
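A minimal sketch of such a two-representation type (hypothetical names; only code-point counting is shown, with each branch specialized to its encoding so no conversion is needed):

```haskell
import qualified Data.ByteString as B
import Data.Bits ((.&.))
import Data.Word (Word16)

-- Hypothetical text type carrying either of two internal encodings.
data MixedText = U8 B.ByteString   -- UTF-8 bytes
               | U16 [Word16]      -- UTF-16 code units

-- Code-point count, dispatching on the representation.
lengthMT :: MixedText -> Int
lengthMT (U8 bs)  =
  -- every code point starts with a byte that is not a 10xxxxxx continuation
  B.length (B.filter (\w -> w .&. 0xC0 /= 0x80) bs)
lengthMT (U16 ws) =
  -- every code point contributes exactly one unit that is not a low surrogate
  length (filter (\w -> w < 0xDC00 || w > 0xDFFF) ws)
```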
On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles
Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and UTF-16 "segments" in it list of strict text elements :) Then big chunks of western text will be encoded efficiently, and same with CJK! Not sure what to do about strict Data.Text though :)
On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde
wrote: Michael Snoyman
writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16.
With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format):
I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms.
Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. Thus, the information density of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to 15/16 vs 6/16 for UTF-16. In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16.
-k -- If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Work is punishment for failing to procrastinate effectively.

(Actually, this seems more like a job for a type class.)
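Expressed as a type class, the dispatch might look like this (TextLike and ASCIIText are hypothetical names; the example instance assumes pure ASCII content):

```haskell
import qualified Data.ByteString as B

-- Hypothetical class: one API, several internal representations.
class TextLike t where
  packT   :: String -> t
  unpackT :: t -> String
  lengthT :: t -> Int
  lengthT = length . unpackT  -- default; representations can do better

-- Example instance assuming pure ASCII content.
newtype ASCIIText = ASCIIText B.ByteString

instance TextLike ASCIIText where
  packT = ASCIIText . B.pack . map (fromIntegral . fromEnum)
  unpackT (ASCIIText bs) = map (toEnum . fromIntegral) (B.unpack bs)
  lengthT (ASCIIText bs) = B.length bs  -- one byte per character, no decoding
```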
2010/8/17 Gábor Lehel
Someone mentioned earlier that IHHO all of this messing around with encodings and conversions should be handled transparently, and I guess you could do something like have the internal representation be along the lines of Either UTF8 UTF16 (or perhaps even more encodings), and then implement every function in the API equivalently for each representation (with only the performance characteristics differing), with input/output functions being specialized for each encoding, and then only do a conversion when necessary or explicitly requested. But I assume that would have other problems (like the implicit conversions causing hard-to-track-down performance bugs when they're triggered unintentionally).
On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles
wrote: Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and UTF-16 "segments" in it list of strict text elements :) Then big chunks of western text will be encoded efficiently, and same with CJK! Not sure what to do about strict Data.Text though :)
On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde
wrote: Michael Snoyman
writes: As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16.
With the danger of sounding ... alphabetist? as well as belaboring a point I agree is irrelevant (the storage format):
I'd point out that it seems at least as unfair to optimize for CJK at the cost of Western languages. UTF-16 uses two bytes for (most) CJK ideograms, and (all, I think) characters in Western and other phonetic scripts. UTF-8 uses one to two bytes for a lot of Western alphabets, but three for CJK ideograms.
Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram, while an ASCII letter is about six bits. Thus, the information density of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to 15/16 vs 6/16 for UTF-16. In other words a given document translated between Chinese and English should occupy roughly the same space in UTF-8, but be 2.5 times longer in English for UTF-16.
-k -- If I haven't seen further, it is by standing in the footprints of giants _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Work is punishment for failing to procrastinate effectively.
-- Work is punishment for failing to procrastinate effectively.

On Tue, Aug 17, 2010 at 03:21:32PM +0200, Daniel Peebles wrote:
Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and UTF-16 "segments" in it list of strict text elements :) Then big chunks of western text will be encoded efficiently, and same with CJK! Not sure what to do about strict Data.Text though :)
If space is really a concern, there should be a variant that uses LZO or some other fast compression algorithm that allows concatenation as the back end.

<ranty thing to follow> That said, there is never a reason to use UTF-16; it is a vestigial remnant from the brief period when it was thought 16 bits would be enough for the Unicode standard, and any defense of it nowadays is after-the-fact justification for having accidentally standardized on it back in the day. When people chose to use the 16-bit representation, it was because they wanted a one-to-one mapping between codepoints and units of computation, which has many advantages. However, this is no longer true: if the one-to-one mapping is important then nowadays you use UCS-4; otherwise, you use UTF-8. If space is very important then you work with compressed text. In practice a mix of the two is fairly ideal.

John -- John Meacham - ⑆repetae.net⑆john⑈ - http://notanumber.net/

On Wed, Aug 18, 2010 at 2:12 AM, John Meacham
<ranty thing to follow> That said, there is never a reason to use UTF-16, it is a vestigial remanent from the brief period when it was thought 16 bits would be enough for the unicode standard, any defense of it nowadays is after the fact justification for having accidentally standardized on it back in the day.
This is false. Text uses UTF-16 internally as early benchmarks indicated that it was faster. See Tom Harper's response in the other thread that was spawned off this thread by Ketil.

Text continues to be UTF-16 today because

* no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and
* no one has written a patch that converts Text to use UTF-8 internally.

I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere. Cheers, Johan

Johan Tibell
Text continues to be UTF-16 today because
* no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and * no one has written a patch that converts Text to use UTF-8 internally.
I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.
This was my impression as well. If someone desperately wants Text to use UTF-8 internally, why not help code such a change rather than just waving the suggestion around in the air? -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Wed, Aug 18, 2010 at 2:39 PM, Johan Tibell
On Wed, Aug 18, 2010 at 2:12 AM, John Meacham
wrote: <ranty thing to follow> That said, there is never a reason to use UTF-16, it is a vestigial remanent from the brief period when it was thought 16 bits would be enough for the unicode standard, any defense of it nowadays is after the fact justification for having accidentally standardized on it back in the day.
This is false. Text uses UTF-16 internally as early benchmarks indicated that it was faster. See Tom Harper's response to the other thread that was spawned of this thread by Ketil.
Text continues to be UTF-16 today because
* no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and * no one has written a patch that converts Text to use UTF-8 internally.
I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.
Here's my response to the two points:
* I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (I'll get to that in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype, text will get fixed," which is quite underwhelming.

* Since the prevailing attitude has been such a disregard for any facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meanwhile, Jasper has released blaze-builder, which does an amazing job at producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.

Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking. Michael

On 18 August 2010 15:04, Michael Snoyman
For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.
And the answer to that is: yes, but only if we have good reason to believe it will actually be faster, and that's where we're most interested in benchmarks rather than hand waving. As Johan and others have said, the original choice to use UTF-16 was based on benchmarks showing it was faster (than UTF-8 or UTF-32). So if we want to counter that then we need either to argue that these were the wrong choice of benchmarks, ones that do not reflect real usage, or that with better implementations the balance would shift. Now there is an interesting argument to claim that we spend more time shovelling strings about than we do actually processing them in any interesting way, and therefore that we should pick benchmarks that reflect that. This would then shift the balance to favour an internal representation identical to some particular popular external representation --- even if that internal representation is slower for many processing tasks. Duncan

Hi Michael,
On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman
Here's my response to the two points:
* I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (I'll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype, text will get fixed," which is quite underwhelming.
I went through all the emails you sent with the subject "String vs ByteString" and "Re: String vs ByteString" and I can't find a single benchmark. I do agree with you that
* UTF-8 is more compact than UTF-16, and
* UTF-8 is by far the most used encoding on the web,
and that establishes a reasonable *theoretical* argument for why switching to UTF-8 might be faster. What I'm looking for is a program that shows a big difference so we can validate the hypothesis. As Duncan mentioned, we already ran some benchmarks early on that showed the opposite. Someone posted a benchmark earlier in this thread and Bryan addressed the issue raised by that poster. We want more of those.
* Since the prevailing attitude has been such a disregard to any facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meanwhile, Jasper has released blaze-builder which does an amazing job at producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.
I'm not sure this discussion has surfaced that many facts. What we do have is plenty of theories. I can easily add some more:
* GHC is not doing a good job laying out the branches in the validation code that does arithmetic on the input byte sequence, to validate the input and compute the Unicode code point that should be streamed using fusion.
* The differences between text's and bytestring's fusion frameworks get in the way of some optimization in GHC (text uses a more sophisticated fusion framework that handles some cases bytestring can't, according to Bryan).
* Lingering space leaks are hurting performance (Bryan plugged one already).
* The use of a polymorphic loop state in the fusion framework gets in the way of unboxing.
* Extraneous copying in the Handle implementation slows down I/O.
All of these are plausible reasons why Text might perform worse than ByteString. We need to find out which ones are true by benchmarking and looking at the generated Core.
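[Editor's note: deciding among these theories requires measuring. A crude, hedged illustration of the kind of micro-benchmark involved — criterion is the proper tool, but this version uses only base plus the two libraries under discussion, and the chosen operations are arbitrary examples, not the library's benchmark suite:]

```haskell
-- Crude CPU-time comparison of Text vs ByteString operations.
-- criterion would give proper statistics; this is only a sketch.
import System.CPUTime (getCPUTime)
import Text.Printf (printf)
import qualified Data.ByteString.Char8 as B
import qualified Data.Text as T

timeIt :: String -> (a -> Int) -> a -> IO ()
timeIt label f x = do
  start <- getCPUTime
  let r = f x
  end <- r `seq` getCPUTime          -- force the result before stopping the clock
  printf "%-18s result=%d  %.3f ms\n" label r
         (fromIntegral (end - start) / 1e9 :: Double)

main :: IO ()
main = do
  let n = 500000
      t = T.replicate n (T.pack "ab")  -- one million Chars
      b = B.replicate (2 * n) 'a'      -- one million bytes
  t `seq` b `seq` return ()            -- build the inputs outside the timings
  timeIt "text/length"       T.length t
  timeIt "bytestring/length" B.length b
  timeIt "text/filter"       (T.length . T.filter (/= 'a')) t
  timeIt "bytestring/filter" (B.length . B.filter (/= 'a')) b
```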
Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.
I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on. Cheers, Johan

On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell
Hi Michael,
On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman
wrote: Here's my response to the two points:
* I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (I'll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype, text will get fixed," which is quite underwhelming.
I went through all the emails you sent with the subject "String vs ByteString" and "Re: String vs ByteString" and I can't find a single benchmark. I do agree with you that
* UTF-8 is more compact than UTF-16, and
* UTF-8 is by far the most used encoding on the web.
and that establishes a reasonable *theoretical* argument for why switching to UTF-8 might be faster.
What I'm looking for is a program that shows a big difference so we can validate the hypothesis. As Duncan mentioned, we already ran some benchmarks early on that showed the opposite. Someone posted a benchmark earlier in this thread and Bryan addressed the issue raised by that poster. We want more of those.
Sorry, I thought I'd sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic: http://www.snoyman.com/blog/entry/bigtable-benchmarks/ http://www.snoyman.com/blog/entry/optimizing-hamlet/ Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately blaze-html/blaze-builder. It could be that these were flaws in text that are correctable and have nothing to do with UTF-16; however, it will be difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact since it wouldn't be using Bryan's fusion logic. * Since the prevailing attitude has been such a disregard to any facts shown
thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meanwhile, Jasper has released blaze-builder which does an amazing job at producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.
I'm not sure this discussion has surfaced that many facts. What we do have is plenty of theories. I can easily add some more:
* GHC is not doing a good job laying out the branches in the validation code that does arithmetic on the input byte sequence, to validate the input and compute the Unicode code point that should be streamed using fusion.
* The differences in text and bytestring's fusion framework get in the way of some optimization in GHC (text uses a more sophisticated fusion frameworks that handles some cases bytestring can't according to Bryan).
* Lingering space leaks is hurting performance (Bryan plugged one already).
* The use of a polymorphic loop state in the fusion framework gets in the way of unboxing.
* Extraneous copying in the Handle implementation slows down I/O.
All of these are plausible reasons why Text might perform worse than ByteString. We need to find out which ones are true by benchmarking and looking at the generated Core.
Now if you tell me that text would consider applying a UTF-8 patch, that
would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.
I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.
I think that's the main issue, and one that Duncan nailed on the head: we have to think about what the important benchmarks are. For Hamlet, I need fast UTF-8 bytestring generation. I don't care at all about algorithmic speed for split texts, as an example. My (probably uneducated) guess is that UTF-16 tends to perform many in-memory operations faster since almost all characters are represented as 16 bits, while the big benefit of UTF-8 is in reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as I said, that's an (uneducated) guess.
Some people have been floating the idea of multiple text packages. I personally would *not* want to go down that road, but it might be the only approach that allows top performance for all use cases. As is, I'm quite happy using blaze-builder for Hamlet. Michael
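[Editor's note: for readers who don't follow the links, the BigTable benchmark renders a large HTML table of integers. A rough reconstruction of its shape — the real code is in the posts above; `bigTable` and the 1000x10 dimensions are illustrative — using text's lazy Builder:]

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- Sketch of the BigTable benchmark: render a 1000-row, 10-column HTML
-- table of integers.  The table is assembled as one Builder so appends
-- are cheap, then flattened to lazy Text at the end.
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as TLB
import Data.Text.Lazy.Builder.Int (decimal)

bigTable :: [[Int]] -> TL.Text
bigTable rows = TLB.toLazyText $
    "<table>" <> foldMap row rows <> "</table>"
  where
    row cs = "<tr>" <> foldMap cell cs <> "</tr>"
    cell n = "<td>" <> decimal n <> "</td>"

main :: IO ()
main = print (TL.length (bigTable (replicate 1000 [1 .. 10])))
```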

On Wed, Aug 18, 2010 at 10:12 AM, Michael Snoyman
While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:
http://www.snoyman.com/blog/entry/bigtable-benchmarks/ http://www.snoyman.com/blog/entry/optimizing-hamlet/
Even though your benchmark didn't explicitly come up in this thread, Johan and I spent some time improving the performance of Text for it. As a result, in darcs HEAD, Text is faster than String, but slower than ByteString. I'd certainly like to close that gap more aggressively. If the other contributors to this thread took just one minute to craft a benchmark they cared about for every ten minutes they spent producing hot air, we'd be a lot better off.
It could be that these were flaws in text that are correctable and have nothing to do with UTF-16;
Since the internal representation used by text is completely opaque, we could of course change it if necessary, with no user-visible consequences. I've yet to see any data that suggests that it's specifically UTF-16 that is related to any performance shortfalls, however. Some people have been floating the idea of multiple text packages. I
personally would *not* want to go down that road, but it might be the only approach that allows top performance for all use cases.
I'd be surprised if that proves necessary.

On Wed, Aug 18, 2010 at 7:12 PM, Michael Snoyman
On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell
wrote: Sorry, I thought I'd sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:
http://www.snoyman.com/blog/entry/bigtable-benchmarks/ http://www.snoyman.com/blog/entry/optimizing-hamlet/
Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately blaze-html/blaze-builder. It could be that these were flaws in text that are correctable and have nothing to do with UTF-16; however, it will be difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact since it wouldn't be using Bryan's fusion logic.
Those are great. As Bryan mentioned, we've already improved performance and I think I know how to improve it further. I appreciate that it's difficult to isolate the UTF-8/UTF-16 divide. I think the approach we're trying at the moment is looking at benchmarks, improving performance, and repeating until we can't improve anymore. It could be the case that we get a benchmark where the performance difference between bytestring and text cannot be explained/fixed by factors other than changing the internal encoding. That would be strong evidence that we should try to switch the internal encoding. We haven't seen any such benchmarks yet. As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn't find anywhere that input ByteStrings are actually validated. If they're not, it's a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.
I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was
faster on some set of benchmarks (starting with the ones already in the library) that we agree on.
I think that's the main issue, and one that Duncan nailed on the head: we have to think about what are the important benchmarks. For Hamlet, I need fast UTF-8 bytestring generation. I don't care at all about algorithmic speed for split texts, as an example. My (probably uneducated) guess is that UTF-16 tends to perform many operations in memory faster since almost all characters are represented as 16 bits, while the big benefit for UTF-8 is in reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as I said, that's an (uneducated) guess.
I agree. Let's create some more benchmarks. For example, lately I've been working on a benchmark, inspired by a real world problem, where I iterate over the lines in a ~500 MB file, encoded in UTF-8, inserting each line into a Data.Map and doing a bunch of further processing on it (such as splitting the strings into words). This tests text I/O throughput, memory overhead, performance of string comparison, etc. We already have benchmarks for reading files (in UTF-8) in several different ways (lazy I/O and iteratee-style folds). Boil down the things you care about into a self-contained benchmark and send it to this list or put it somewhere where we can retrieve it. Cheers, Johan
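[Editor's note: a self-contained sketch of the benchmark shape described here. The file name and the exact "further processing" are placeholders; note also that `Data.Text.IO.readFile` decodes using the locale encoding, so a version that insists on UTF-8 would read a ByteString and decode it explicitly.]

```haskell
-- Sketch: read a large text file line by line, insert each line into a
-- Data.Map, and split lines into words.
import qualified Data.Map.Strict as M
import qualified Data.Text as T
import qualified Data.Text.IO as TIO

-- Count how many times each line occurs.
indexLines :: [T.Text] -> M.Map T.Text Int
indexLines = M.fromListWith (+) . map (\l -> (l, 1))

main :: IO ()
main = do
  contents <- TIO.readFile "big-file.txt"   -- hypothetical input file
  let ls = T.lines contents
      m  = indexLines ls
      nw = sum (map (length . T.words) ls)  -- the word-splitting step
  print (M.size m, nw)
```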

On Wed, Aug 18, 2010 at 11:58 PM, Johan Tibell
As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn't find anywhere that input ByteStrings are actually validated. If they're not, it's a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.
As far as I can tell, Blaze *never* validates input ByteStrings. The "proper" approach to inserting data into blaze is either via String or Text. I requested that Jasper provide an unsafeByteString function in Blaze for Hamlet's usage: Hamlet does the UTF-8 encoding at compile time and is able to gain a little extra performance boost. If you want to properly validate bytestrings before inputting them, I believe the best approach would be to use utf8-string or text to read in the bytestrings, but Jasper may have a better approach. Michael
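[Editor's note: one way to do the validation step described here is text's `decodeUtf8'`, which returns `Left` on malformed input instead of throwing. `validUtf8` is my name for the wrapper, not part of any package.]

```haskell
-- Validate that a ByteString is well-formed UTF-8 by decoding it.
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

validUtf8 :: B.ByteString -> Maybe T.Text
validUtf8 = either (const Nothing) Just . decodeUtf8'

main :: IO ()
main = do
  print (validUtf8 (B.pack [104, 105]))    -- valid ASCII "hi"
  print (validUtf8 (B.pack [0xC3, 0x28]))  -- malformed: lead byte with no continuation
```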

Michael Snoyman wrote:
Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data
True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data. Right now we just have our intuitions based on anecdotal evidence and whatever years of experience we have in IT. For the anecdotal evidence, I really wish that people from CJK countries were better represented in this discussion. Unfortunately, Haskell is less prevalent in CJK countries, and there is somewhat of a language barrier.
I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
I agree, I wish we had better numbers.
even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead... As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
Again, I agree that some real data would be great. The problem is, I'm not sure if there is anyone in this discussion who is qualified to come up with anything even close to a fair random sampling or a CJK website that is representative. As far as I can tell, most of us participating in this discussion have absolutely zero perspective of what computing is like in CJK countries.
As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage.
No, there is a third: using an API that results in robust, readable and maintainable code even in the face of changing encoding requirements. Unless you have proof that the difference in performance between that API and an API with a hard-wired encoding is the factor that is causing your particular application to fail to meet its requirements, the hard-wired approach is guilty of aggravated premature optimization. So for example, UTF-8 is an important option to have in a web toolkit. But if that's the only option, that web toolkit shouldn't be considered a general-purpose one in my opinion.
I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case.
Well, to start with, all MS Word documents are in UTF-16. There are a few of those around, I think. Most applications - in some sense of "most" - store text in UTF-16. Again, without any data, my intuition tells me that most of the text data stored in the world's files is in UTF-16. There is currently not much Haskell code that reads those formats directly, but I think that will be changing as usage of Haskell in the real world picks up.
We can't consider a CJK encoding for text,
Not as a default, certainly not as the only option. But nice to have as a choice.
What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8,
In Western countries. Regards, Yitz

On Tue, Aug 17, 2010 at 2:23 PM, Yitzchak Gale
Michael Snoyman wrote:
Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data
True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data.
To my knowledge the data we have about prevalence of encoding on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph, but we try our best to get them all. You're unlikely to get a better estimate anywhere else; few organizations have the machinery required to crawl most of the web. -- Johan

Johan Tibell wrote:
To my knowledge the data we have about prevalence of encoding on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph, but we try our best to get them all. You're unlikely to get a better estimate anywhere else; few organizations have the machinery required to crawl most of the web.
There was a study recently on this. They found that there are four main parts of the Internet: * a densely connected core, where from any site you can get to any other * an "in cone", from which you can reach the core (but not other in-cone members, since then you'd both be in the core) * an "out cone", which can be reached from the core (but which cannot reach each other) * and, unconnected islands The surprising part is they found that all four parts are approximately the same size. I forget the exact numbers, but they're all 25+/-5%. This implies that an exhaustive crawl of the web would require having about 50% of all websites as seeds (the in-cone plus the islands). If we're only interested in a representative sample, then we could get by with fewer. However, that depends a lot on the definition of "representative". And we can't have an accurate definition of representative without doing the entire crawl at some point in order to discover the appropriate distributions. Then again, distributions change over time... Thus, I would guess that Google only has 50~75% of the net: the core, the out-cone, and a fraction of the islands and in-cone. -- Live well, ~wren

On 18 August 2010 12:12, wren ng thornton
Johan Tibell wrote:
To my knowledge the data we have about prevalence of encoding on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph, but we try our best to get them all. You're unlikely to get a better estimate anywhere else; few organizations have the machinery required to crawl most of the web.
There was a study recently on this. They found that there are four main parts of the Internet:
* a densely connected core, where from any site you can get to any other * an "in cone", from which you can reach the core (but not other in-cone members, since then you'd both be in the core) * an "out cone", which can be reached from the core (but which cannot reach each other) * and, unconnected islands
I'm guessing here that you're referring to what I've heard called the "hidden web": databases, etc. that require sign-ins, etc. (as stuff that isn't in the core, to differing degrees: some of these databases are indexed by google but you can't actually read them without an account, etc.) ? -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Ivan Lazar Miljenovic wrote:
On 18 August 2010 12:12, wren ng thornton
wrote: Johan Tibell wrote: To my knowledge the data we have about prevalence of encoding on the web is accurate. We crawl all pages we can get our hands on, by starting at some set of seeds and then following all the links. You cannot be sure that you've reached all web sites as there might be cliques in the web graph, but we try our best to get them all. You're unlikely to get a better estimate anywhere else; few organizations have the machinery required to crawl most of the web.
There was a study recently on this. They found that there are four main parts of the Internet:
* a densely connected core, where from any site you can get to any other * an "in cone", from which you can reach the core (but not other in-cone members, since then you'd both be in the core) * an "out cone", which can be reached from the core (but which cannot reach each other) * and, unconnected islands
I'm guessing here that you're referring to what I've heard called the "hidden web": databases, etc. that require sign-ins, etc. (as stuff that isn't in the core, to differing degrees: some of these databases are indexed by google but you can't actually read them without an account, etc.) ?
Not so far as I recall. I'd have to find a copy of the paper to be sure though. Because the metric used was graph connectivity, if those hidden pages have links out into non-hidden pages (e.g., the login page), then they'd be counted in the same way as the non-hidden pages reachable from them. -- Live well, ~wren

On Wed, Aug 18, 2010 at 4:12 AM, wren ng thornton
There was a study recently on this. They found that there are four main parts of the Internet:
* a densely connected core, where from any site you can get to any other * an "in cone", from which you can reach the core (but not other in-cone members, since then you'd both be in the core) * an "out cone", which can be reached from the core (but which cannot reach each other) * and, unconnected islands
The surprising part is they found that all four parts are approximately the same size. I forget the exact numbers, but they're all 25+/-5%.
This implies that an exhaustive crawl of the web would require having about 50% of all websites as seeds (the in-cone plus the islands). If we're only interested in a representative sample, then we could get by with fewer. However, that depends a lot on the definition of "representative". And we can't have an accurate definition of representative without doing the entire crawl at some point in order to discover the appropriate distributions. Then again, distributions change over time...
Thus, I would guess that Google only has 50~75% of the net: the core, the out-cone, and a fraction of the islands and in-cone.
That's an interesting result. However, if you weigh each page with its page views you'll probably find that Google (and other search engines) probably cover much more than that since page views on sites tend to follow a power-law distribution. -- Johan

On Tue, Aug 17, 2010 at 3:23 PM, Yitzchak Gale
Michael Snoyman wrote:
Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data
True, I haven't seen any - except for Google, which I don't believe is accurate. I would like to see some good unbiased data.
Right now we just have our intuitions based on anecdotal evidence and whatever years of experience we have in IT.
For the anecdotal evidence, I really wish that people from CJK countries were better represented in this discussion. Unfortunately, Haskell is less prevalent in CJK countries, and there is somewhat of a language barrier.
I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
I agree, I wish we had better numbers.
even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead... As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
Again, I agree that some real data would be great.
The problem is, I'm not sure if there is anyone in this discussion who is qualified to come up with anything even close to a fair random sampling or a CJK website that is representative. As far as I can tell, most of us participating in this discussion have absolutely zero perspective of what computing is like in CJK countries.
I won't call this a scientific study by any stretch of the imagination, but I did a quick test on the www.qq.com homepage. The original file encoding was GB2312; here are the file sizes:
GB2312: 193014 bytes
UTF-8:  200044 bytes
UTF-16: 371938 bytes
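[Editor's note: the effect behind these numbers is easy to reproduce in miniature with text's encoders — the sample strings below are mine. ASCII-heavy markup doubles in size under UTF-16, while Han characters cost 3 bytes each in UTF-8 against 2 in UTF-16.]

```haskell
-- Compare the encoded size of the same Text in UTF-8 and UTF-16.
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

-- (UTF-8 size, UTF-16 size) in bytes.
sizes :: T.Text -> (Int, Int)
sizes t = (B.length (encodeUtf8 t), B.length (encodeUtf16LE t))

main :: IO ()
main = do
  print (sizes (T.pack "<td>hello</td>"))  -- ASCII markup: UTF-8 is half the size
  print (sizes (T.pack "\x4F60\x597D"))    -- two Han characters: UTF-16 is smaller
```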
As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage.
No, there is a third: using an API that results in robust, readable and maintainable code even in the face of changing encoding requirements. Unless you have proof that the difference in performance between that API and an API with a hard-wired encoding is the factor that is causing your particular application to fail to meet its requirements, the hard-wired approach is guilty of aggravated premature optimization.
So for example, UTF-8 is an important option to have in a web toolkit. But if that's the only option, that web toolkit shouldn't be considered a general-purpose one in my opinion.
I'm not talking about API changes here; the topic at hand is the internal representation of the stream of characters used by the text package. That is currently UTF-16; I would argue switching to UTF8.
I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case.
Well, to start with, all MS Word documents are in UTF-16. There are a few of those around, I think. Most applications - in some sense of "most" - store text in UTF-16.
Again, without any data, my intuition tells me that most of the text data stored in the world's files are in UTF-16. There is currently not much Haskell code that reads those formats directly, but I think that will be changing as usage of Haskell in the real world picks up.
I was referring to text files, not binary files with text embedded within them. While we might use the text package to deal with the data from a Word doc once in memory, we would almost certainly need to use ByteString (or binary perhaps) to actually parse the file. But at the end of the day, you're right: there would be an encoding penalty at a certain point, just not on the entire file.
We can't consider a CJK encoding for text,
Not as a default, certainly not as the only option. But nice to have as a choice.
I think you're missing the point at hand: I don't think *anyone* is opposed to offering encoders/decoders for all the multitude of encoding types out there. In fact, I believe the text-icu package already supports every encoding type under discussion. The question is the internal representation for text, for which a language-specific encoding is *not* a choice, since it does not support all Unicode code points.
Michael

On Tue, Aug 17, 2010 at 06:12, Michael Snoyman
I'm not talking about API changes here; the topic at hand is the internal representation of the stream of characters used by the text package. That is currently UTF-16; I would argue switching to UTF8.
The Data.Text.Foreign module is part of the API, and is currently hardcoded to use UTF-16. Any change of the internal encoding will require breaking this module's API.
We can't consider a CJK encoding for text,
Not as a default, certainly not as the only option. But nice to have as a choice.
I think you're missing the point at hand: I don't think *anyone* is opposed to offering encoders/decoders for all the multitude of encoding types out there. In fact, I believe the text-icu package already supports every encoding type under discussion. The question is the internal representation for text, for which a language-specific encoding is *not* a choice, since it does not support all Unicode code points. Michael
The reason many Japanese and Chinese users reject UTF-8 isn't space constraints (UTF-8 and UTF-16 are roughly equal there); it's that they reject Unicode itself. Shift-JIS and the various Chinese encodings both contain Han characters which are missing from Unicode, either because of the Han unification or because they simply were not considered important enough to include (yet there's a codepage for Linear B...). Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing the text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc).

On Tue, Aug 17, 2010 at 6:19 PM, John Millikin
Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc).
This design introduces overhead, as each function call needs to dispatch on the encoding, which is unlikely to be known statically. I don't know if this matters or not (yet another thing that needs to be measured). -- Johan

Quoth John Millikin
Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc).
Ruby actually comes from the CJK world in a way, doesn't it? Even if efficient per-encoding manipulation is a tough nut to crack, it at least avoids the fixed cost of bulk decoding, so an application designer doesn't need to think about the pay-off for a correct text approach vs. `binary'/ASCII, and the language/library designer doesn't need to think about whether genome data is a representative case etc. If Haskell had the development resources to make something like this work, would it actually take the form of a Haskell-level type like that - data Text = (Encoding, ByteString)? I mean, I know that's just a very clear and convenient way to express it for the purposes of the present discussion, and actual design is a little premature - ... but, I think you could argue that from the Haskell level, `Text' should be a single type, if the encoding differences aren't semantically interesting. Donn Cave, donn@avvanta.com

On Tue, Aug 17, 2010 at 9:30 PM, Donn Cave
Quoth John Millikin
, Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc).
Ruby actually comes from the CJK world in a way, doesn't it?
Even if efficient per-encoding manipulation is a tough nut to crack, it at least avoids the fixed cost of bulk decoding, so an application designer doesn't need to think about the pay-off for a correct text approach vs. `binary'/ASCII, and the language/library designer doesn't need to think about whether genome data is a representative case etc.
Remember that the cost of decoding is O(n) no matter what encoding is used internally as you always have to validate when going from ByteString to Text. If the external and internal encoding don't match then you also have to copy the bytes into a new buffer, but that is only one allocation (a pointer increment with a semi-space collector) and the copy is cheap since the data is in cache. -- Johan
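Johan's point that decoding is O(n) regardless of internal encoding comes down to the validation pass, which touches every byte exactly once. A minimal single-pass sketch of that idea (simplified: the names are invented for illustration, and it does not reject overlong sequences or surrogate code points, which a real validator must):

```haskell
import Data.Word (Word8)

-- A UTF-8 continuation byte is 10xxxxxx.
isCont :: Word8 -> Bool
isCont b = b >= 0x80 && b < 0xC0

-- Single-pass structural UTF-8 check: each byte is inspected once,
-- so the cost is O(n) in the length of the input.
validUtf8 :: [Word8] -> Bool
validUtf8 [] = True
validUtf8 (b:bs)
  | b < 0x80              = validUtf8 bs      -- ASCII
  | b >= 0xC2 && b < 0xE0 = takeCont 1 bs     -- 2-byte sequence
  | b >= 0xE0 && b < 0xF0 = takeCont 2 bs     -- 3-byte sequence
  | b >= 0xF0 && b < 0xF5 = takeCont 3 bs     -- 4-byte sequence
  | otherwise             = False             -- stray/illegal lead byte
  where
    takeCont n rest = case splitAt n rest of
      (cs, rest') -> length cs == n && all isCont cs && validUtf8 rest'

main :: IO ()
main = do
  print (validUtf8 [0x68, 0x69])        -- plain ASCII: valid
  print (validUtf8 [0xE4, 0xB8, 0xAD])  -- U+4E2D in UTF-8: valid
  print (validUtf8 [0xE4, 0xB8])        -- truncated sequence: invalid
```

When the external and internal encodings match, a library can stop after this pass; otherwise it must additionally copy or transcode, which is the extra allocation Johan mentions.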

On Tue, Aug 17, 2010 at 12:30, Donn Cave
If Haskell had the development resources to make something like this work, would it actually take the form of a Haskell-level type like that - data Text = (Encoding, ByteString)? I mean, I know that's just a very clear and convenient way to express it for the purposes of the present discussion, and actual design is a little premature - ... but, I think you could argue that from the Haskell level, `Text' should be a single type, if the encoding differences aren't semantically interesting.
It should be possible to create a Ruby-style Text in Haskell, using the existing Text API. The constructor would be something like << data Text = Text !Encoding !ByteString >>, but there's no need to export it. The only significant improvements, performance-wise, would be that 1) "encoding" text to its internal encoding would be O(1) and 2) "decoding" text would only have to perform validation, instead of validation+copy+stream fusion muck. Downside: lazy decoding makes it very difficult to reason about failures, since even simple operations like 'append' might fail if you try to append two texts with mutually-incompatible characters. In any case, I suspect getting Haskell itself to support non-Unicode characters is much more difficult than writing an appropriate Text type.
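A minimal sketch of the tagged representation John describes (all names here are hypothetical; a real version would validate the bytes in `fromBytes` and support transcoding rather than just failing):

```haskell
import qualified Data.ByteString as B

-- Hypothetical encoding tags; a real library would have many more.
data Encoding = UTF8 | ShiftJIS | GB18030
  deriving (Eq, Show)

-- Ruby-style text: the bytes are kept in their original encoding.
data Text = Text !Encoding !B.ByteString
  deriving (Eq, Show)

-- "Decoding" text already in its internal encoding is O(1) apart from
-- validation; no copy is made because the bytes are stored as-is.
fromBytes :: Encoding -> B.ByteString -> Text
fromBytes = Text  -- a real version would validate here

-- Even 'append' can fail when the operands use incompatible encodings,
-- which is the reasoning-about-failures downside mentioned above.
append :: Text -> Text -> Maybe Text
append (Text e1 b1) (Text e2 b2)
  | e1 == e2  = Just (Text e1 (b1 <> b2))
  | otherwise = Nothing  -- would require transcoding

main :: IO ()
main = do
  print (append (fromBytes UTF8 (B.pack [104])) (fromBytes UTF8 (B.pack [105])))
  print (append (fromBytes UTF8 B.empty) (fromBytes ShiftJIS B.empty))
</imports>
```

Note how the constructor is exactly << Text !Encoding !ByteString >> as described; hiding it behind a smart constructor keeps invalid byte sequences out.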

John Millikin wrote:
The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself.
+1. This is the thing Unicode advocates don't want to admit. Until Unicode has code points for _all_ Chinese and Japanese characters, there will be active resistance to adoption. -- Live well, ~wren

John Millikin wrote:
The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself.
+1.
This is the thing Unicode advocates don't want to admit. Until Unicode has code points for _all_ Chinese and Japanese characters, there will be active resistance to adoption.
-- Live well, ~wren
For mainland Chinese websites: most that became popular during web 1.0 (5-10 years ago) are using a UTF-8-incompatible format, e.g. GB2312, for example:

* www.sina.com.cn
* www.sohu.com

They didn't switch to utf-8 probably just because they never had to. However, many of the popular websites started during web 2.0 are adopting utf-8, for example:

* renren.com (China's largest Facebook clone)
* www.kaixin001.com (China's second largest Facebook clone)
* t.sina.com.cn (an example of a Twitter clone)

These websites adopted utf-8 because (I think) most web development tools have already standardized on utf-8, and there's little reason to change it. I'm not aware of any (at least common) Chinese characters that can be represented by GB2312 but not in Unicode, since the range of GB2312 is a subset of the range of GBK, which is a subset of the range of GB18030, and GB18030 is just another encoding of Unicode. ref:

* http://en.wikipedia.org/wiki/GB_18030

-- jinjing

Jinjing Wang wrote:
John Millikin wrote:
The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself. +1.
This is the thing Unicode advocates don't want to admit. Until Unicode has code points for _all_ Chinese and Japanese characters, there will be active resistance to adoption.
[...] However, many of the popular websites started during web 2.0 are adopting utf-8
for example:
* renren.com (chinese largest facebook clone) * www.kaixin001.com (chinese second largest facebook clone) * t.sina.com.cn (an example of twitter clone)
These websites adopted utf-8 because (I think) most web development tools have already standardized on utf-8, and there's little reason change it.
Interesting. I don't know much about the politics of Chinese encodings, other than that the GB formats are/were dominant. As for the politics of Japanese encodings, last time I did web work (just at the beginning of web2.0, before they started calling it that) there was still a lot of active resistance among the Japanese. Given some of the characters folks were complaining about, I think it's more an issue of principle than practicality. Then again, the Japanese do love their language games, so obscure and archaic characters are used far more often than would be expected... Whether web2.0 has caused the Japanese to change too, I can't say. I got out of that line of work ^_^
I'm not aware of any (at least common) chinese characters that can be represented by gb2312 but not in unicode. Since the range of gb2312 is a subset of the range of gbk, which is a subset of the range of gb18030. And gb18030 is just another encoding of unicode.
All the specific characters I've seen folks complain about were very uncommon or even archaic. All the common characters are there for Japanese too. The only time I've run into issues it was for an archaic character used in a manga title. I was working on a library catalog, and was too pedantic to spell it "wrong". -- Live well, ~wren

John Millikin
The reason many Japanese and Chinese users reject UTF-8 isn't due to space constraints (UTF-8 and UTF-16 are roughly equal), it's because they reject Unicode itself.
Probably because they don't think it's complicated enough¹?
Shift-JIS and the various Chinese encodings both contain Han characters which are missing from Unicode, either due to the Han unification or simply were not considered important enough to include
Surely there's enough space left? I seem to remember some Han characters outside of the BMP, so I would have guessed this is an argument from back in the UCS-2 days. (BTW, on a long train ride, I learnt the Linear-B alphabet, and practiced writing notes to my kids. So Linear-B isn't entirely useless :-)
From casual browsing of Wikipedia, the current status in CJK-land seems to be something like this:
* China: GB2312 and its successor GB18030
* Taiwan, Macao, and Hong Kong: Big5
* Japan: Shift-JIS
* Korea: EUC-KR

It is interesting that some of these provide a lot fewer characters than Unicode. Another feature of several of them is that ASCII and e.g. kana scripts take up one byte, and ideograms take up two, which correlates with the expected width of the glyphs. Several of the pages indicate that Unicode, and mainly UTF-8, is gradually taking over. -k ¹ Those who remember Emacs in the MULE days will know what I mean. -- If I haven't seen further, it is by standing in the footprints of giants
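Ketil's recollection that some Han characters sit outside the BMP can be checked directly, since Haskell's Char (and hence Data.Text) covers the full code-point range. A quick GHC sketch (U+20B9F is just one example of a supplementary-plane Han character):

```haskell
import Data.Char (ord)

-- U+20B9F is a Han character in the Supplementary Ideographic Plane.
c :: Char
c = '\x20B9F'

main :: IO ()
main = do
  print (ord c)           -- 134047, well above 0xFFFF
  print (ord c > 0xFFFF)  -- True: needs a surrogate pair in UTF-16
```

So the "not enough space" objection really does date from the UCS-2 days; the remaining complaints are about which characters were unified or omitted, not about room.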

Hi michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the
wikipedia for Chinese.
-Andrew
On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman
On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale
wrote: Ketil Malde wrote:
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8...
I don't think the genome is typical text. And I doubt that is true if that text is in a CJK language.
I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8.
Given the growth rate of China's economy, if CJK isn't already the majority of text being processed in the world, it will be soon. I have seen media reports claiming CJK is now a majority of text data going over the wire on the web, though I haven't seen anything scientific backing up those claims. It certainly seems reasonable. I believe Google's measurements based on their own web index showing wide adoption of UTF-8 are very badly skewed due to a strong Western bias.
In that case, if we have to pick one encoding for Data.Text, UTF-16 is likely to be a better choice than UTF-8, especially if the cost is fairly low even for the special case of Western languages. Also, UTF-16 has become by far the dominant internal text format for most software and for most user platforms. Except on desktop Linux - and whether we like it or not, Linux desktops will remain a tiny minority for the foreseeable future.
I think you are conflating two points here, and ignoring some important data. Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data, but even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS and Javascript overhead. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. We can't consider a CJK encoding for text, so its prevalence is irrelevant to this topic. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by default UTF-8.
As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.
Michael
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
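Michael's proposed comparison can be approximated without fetching anything, by tallying the bytes each code point needs under the two encodings. A standard-library-only sketch (names invented; it counts per-Char encoded sizes, which is enough to see why ASCII-heavy markup favours UTF-8 while pure CJK text favours UTF-16):

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in each encoding.
utf8Bytes, utf16Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where n = ord c

utf16Bytes c
  | ord c < 0x10000 = 2  -- BMP: one code unit
  | otherwise       = 4  -- supplementary plane: surrogate pair

utf8Len, utf16Len :: String -> Int
utf8Len  = sum . map utf8Bytes
utf16Len = sum . map utf16Bytes

main :: IO ()
main = do
  print (utf8Len "hello", utf16Len "hello")                -- ASCII favours UTF-8
  print (utf8Len "\x4f60\x597d", utf16Len "\x4f60\x597d")  -- CJK favours UTF-16
```

Running this over the text of a real page (markup included) would reproduce the kind of numbers reported in this thread.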

Well, I'm not certain if it counts as a typical Chinese website, but here
are the stats:
UTF8: 64,198
UTF16: 113,160
And just for fun, after gziping:
UTF8: 17,708
UTF16: 19,367
On Wed, Aug 18, 2010 at 2:59 AM, anderson leo
Hi michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the wikipedia for Chinese.
-Andrew

More typical Chinese web sites:
www.ifeng.com (web site likes nytimes)
dzh.mop.com (community for fun)
www.csdn.net (web site for IT)
www.sohu.com (web site like yahoo)
www.sina.com (web site like yahoo)
-- Andrew
On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman
Well, I'm not certain if it counts as a typical Chinese website, but here are the stats;
UTF8: 64,198 UTF16: 113,160
And just for fun, after gziping:
UTF8: 17,708 UTF16: 19,367

Alright, here are the results for the first three in the list (please forgive
me for being lazy - I am a Haskell programmer after all):
ifeng.com:
UTF8: 299949
UTF16: 566610
dzh.mop.com:
GBK: 1866
UTF8: 1891
UTF16: 3684
www.csdn.net:
UTF8: 122870
UTF16: 217420
Seems like UTF8 is a consistent winner versus UTF16, and not much of a loser
to the native formats.
Michael
On Wed, Aug 18, 2010 at 11:01 AM, anderson leo
More typical Chinese web sites: www.ifeng.com (web site likes nytimes) dzh.mop.com (community for fun) www.csdn.net (web site for IT) www.sohu.com (web site like yahoo) www.sina.com (web site like yahoo)
-- Andrew
On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman
wrote: Well, I'm not certain if it counts as a typical Chinese website, but here are the stats;
UTF8: 64,198 UTF16: 113,160
And just for fun, after gziping:
UTF8: 17,708 UTF16: 19,367

Yitzchak Gale
I don't think the genome is typical text.
I think the typical *large* collection of text is text-encoded data, and not, for lack of a better word, literature. Genomics data is just an example. -k -- If I haven't seen further, it is by standing in the footprints of giants

On Aug 17, 2010, at 11:51 PM, Ketil Malde wrote:
Yitzchak Gale
writes: I don't think the genome is typical text.
I think the typical *large* collection of text is text-encoded data, and not, for lack of a better word, literature. Genomics data is just an example.
I have a collection of 100,000 patents I'm working with. 5.5GB of XML, most of it (US-)English text. After stripping out the XML markup, it's 4GB of text. It's a random sample from some 14 million patents I could have access to, but 100,000 was more than enough.

Hi Ketil,
On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde
Johan Tibell
writes: It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower.
I haven't benchmarked it, but I'm fairly sure that, if you try to fit a 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of RAM, UTF-16 will be slower than UTF-8. Many applications will get away with streaming over data, retaining only a small part, but some won't.
I'm not sure if this is a great example, as genome data is probably much better stored in a vector (using a few bits per "letter"). I agree that whenever one data structure will fit in the available RAM and another won't, the smaller will win. I just don't know if this case is worth spending weeks worth of work optimizing for. That's why I'd like to see benchmarks for more idiomatic use cases.
In other cases (e.g. processing CJK text, and perhap also non-Latin1 text), I'm sure it'll be faster - but my (still unsubstantiated) guess is that the difference will be much smaller, and it'll be a case of winning some and losing some - and I'd also conjecture that having 3Gb "real" text (i.e. natural language, as opposed to text-formatted data) is rare.
I would like to verify this guess. In my personal experience it's really hard to guess which changes will lead to a noticeable performance improvement. I'm probably wrong more often than I'm right. Cheers, Johan

Bryan O'Sullivan wrote:
As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?
GNU wc -m:
- en_US.UTF-8: 0.701s
text 0.7.1.0:
- lazy text: 1.959s - strict text: 3.527s
darcs HEAD:
- lazy text: 0.749s - strict text: 0.927s
When should we expect to see the HEAD stamped and numbered? After some of the recent benchmark dueling re web frameworks, I know Text got a bad rap compared to ByteString. It'd be good to stop the FUD early. Repeating the above in the announcement should help a lot. -- Live well, ~wren

wren:
Bryan O'Sullivan wrote:
As a case in point, I took the string search benchmark that Daniel shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file?
GNU wc -m:
- en_US.UTF-8: 0.701s
text 0.7.1.0:
- lazy text: 1.959s - strict text: 3.527s
darcs HEAD:
- lazy text: 0.749s - strict text: 0.927s
When should we expect to see the HEAD stamped and numbered? After some of the recent benchmark dueling re web frameworks, I know Text got a bad rap compared to ByteString. It'd be good to stop the FUD early. Repeating the above in the announcement should help a lot.
For what it's worth, for several bytestring announcements I published comprehensive function-by-function comparisons of performance on enormous data sets, until there was unambiguous evidence bytestring was faster than List. E.g. http://www.mail-archive.com/haskell@haskell.org/msg18596.html

Hello Bryan, Sunday, August 15, 2010, 10:04:01 PM, you wrote:
shared on Friday, and boiled it down to a simple test case: how long does it take to read a 31MB file? GNU wc -m:
there are even slower ways to do it if you need :) if your data aren't cached, then speed is limited by HDD. if your data are cached, it should be 20-50x faster. try cat >nul -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bryan O'Sullivan wrote:
In general, Unicode uptake is increasing rapidly: http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
These Google graphs are the oft-quoted source of Unicode's growing dominance. But the data for those graphs is taken from Google's own web indexing. Google is a U.S. company that appears to have a strong Western culture bias - viz. their recent high-profile struggles with China. Google is far from being the dominant market leader in CJK countries that they are in Western countries. Their level of understanding of those markets is clearly not the same. It could be this really is true for CJK countries as well, or it could be that the data is skewed by Google's web indexing methods. I won't believe that source until it is highly corroborated with data and opinions that are native to CJK countries, from sources that do not have a vested interest in Unicode adoption. What we have heard in the past from members of our own community in CJK countries does not agree at all with Google's claims, but that may be changing. It would be great to hear more from them. Regards, Yitz

On Sat, Aug 14, 2010 at 5:39 PM, Yitzchak Gale
It could be this really is true for CJK countries as well, or it could be that the data is skewed by Google's web indexing methods.
I also wouldn't be surprised if the picture for web-based text is quite different from that for other textual data.

Yitzchak Gale wrote:
Bryan O'Sullivan wrote:
In general, Unicode uptake is increasing rapidly: http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
These Google graphs are the oft-quoted source of Unicode's growing dominance. But the data for those graphs is taken from Google's own web indexing.
Note also that all those encodings near the bottom are remaining relatively constant. UTF8 is taking its market share from ASCII and Western European encodings, not so much from other encodings (as yet). As Bryan mentioned, Unicode doesn't have wide acceptance in CJK countries. These days, Japanese websites seem to have finally started to standardize--- in that they use HTTP/HTML headers to say which encoding the pages are in (and generally use JIS or Shift-JIS). This is a big step up from a decade ago when non-commercial sites pretty invariably required fiddling with the browser to get rid of mojibake. Japan hasn't been bitten by the i18n/l10n bug and they don't have a strong F/OSS community to drive adoption either. -- Live well, ~wren

Johan Tibell
Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text.
If you have a large amount of mostly ASCII text, use ByteString, since Data.Text uses twice the storage. Also, ByteString might make more sense if the data is in a byte-oriented encoding, and the cost of encoding and decoding utf-16 would be significant. -k -- If I haven't seen further, it is by standing in the footprints of giants

I find it disturbing that a modern programming language like Haskell
still apparently forces you to choose between a representation for
"mostly ASCII text" and Unicode.
Surely efficient Unicode text should always be the default? And if the
Unicode format used by the Text library is not efficient enough then
can't that be fixed?
Cheers,
Kevin
On Aug 13, 10:28 pm, Ketil Malde
Johan Tibell
writes: Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text.
If you have a large amount of mostly ASCII text, use ByteString, since Data.Text uses twice the storage. Also, ByteString might make more sense if the data is in a byte-oriented encoding, and the cost of encoding and decoding utf-16 would be significant.
-k -- If I haven't seen further, it is by standing in the footprints of giants

There are many libraries for many purposes. How to pick your string library in Haskell http://blog.ezyang.com/2010/08/strings-in-haskell/ kevinjardine:
I find it disturbing that a modern programming language like Haskell still apparently forces you to choose between a representation for "mostly ASCII text" and Unicode.
Surely efficient Unicode text should always be the default? And if the Unicode format used by the Text library is not efficient enough then can't that be fixed?
Cheers, Kevin

Hi Don,
With respect, I disagree with that approach.
Almost every modern programming language has one or at most two
standard representations for strings.
That includes PHP, Python, Ruby, Perl and many others. The lack of a
standard text representation in Haskell has created a crazy patchwork
of incompatible libraries requiring explicit and often inefficient
conversions to connect them together.
I expect Haskell to be higher level than those other languages so that
I can ignore the lower level details and focus on the algorithms. But
in fact the string issue forces me to deal with lower level details
than even PHP requires. I end up with a program littered with ugly
pack, unpack, toString, fromString and similar calls.
That just doesn't feel right to me.
Kevin
On Aug 13, 10:39 pm, Don Stewart
There are many libraries for many purposes.
How to pick your string library in Haskell http://blog.ezyang.com/2010/08/strings-in-haskell/
kevinjardine:
I find it disturbing that a modern programming language like Haskell still apparently forces you to choose between a representation for "mostly ASCII text" and Unicode.
Surely efficient Unicode text should always be the default? And if the Unicode format used by the Text library is not efficient enough then can't that be fixed?
Cheers, Kevin
On Aug 13, 10:28 pm, Ketil Malde
wrote: Johan Tibell
writes: Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text.
If you have a large amount of mostly ASCII text, use ByteString, since Data.Text uses twice the storage. Also, ByteString might make more sense if the data is in a byte-oriented encoding, and the cost of encoding and decoding utf-16 would be significant.
-k -- If I haven't seen further, it is by standing in the footprints of giants

Kevin Jardine
Almost every modern programming language has one or at most two standard representations for strings.
That includes PHP, Python, Ruby, Perl and many others. The lack of a standard text representation in Haskell has created a crazy patchwork of incompatible libraries requiring explicit and often inefficient conversions to connect them together.
Haskell does have a standard representation for strings, namely [Char]. Unfortunately, this sacrifices efficiency for elegance, which gives rise to the plethora of libraries.
I end up with a program littered with ugly pack, unpack, toString, fromString and similar calls.
Some of this can be avoided using a language extension that lets you overload string constants. There are always trade-offs, and no one solution will fit all: UTF-8 is space efficient while UTF-16 is time efficient (at least for certain classes of problems and data). It does seem that it should be possible to unify the various libraries wrapping bytestrings (CompactString, ByteString.UTF8 etc.), however. -k -- If I haven't seen further, it is by standing in the footprints of giants
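The extension Ketil refers to is GHC's OverloadedStrings, which routes string literals through the IsString class. A minimal sketch (assuming the IsString instances shipped by the text and bytestring packages):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as B

-- With OverloadedStrings, the same literal syntax takes on
-- whatever type the context demands, via IsString.
greetT :: T.Text
greetT = "hello"

greetB :: B.ByteString
greetB = "hello"

main :: IO ()
main = do
  print (T.length greetT)  -- 5
  B.putStrLn greetB        -- hello
```

This only removes the explicit pack calls for literals; runtime conversions between the types still have to be spelled out.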

Ketil Malde wrote:
Haskell does have a standard representation for strings, namely [Char]. Unfortunately, this sacrifices efficiency for elegance, which gives rise to the plethora of libraries.
To have the default standard representation be one that works so poorly for many common everyday tasks such as mangling large chunks of XML is a large part of the problem. Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

Kevin Jardine wrote:
With respect, I disagree with that approach.
Almost every modern programming language has one or at most two standard representations for strings.
I think having two makes sense, one for arrays of arbitrary binary bytes and one for some unicode data format, preferably UTF-8.
That includes PHP, Python, Ruby, Perl and many others. The lack of a standard text representation in Haskell has created a crazy patchwork of incompatible libraries requiring explicit and often inefficient conversions to connect them together.
I expect Haskell to be higher level than those other languages so that I can ignore the lower level details and focus on the algorithms. But in fact the string issue forces me to deal with lower level details than even PHP requires. I end up with a program littered with ugly pack, unpack, toString, fromString and similar calls.
That just doesn't feel right to me.
That is what I was trying to say when I started this thread. Thank you. Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/

Kevin Jardine
Hi Don,
With respect, I disagree with that approach.
Almost every modern programming language has one or at most two standard representations for strings.
Almost every modern programming language thinks you can whack a print statement wherever you like... ;-)
That includes PHP, Python, Ruby, Perl and many others. The lack of a standard text representation in Haskell has created a crazy patchwork of incompatible libraries requiring explicit and often inefficient conversions to connect them together.
I expect Haskell to be higher level than those other languages so that I can ignore the lower level details and focus on the algorithms. But in fact the string issue forces me to deal with lower level details than even PHP requires. I end up with a program littered with ugly pack, unpack, toString, fromString and similar calls.
So, the real issue here is that there is not yet a good abstraction over what we consider to be textual data, and instead people have to code to a specific data type. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

On Fri, Aug 13, 2010 at 4:03 PM, Ivan Lazar Miljenovic < ivan.miljenovic@gmail.com> wrote:
Kevin Jardine
writes: Hi Don,
With respect, I disagree with that approach.
Almost every modern programming language has one or at most two standard representations for strings.
Almost every modern programming language thinks you can whack a print statement wherever you like... ;-)
That includes PHP, Python, Ruby, Perl and many others. The lack of a standard text representation in Haskell has created a crazy patchwork of incompatible libraries requiring explicit and often inefficient conversions to connect them together.
I expect Haskell to be higher level than those other languages so that I can ignore the lower level details and focus on the algorithms. But in fact the string issue forces me to deal with lower level details than even PHP requires. I end up with a program littered with ugly pack, unpack, toString, fromString and similar calls.
So, the real issue here is that there is not yet a good abstraction over what we consider to be textual data, and instead people have to code to a specific data type.
Isn't this the same problem we have with numeric literals? I might even go so far as to suggest it's going to be a problem with all types of literals. Isn't it also a problem which is partially solved with the OverloadedStrings extension? http://haskell.cs.yale.edu/ghc/docs/6.12.2/html/users_guide/type-class-exten... It seems like the interface exposed by ByteString could be in a type class. At that point, would the problem be solved? Jason
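Jason's suggestion could be sketched as a class covering part of the common String/Text surface. The class and instances below are purely illustrative, not an API from any existing package:

```haskell
{-# LANGUAGE FlexibleInstances #-}
import qualified Data.Char as C
import qualified Data.Text as T

-- Hypothetical class abstracting over textual representations.
class TextLike t where
  tpack   :: String -> t
  tunpack :: t -> String
  tmap    :: (Char -> Char) -> t -> t

instance TextLike [Char] where
  tpack   = id
  tunpack = id
  tmap    = map

instance TextLike T.Text where
  tpack   = T.pack
  tunpack = T.unpack
  tmap    = T.map

-- Code written against the class is representation-agnostic:
shout :: TextLike t => t -> t
shout = tmap C.toUpper
```

A real design would need to cover far more of the API (and decide what to do about operations like locale-sensitive case conversion that don't fit a per-Char interface).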

Jason Dagit
On Fri, Aug 13, 2010 at 4:03 PM, Ivan Lazar Miljenovic <
So, the real issue here is that there is not yet a good abstraction over what we consider to be textual data, and instead people have to code to a specific data type.
Isn't this the same problem we have with numeric literals? I might even go so far as to suggest it's going to be a problem with all types of literals.
Not just literals; there is no common way of doing a character replacement (e.g. map toUpper) across the textual types, for example.
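Ivan's point can be seen concretely: the same uppercase operation needs a different spelling for each representation (a sketch using the standard String, Text, and Char8 APIs):

```haskell
import qualified Data.Char as C
import qualified Data.Text as T
import qualified Data.ByteString.Char8 as B

-- One logical operation, three incompatible spellings:
upperString :: String -> String
upperString = map C.toUpper

upperText :: T.Text -> T.Text
upperText = T.map C.toUpper

-- Per-byte, so only correct for ASCII data:
upperBytes :: B.ByteString -> B.ByteString
upperBytes = B.map C.toUpper
```

Generic code therefore can't be written once; it must either fix a representation or be duplicated per type.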
Isn't it also a problem which is partially solved with the OverloadedStrings extension? http://haskell.cs.yale.edu/ghc/docs/6.12.2/html/users_guide/type-class-exten...
That just convert literals; it doesn't provide a common API.
It seems like the interface exposed by ByteString could be in a type class. At that point, would the problem be solved?
To a certain extent, yes. There is no one typeclass that could cover everything (especially since something as simple as toUpper won't work if I understand Bryan's ß -> SS example), but it would help in the majority of cases. There has been one attempt, but it doesn't seem very popular (tagsoup has another, but it's meant to be internal only): http://hackage.haskell.org/packages/archive/ListLike/latest/doc/html/Data-Li... -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Excerpts from Kevin Jardine's message of Fri Aug 13 16:37:14 -0400 2010:
I find it disturbing that a modern programming language like Haskell still apparently forces you to choose between a representation for "mostly ASCII text" and Unicode.
Surely efficient Unicode text should always be the default? And if the Unicode format used by the Text library is not efficient enough then can't that be fixed?
For what it's worth, Java uses UTF-16 representation internally for strings, and thus also wastes space. There is something to be said for UTF-8 in-memory representation, but it takes a lot of care. A newtype for dirty and clean UTF-8 may come in handy. Cheers, Edward

On 8/13/10 16:37, Kevin Jardine wrote:
Surely efficient Unicode text should always be the default? And if the
Efficient for what? The most efficient Unicode representation for Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16. -- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
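Brandon's point can be checked directly with the text package's encoding functions. A small sketch (the byte counts follow from 1-byte ASCII vs 3-byte CJK code points in UTF-8, and 2-byte BMP code points in UTF-16):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

main :: IO ()
main = do
  let latin = T.pack "hello world"               -- 11 ASCII characters
      cjk   = T.pack "\x4f60\x597d\x4e16\x754c"  -- 4 CJK characters
  print (B.length (TE.encodeUtf8 latin))     -- 11 bytes (1 per char)
  print (B.length (TE.encodeUtf8 cjk))       -- 12 bytes (3 per char)
  print (B.length (TE.encodeUtf16LE latin))  -- 22 bytes (2 per char)
  print (B.length (TE.encodeUtf16LE cjk))    -- 8 bytes (2 per char)
```

So UTF-8 halves the storage for Latin text but costs 50% more for CJK text, relative to UTF-16.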

On Fri, Aug 13, 2010 at 6:41 PM, Brandon S Allbery KF8NH
On 8/13/10 16:37 , Kevin Jardine wrote:
Surely efficient Unicode text should always be the default? And if the
Efficient for what? The most efficient Unicode representation for Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.
I have an app that is using Data.Text, however I'm thinking of switching to UTF8 bytestrings. The reasons are that there are two main things I do with text: pass it to a C API to display, and parse it. The C API expects UTF8, and the parser libraries with a reputation for being fast all seem to have bytestring inputs, but not Data.Text (I'm using unpack -> parsec, which is not optimal).

On Friday 13 August 2010 8:51:46 pm Evan Laforge wrote:
I have an app that is using Data.Text, however I'm thinking of switching to UTF8 bytestrings. The reasons are that there are two main things I do with text: pass it to a C API to display, and parse it. The C API expects UTF8, and the parser libraries with a reputation for being fast all seem to have bytestring inputs, but not Data.Text (I'm using unpack -> parsec, which is not optimal).
You should be able to use parsec with text. All you need to do is write a Stream instance:
instance Monad m => Stream Text m Char where
    uncons = return . Text.uncons
-- Dan

On Fri, Aug 13, 2010 at 10:01 PM, Dan Doel
On Friday 13 August 2010 8:51:46 pm Evan Laforge wrote:
I have an app that is using Data.Text, however I'm thinking of switching to UTF8 bytestrings. The reasons are that there are two main things I do with text: pass it to a C API to display, and parse it. The C API expects UTF8, and the parser libraries with a reputation for being fast all seem to have bytestring inputs, but not Data.Text (I'm using unpack -> parsec, which is not optimal).
You should be able to use parsec with text. All you need to do is write a Stream instance:
instance Monad m => Stream Text m Char where
    uncons = return . Text.uncons
Then this should be in a 'parsec-text' package. Instances are always implicitly imported. Suppose packages A and B define this instance separately. If package C imports A and B, then it can't use either of those instances, nor define its own. Cheers! =) -- Felipe.
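One way around the orphan-instance hazard Felipe describes is to keep the instance next to a type you own. The newtype below is an illustrative sketch, not an API from any of the packages discussed:

```haskell
{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses #-}
import qualified Data.Text as T
import Text.Parsec

-- Owning the wrapper type makes the Stream instance non-orphan,
-- so it can't clash with an instance from another package.
newtype TextStream = TextStream T.Text

instance Monad m => Stream TextStream m Char where
  uncons (TextStream t) = return (fmap (fmap TextStream) (T.uncons t))

digits :: Parsec TextStream () String
digits = many1 digit

main :: IO ()
main = print (parse digits "" (TextStream (T.pack "123abc")))
```

The cost is wrapping and unwrapping at the boundaries; the benefit is that the instance travels with your code instead of being declared for types you don't control.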

On Aug 14, 2:41 am, Brandon S Allbery KF8NH
Efficient for what? The most efficient Unicode representation for Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.
I think that this kind of programming detail should be handled internally (even, if necessary, by switching automatically from UTF-8 to UTF-16 depending upon the language). I'm using Haskell so that I can write high level code. In my view I should not have to care if the people using my application write in Farsi, Quechua or Tamil. Kevin

On 8/14/10 01:29, Kevin Jardine wrote:
I think that this kind of programming detail should be handled internally (even if necessary by switching automatically from UTF-8 to UTF-16 depending upon the language).
This is going to carry a heavy speed penalty.
I'm using Haskell so that I can write high level code. In my view I should not have to care if the people using my application write in Farsi, Quechua or Tamil.
Ideally yes, but arguably the existing Unicode representations don't allow this to be done nicely. (Of course, arguably there is no "nice" way to do it; UTF-16 is the best you can do as a workable generic setting.) -- brandon s. allbery [linux,solaris,freebsd,perl] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Quoth Brandon S Allbery KF8NH
On 8/14/10 01:29 , Kevin Jardine wrote:
I think that this kind of programming detail should be handled internally (even if necessary by switching automatically from UTF-8 to UTF-16 depending upon the language).
It seems like the right thing, described in the wrong words - wouldn't it be a more sensible ideal, to simply `switch' depending on the character encoding? I mean, to start with, you'd surely wish for some standardization, so that the difference between UTF-8 and UTF-16 is essentially internal, while you use the same API indifferently. Second, a key requirement to effectively work with external data is support for multiple character encodings. E.g., if Text is internally UTF-16, it still must be able to input and output UTF-8, and presumably also UTF-16 where appropriate. So given full support for _both_ encodings (for example, Text implementation for `native' UTF-8), and support for input data of _either_ encoding as encountered at run time ... then the internal implementation choice should simply follow the external data. For Chinese inputs you'd be running UTF-16 functions, for French UTF-8. Donn Cave, donn@avvanta.com

Johan Tibell wrote:
On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine
mailto:kevinjardine@gmail.com> wrote: One of the more puzzling aspects of Haskell for newbies is the large number of libraries that appear to provide similar/duplicate functionality.
I agree.
Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskelleres and should be the fastest and most memory compact in most cases. There are still a few cases where String beats Text but they are being worked on as we speak.
Interesting. I've never even heard of Data.Text. When did that come into existence? More importantly: How does the average random Haskeller discover that a package has become available that might be relevant to their work?

Andrew Coppin
Interesting. I've never even heard of Data.Text. When did that come into existence?
The first version hit Hackage in February last year...
More importantly: How does the average random Haskeller discover that a package has become available that might be relevant to their work?
Look on Hackage; subscribe to mailing lists (where package maintainers should really write announcement emails), etc. It's rather surprising you haven't heard of text: it is for benchmarking this that Bryan wrote criterion; there's emails on -cafe and blog posts that mention it on a semi-regular basis, etc. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Ivan Lazar Miljenovic wrote:
Andrew Coppin
writes: More importantly: How does the average random Haskeller discover that a package has become available that might be relevant to their work?
Look on Hackage; subscribe to mailing lists (where package maintainers should really write announcement emails), etc.
OK. I guess I must have missed that one...
It's rather surprising you haven't heard of text: it is for benchmarking this that Bryan wrote criterion; there's emails on -cafe and blog posts that mention it on a semi-regular basis, etc.
Well, I suppose I don't do a lot of text processing work... If all you're trying to do is parse commands from an interactive terminal prompt, [Char] is probably good enough. (What I do do is process big chunks of binary data - which is what ByteString is intended for.)

Andrew Coppin
Well, I suppose I don't do a lot of text processing work... If all you're trying to do is parse commands from an interactive terminal prompt, [Char] is probably good enough.
Neither do I, yet I've heard of it... ;-) -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

andrewcoppin:
Interesting. I've never even heard of Data.Text. When did that come into existence?
More importantly: How does the average random Haskeller discover that a package has become available that might be relevant to their work?
In this case, Data.Text has been announced on this very list several times:

Text 0.7 announcement http://www.haskell.org/pipermail/haskell-cafe/2009-December/070866.html
Text 0.5 announcement http://www.haskell.org/pipermail/haskell-cafe/2009-October/067517.html
Text 0.2 announcement http://www.haskell.org/pipermail/haskell-cafe/2009-May/061800.html
Text 0.1 announcement http://www.haskell.org/pipermail/haskell-cafe/2009-February/056723.html

As well as on Planet Haskell several times:

Finally! Fast Unicode support for Haskell http://www.serpentine.com/blog/2009/02/27/finally-fast-unicode-support-for-h...
Streaming Unicode support for Haskell: text 0.2 http://www.serpentine.com/blog/2009/05/22/streaming-unicode-support-for-hask...
Case conversion and text 0.3 http://www.serpentine.com/blog/2009/06/07/case-conversion-and-text-03/

As well as being presented at Anglo Haskell http://www.wellquite.org/non-blog/AngloHaskell2008/tom%20harper.pdf

It is mentioned repeatedly in the quarterly Hackage status posts:

"vector and text are quickly rising as the preferred arrays and unicode libraries" http://donsbot.wordpress.com/2010/04/03/the-haskell-platform-q1-2010-report/
"text has made it into the top 30 libraries" http://donsbot.wordpress.com/2010/06/30/popular-haskell-packages-q2-2010-rep...
Ranked 31st most popular package by June 2010. http://code.haskell.org/~dons/hackage/Jun-2010/popular.txt
Ranked 41st most popular package by April 2010. http://www.galois.com/~dons/hackage/april-2010/popularity.csv
Ranked 345th by August 2009 http://www.galois.com/~dons/hackage/august-2009/popularity-august-2009.html

And discussed on Reddit Haskell many times:

http://www.reddit.com/r/haskell/comments/8qfvw/doing_unicode_case_conversion...
http://www.reddit.com/r/haskell/comments/80smp/datatext_fast_unicode_bytestr...
http://www.reddit.com/r/haskell/comments/ade08/the_performance_of_datatext/

So, to stay up to date, but without drowning in data, do one of:

* Pay attention to Haskell Cafe announcements
* Follow the Reddit Haskell news
* Read the quarterly reports on Hackage
* Follow Planet Haskell

-- Don

Don Stewart
* Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell
And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :( -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

2010/8/15 Ivan Lazar Miljenovic
Don Stewart
writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell
And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :(
If you're interested in a comprehensive update list, you can follow Hackage on Twitter, or the news feed. Cheers, Thu

Vo Minh Thu
2010/8/15 Ivan Lazar Miljenovic
: Don Stewart
writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell
And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :(
If you're interested in a comprehensive update list, you can follow Hackage on Twitter, or the news feed.
Except that that doesn't tell you: * The purpose of the library * How a release differs from a previous one * Why you should use it, etc. Furthermore, several interesting discussions have arisen out of announcement emails. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

2010/8/15 Ivan Lazar Miljenovic
Vo Minh Thu
writes: 2010/8/15 Ivan Lazar Miljenovic
: Don Stewart
writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell
And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :(
If you're interested in a comprehensive update list, you can follow Hackage on Twitter, or the news feed.
Except that that doesn't tell you:
* The purpose of the library * How a release differs from a previous one * Why you should use it, etc.
Furthermore, several interesting discussions have arisen out of announcement emails.
Sure, nor does it write a book chapter about some practical usage. I mean (tongue in cheek) that none of the other resources, nor even a proper announcement, provide all that. I still remember the UHC announcement (a (nearly) complete Haskell 98 compiler) thread where most of the discussion was about the lack of support for n+k patterns. But the bullet list above was to point Andrew to a few places where he could have learned about Text. Cheers, Thu

Ivan Lazar Miljenovic
Don Stewart
writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell
And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :(
Subscribe to http://hackage.haskell.org/packages/archive/recent.rss in
your RSS reader: problem solved!
G
--
Gregory Collins

Gregory Collins
Ivan Lazar Miljenovic
writes: Don Stewart
writes: * Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell
And yet there are still many packages that fall under the radar with no announcements of any kind on initial release or even new versions :(
Subscribe to http://hackage.haskell.org/packages/archive/recent.rss in your RSS reader: problem solved!
As I said in reply to someone else: that won't help you get the intent of a library, how it has changed from previous versions, etc. -- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com IvanMiljenovic.wordpress.com

Don Stewart wrote:
So, to stay up to date, but without drowning in data. Do one of:
* Pay attention to Haskell Cafe announcements * Follow the Reddit Haskell news. * Read the quarterly reports on Hackage * Follow Planet Haskell
Interesting. Obviously I look at Haskell Cafe from time to time (although there's usually far too much traffic to follow it all). I wasn't aware of *any* of the other resources listed.

On Fri, Aug 13, 2010 at 10:43 AM, Johan Tibell
Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskelleres and should be the fastest and most memory compact in most cases. There are still a few cases where String beats Text but they are being worked on as we speak.
It's a good rule, but I don't know how helpful it is to someone doing
XML processing. From what I can tell, the only XML library that uses
Data.Text is libxml-sax, although tagsoup can probably be easily
extended to use it. HXT, HaXml, and xml all use [Char] internally.
--
Dave Menendez

Pierre-Etienne Meunier wrote:
Hi,
Why don't you use the Data.Rope library ? The asymptotic complexities are way better than those of the ByteString functions.
What I see as my current problem is that there are already two things, String and ByteString, that represent strings. Adding Text and Data.Rope makes that problem worse, not better. Erik -- ---------------------------------------------------------------------- Erik de Castro Lopo http://www.mega-nerd.com/
participants (38)
-
anderson leo
-
Andrew Coppin
-
Benedikt Huber
-
Bill Atkins
-
Brandon S Allbery KF8NH
-
Bryan O'Sullivan
-
Bulat Ziganshin
-
Colin Paul Adams
-
Dan Doel
-
Daniel Fischer
-
Daniel Peebles
-
David Menendez
-
Don Stewart
-
Donn Cave
-
Duncan Coutts
-
Edward Z. Yang
-
Erik de Castro Lopo
-
Evan Laforge
-
Felipe Lessa
-
Florian Weimer
-
Gregory Collins
-
Gábor Lehel
-
Ivan Lazar Miljenovic
-
Jason Dagit
-
Jinjing Wang
-
Johan Tibell
-
John Meacham
-
John Millikin
-
Ketil Malde
-
Kevin Jardine
-
Michael Snoyman
-
Pierre-Etienne Meunier
-
Richard O'Keefe
-
Sean Leather
-
Tako Schotanus
-
Vo Minh Thu
-
wren ng thornton
-
Yitzchak Gale