
Benedikt Huber wrote:
> Despite all this, I think the performance of the text package is very promising, and hope it will improve further!
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.

A large fraction - probably most - of textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields). For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, with "real" text making up only a few percent of this. By comparison, the combined (all languages) Wikipedia is 2G words, probably less than 20GB.

Being agnostic about string encoding - viz. treating it as bytes - works okay, but it would be nice to allow Unicode in the bits that actually are text, like string fields and labels and such.

Due to the sizes involved, I think that in order to process text-formatted data efficiently, UTF-8 is the no-brainer choice of encoding -- certainly for storage, but also for in-memory processing. Unfortunately, there is no clear Data.Text-like effort here. There's (at least):

  utf8-string        - provides UTF-8 encoded lazy and strict bytestrings, as well as some other data types (and a common class) and System.Environment functionality
  utf8-light         - provides encoding/decoding to/from (strict?) bytestrings
  regex-tdfa-utf8    - regular expressions on UTF-8 encoded lazy bytestrings
  utf8-env           - provides a UTF-8 aware System.Environment
  uhexdump           - hex dumps for UTF-8 (?)
  compact-string     - support for many different string encodings
  compact-string-fix - indicates that the above is unmaintained
From a quick glance, it appears that utf8-string is the most complete and best maintained of the crowd, but I could be wrong. It would be nice if an effort similar to the one Data.Text has seen could be applied to e.g. utf8-string, to produce a similarly efficient and effective library and allow the others to be deprecated. IMO, this could in time replace .Char8 as the default ByteString string representation. Hackathon, anyone?
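To put a rough number on the size argument, here is a minimal sketch that encodes one mostly-ASCII record both ways and compares the byte counts. The record itself and the helper names (sample, utf8Bytes, utf16Bytes) are made up for illustration; only encodeUtf8 and encodeUtf16LE from Data.Text.Encoding are real library functions. It measures encoded size only, not the in-memory overhead of any particular package.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical record: tab-separated ASCII fields with a single
-- non-ASCII character in the name field.
sample :: T.Text
sample = "id=42\tname=Bjørn\tscore=0.97\n"

-- Size in bytes of the record when encoded as UTF-8.
utf8Bytes :: T.Text -> Int
utf8Bytes = B.length . TE.encodeUtf8

-- Size in bytes of the record when encoded as UTF-16 (little endian, no BOM).
utf16Bytes :: T.Text -> Int
utf16Bytes = B.length . TE.encodeUtf16LE

main :: IO ()
main = do
  putStrLn ("characters:   " ++ show (T.length sample))
  putStrLn ("UTF-8 bytes:  " ++ show (utf8Bytes sample))
  putStrLn ("UTF-16 bytes: " ++ show (utf16Bytes sample))
```

For data like this, UTF-8 costs one byte per ASCII character and two for the ø, while UTF-16 costs two bytes for every character, so the UTF-16 form comes out at nearly twice the size. At 10-100GB that difference is hard to ignore.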
-k
-- 
If I haven't seen further, it is by standing in the footprints of giants