[GSoC] Text/UTF-8: Call for Benchmarks

Hello all,

I'm very glad to have been accepted again this year for the Google Summer of Code [1] program for haskell.org. My project aims to improve the text [2] library by converting it to use UTF-8 internally instead of UTF-16.

UTF-8 and UTF-16 both have advantages and disadvantages, which actually makes it a pretty complicated choice. I've written about this a little in my proposal [3] (especially see Tom Harper's master's dissertation if you're interested in the subject).

To support a decision here on UTF-8 vs. UTF-16, lots of benchmarks will be needed. Hence, this is the first focus of the GSoC project: collecting a large benchmark suite which models real-world usage of the text library.

This is why I'd like to ask everyone who has written, or knows of, libraries or applications that use the text library extensively to let me know about them. The reverse dependencies list on Hackage is a good starting point for me, but it doesn't show how popular these packages are or how intensively they use the text library.

I will then convert a subset of this code to a benchmark suite using criterion. Open source code means more reliable benchmarks, because I can publish the code I used for them. However, I'm also willing to sign non-disclosure agreements if this means I can try out what effects the changes have on large systems.

There are several ways to contact me: you can reply to this thread, or you can mail me privately at `jaspervdj+text@gmail.com`.

Thanks in advance for any help!

[1]: http://code.google.com/soc/
[2]: http://hackage.haskell.org/package/text
[3]: http://jaspervdj.be/files/2011-gsoc-text-utf8-proposal.html

Cheers,
Jasper
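For readers unfamiliar with criterion, a benchmark for the text library might look roughly like the sketch below. This is only an illustration, not the project's actual suite: `sampleText`, `sampleBytes` and `wordCount` are hypothetical stand-ins for data and operations that would really come from the contributed libraries and applications.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical workload: a stand-in for text taken from a real application.
sampleText :: T.Text
sampleText = T.replicate 10000 "The quick brown fox jumps over the lazy dog. "

-- The same workload as UTF-8 bytes, used as input for the decoding benchmark.
sampleBytes :: B.ByteString
sampleBytes = TE.encodeUtf8 sampleText

-- Hypothetical "real-world" operation: count whitespace-separated words.
wordCount :: T.Text -> Int
wordCount = length . T.words

main :: IO ()
main = defaultMain
  [ bgroup "text"
      [ bench "wordCount"  $ nf wordCount sampleText
      , bench "toUpper"    $ nf T.toUpper sampleText
      , bench "encodeUtf8" $ nf TE.encodeUtf8 sampleText
      , bench "decodeUtf8" $ nf TE.decodeUtf8 sampleBytes
      ]
  ]
```

Running the same benchmarks against a UTF-16 build and a UTF-8 build of the library would then give comparable numbers per operation. `nf` forces each result fully, so lazily produced output does not hide the real cost.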

On Wed, Apr 27, 2011 at 8:24 AM, Jasper Van der Jeugt wrote:
> UTF-8 and UTF-16 both have advantages and disadvantages, which actually makes it a pretty complicated choice. I've written about this a little in my proposal [3] (especially see Tom Harper's master's dissertation if you're interested in the subject).
About [3]: converting UTF-8 input to an internal UTF-8 representation while reading isn't O(1), because you have to at least check that the input really is valid UTF-8. On the other hand, writing shouldn't need any such check, because the library guarantees that all Texts have valid internal representations.

Thanks,

-- Felipe.
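To illustrate why the check on reading is linear, here is a minimal sketch of a UTF-8 validator. It is not the text library's implementation; `isValidUtf8` is a hypothetical helper that follows the well-formed byte ranges from the Unicode standard, and it has to look at every byte of the input, so it is O(n) in the input length.

```haskell
module Main (main) where

import qualified Data.ByteString as B
import Data.Word (Word8)

-- Hypothetical helper (not part of the text library): True if the whole
-- input is well-formed UTF-8. Every byte is inspected once, hence O(n).
isValidUtf8 :: B.ByteString -> Bool
isValidUtf8 = go . B.unpack
  where
    go [] = True
    go (b0 : rest)
      | b0 <= 0x7F           = go rest                 -- ASCII
      | inRange 0xC2 0xDF b0 = conts 0x80 0xBF 0 rest  -- 2-byte sequence
      | b0 == 0xE0           = conts 0xA0 0xBF 1 rest  -- 3-byte, no overlongs
      | inRange 0xE1 0xEC b0 = conts 0x80 0xBF 1 rest
      | b0 == 0xED           = conts 0x80 0x9F 1 rest  -- excludes surrogates
      | inRange 0xEE 0xEF b0 = conts 0x80 0xBF 1 rest
      | b0 == 0xF0           = conts 0x90 0xBF 2 rest  -- 4-byte, no overlongs
      | inRange 0xF1 0xF3 b0 = conts 0x80 0xBF 2 rest
      | b0 == 0xF4           = conts 0x80 0x8F 2 rest  -- up to U+10FFFF
      | otherwise            = False

    -- The first trailing byte must lie in [lo, hi]; it is followed by
    -- n further ordinary continuation bytes (0x80..0xBF).
    conts :: Word8 -> Word8 -> Int -> [Word8] -> Bool
    conts lo hi n (b1 : rest)
      | inRange lo hi b1 = plain n rest
    conts _ _ _ _ = False

    plain :: Int -> [Word8] -> Bool
    plain 0 rest = go rest
    plain n (b : rest)
      | inRange 0x80 0xBF b = plain (n - 1) rest
    plain _ _ = False

    inRange :: Word8 -> Word8 -> Word8 -> Bool
    inRange lo hi x = lo <= x && x <= hi

main :: IO ()
main = do
  print (isValidUtf8 (B.pack [0x68, 0x69]))  -- ASCII "hi"
  print (isValidUtf8 (B.pack [0xC3, 0xA9]))  -- U+00E9
  print (isValidUtf8 (B.pack [0xC0, 0x80]))  -- overlong encoding, rejected
  print (isValidUtf8 (B.pack [0xE2, 0x82]))  -- truncated sequence, rejected
```

In the other direction no such pass is needed: if every Text is already valid internally, writing it out as UTF-8 can copy the bytes without inspecting them.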
participants (2)
- Felipe Almeida Lessa
- Jasper Van der Jeugt