[ANN] unicode-transforms-0.2.0 pure Haskell unicode normalization

Hi,

I released unicode-transforms some time back as bindings to a C library (utf8proc). Since then I have rewritten it completely in Haskell. The Haskell data structures are generated automatically from the Unicode character database, so the package can be kept up to date with the standard, unlike the C implementation, which was stuck at Unicode 5. The implementation comes with a test suite providing 100% code coverage.

After a number of algorithmic and implementation-efficiency optimizations, I was able to get several times better decompose performance than the C implementation. I have not yet had a chance to fully optimize the compose operations, but they are still as fast as utf8proc.

I would like to thank Antonio Nikishaev for the Unicode character database parsing code, which I borrowed from the prose library.

https://github.com/harendra-kumar/unicode-transforms
https://hackage.haskell.org/package/unicode-transforms

-harendra
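Basic usage is a single normalize function over Text. A minimal sketch (the module and constructor names here are taken from the package documentation; please check the haddocks of the version you install, as the exposed modules may differ between releases):

{-# LANGUAGE OverloadedStrings #-}

import Data.Text (Text)
import qualified Data.Text.IO as T
import Data.Text.Normalize (NormalizationMode (..), normalize)

main :: IO ()
main = do
  -- "café" with a precomposed é (U+00E9)
  let composed = "caf\x00E9" :: Text
      -- NFD splits é into 'e' followed by COMBINING ACUTE ACCENT (U+0301)
      decomposed = normalize NFD composed
  T.putStrLn decomposed
  -- NFC recomposes the pair back into the precomposed form
  print (normalize NFC decomposed == composed)  -- True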

Interesting! What would you say allowed you to get better decompose
performance than the C library?
Will

I did not fully compare the implementations; I just focused on getting as much
performance out of the Haskell implementation as possible. I can point to two
things that might have allowed it to be better:
1) I extracted as much implementation efficiency out of the Haskell code as
possible, so I did not lose anything there. The code could have been much
simpler without all the optimizations.
2) My implementation may be better in terms of the algorithms and data
structures used. Unicode normalization is complicated, and implementations can
differ in many ways that gain or lose performance; see the sketch below for
the kind of lookup structure involved.
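To give a flavour of (2): the decomposition and combining-class data is baked in as generated Haskell tables, roughly of this shape. This is only an illustrative sketch, not the code the generator actually emits; the real entries come from UnicodeData.txt:

-- Illustrative sketch of generated decomposition tables (not the actual
-- code produced by unicode-transforms). Each character maps to its
-- canonical decomposition; GHC compiles such case tables into fast lookups.
decompose :: Char -> [Char]
decompose c = case c of
  '\x00E9' -> ['\x0065', '\x0301']   -- é -> e + COMBINING ACUTE ACCENT
  '\x00C5' -> ['\x0041', '\x030A']   -- Å -> A + COMBINING RING ABOVE
  '\x1E0D' -> ['\x0064', '\x0323']   -- ḍ -> d + COMBINING DOT BELOW
  _        -> [c]                    -- no decomposition: identity

-- Combining class lookup, used for canonical ordering of marks.
combiningClass :: Char -> Int
combiningClass c = case c of
  '\x0301' -> 230   -- COMBINING ACUTE ACCENT
  '\x030A' -> 230   -- COMBINING RING ABOVE
  '\x0323' -> 220   -- COMBINING DOT BELOW
  _        -> 0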
Beating the utf8proc implementation was easy. The best (highly optimized)
normalization implementation is the ICU C++ one, and my target was to get
close to that. I got pretty close to it (using the LLVM backend) in most
benchmarks and even beat it clearly in one. I have filed a couple of
enhancement requests against GHC; hopefully they will allow it to be
completely on par in all benchmarks, though the difference may not matter
beyond proving that it can be as good.
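For anyone who wants to reproduce the comparison, a criterion benchmark along these lines works (a sketch only; it assumes the text-icu package for the ICU side, and the module names are from the two packages' docs):

-- Sketch of a decompose/compose benchmark; module names are assumptions
-- based on the Hackage docs for criterion, text-icu and unicode-transforms.
import Criterion.Main (bench, bgroup, defaultMain, nf)
import qualified Data.Text as T
import qualified Data.Text.ICU.Normalize as ICU        -- package text-icu
import qualified Data.Text.Normalize as UT             -- package unicode-transforms

main :: IO ()
main = do
  let input = T.replicate 10000 (T.pack "caf\x00E9 sch\x00F6n \x1E0Dh")
  defaultMain
    [ bgroup "NFD"
        [ bench "unicode-transforms" $ nf (UT.normalize UT.NFD) input
        , bench "text-icu"           $ nf (ICU.normalize ICU.NFD) input
        ]
    , bgroup "NFC"
        [ bench "unicode-transforms" $ nf (UT.normalize UT.NFC) input
        , bench "text-icu"           $ nf (ICU.normalize ICU.NFC) input
        ]
    ]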
-harendra
participants (2)
- Harendra Kumar
- William Yager