I looked at prose and ran some tests. It builds and is functionally correct (the normalization tests pass), but its normalization performance turns out to be poor: about 171 times slower than text-icu in my benchmark. I believe it can be improved with some changes to the data structures, though whether the performance matters depends on your use case.

Here are the results of a quick normalization benchmark that I ran using text-icu, unicode-transforms (bindings to the utf8proc C library), and prose:

text-icu = 1 sec (224 MB/s on the test machine)

unicode-transforms = 6 sec (40 MB/s)

prose = 171 sec (1.3 MB/s)
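
As a sanity check on the figures above, here is a minimal Haskell sketch that derives the slowdown factors and the input size each row implies (elapsed time multiplied by throughput). The `Result` type and the function names are hypothetical, introduced only for illustration; the numbers are the ones reported above. It uses only `base`.

```haskell
module Main where

import Text.Printf (printf)

-- | One benchmark row: library name, elapsed seconds, reported MB/s.
-- (Hypothetical type, for illustration only.)
data Result = Result
  { library        :: String
  , seconds        :: Double
  , throughputMBps :: Double
  }

-- | The three rows reported in this mail.
results :: [Result]
results =
  [ Result "text-icu"           1   224
  , Result "unicode-transforms" 6   40
  , Result "prose"              171 1.3
  ]

-- | The row every other row is compared against.
baseline :: Result
baseline = Result "text-icu" 1 224

-- | How many times slower a row is than the reference row, by elapsed time.
slowdown :: Result -> Result -> Double
slowdown ref r = seconds r / seconds ref

-- | Input size implied by a row: elapsed time multiplied by throughput.
impliedSizeMB :: Result -> Double
impliedSizeMB r = seconds r * throughputMBps r

-- | Print one summary line for a row.
report :: Result -> IO ()
report r =
  printf "%-20s %4.0fx slower, ~%.0f MB input\n"
    (library r) (slowdown baseline r) (impliedSizeMB r)

main :: IO ()
main = mapM_ report results
```

The implied input sizes come out at roughly 222 to 240 MB, so the three rows are consistent with one input of the same size, allowing for rounding in the reported times.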

It looks like ICU is the gold standard for performance. Even GNU libunistring seems to perform about the same as utf8proc.

-harendra

On 25 March 2016 at 15:57, Harendra Kumar <harendra.kumar@gmail.com> wrote:

Ah, I have already created a package for Unicode normalization, since I got no responses to my mail:

https://github.com/harendra-kumar/unicode-transforms

I will take a look at prose as well since it is native Haskell. It does not seem to be on Hackage yet.

-harendra

On 25 March 2016 at 05:08, Rob Leslie <rob@mars.org> wrote:
I don’t have a good answer, but I thought I’d mention this project which looks interesting and I’m considering using myself:

https://github.com/llelf/prose

--
Rob Leslie
rob@mars.org

On Mar 17, 2016, at 12:59 AM, Harendra Kumar <harendra.kumar@gmail.com> wrote:

I looked around and found only one package, text-icu, which provides Unicode normalization operations and a lot more. But text-icu depends on the ICU library being installed on the system, and we would prefer to avoid a dependency on ICU.
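
For readers unfamiliar with the problem, a minimal sketch of why normalization is needed at all: the same visible character can be encoded as different code point sequences, and plain equality treats them as different strings. The example uses only `base`; the names `composed` and `decomposed` are made up for illustration. A normalization function such as `normalize NFC` from `Data.Text.ICU` in text-icu is meant to map both spellings to the same form.

```haskell
module Main where

import Data.Char (ord)

-- | "é" as a single precomposed code point:
-- U+00E9 LATIN SMALL LETTER E WITH ACUTE.
composed :: String
composed = "\233"

-- | "é" as a base letter plus a combining mark:
-- U+0065 LATIN SMALL LETTER E, then U+0301 COMBINING ACUTE ACCENT.
decomposed :: String
decomposed = "e\769"

main :: IO ()
main = do
  print (map ord composed)        -- [233]
  print (map ord decomposed)      -- [101,769]
  -- Canonically equivalent text, but not equal code point by code point.
  print (composed == decomposed)  -- False
```

NFC would turn the second spelling into the first, and NFD the first into the second; comparing after either normalization makes the two strings equal.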

Is there a lightweight alternative which does not depend on icu? It could be a pure Haskell package or bindings to a lightweight C library where the library is small and shipped with the package itself.

I wonder if there is a need for Unicode normalization operations in the GHC code itself? If so, how does it handle that?

I found a lightweight C library (https://github.com/JuliaLang/utf8proc) for normalization and case folding, used by the Julia language project. If there is no other option, I am considering creating bindings to this library.

Any pointers, thoughts?

Thanks,
Harendra
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe