Lightweight Unicode normalization library

I looked around and found only one package, text-icu, which provides Unicode normalization operations and a lot more. But text-icu depends on the icu library being installed on the system, and we would prefer to avoid that dependency.

Is there a lightweight alternative which does not depend on icu? It could be a pure Haskell package, or bindings to a small C library that is shipped with the package itself.

I wonder whether GHC itself needs Unicode normalization anywhere. If so, how does it handle that?

I found a lightweight C library for normalization and case folding, utf8proc (https://github.com/JuliaLang/utf8proc), which is used by the Julia language project. If there is no other option, I am considering creating bindings to this library.

Any pointers, thoughts?

Thanks,
Harendra
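
For readers unfamiliar with the operation being asked about, the following is a minimal pure-Haskell sketch of what canonical decomposition (NFD) involves: recursively expand precomposed characters, then reorder each run of combining marks by canonical combining class. It is only an illustration. The names `toyNFD`, `decompTable` and `cccTable` are hypothetical, the tables cover a handful of characters instead of the full Unicode Character Database, Hangul is ignored, and none of this is meant to reflect how text-icu, utf8proc or prose are actually implemented.

```haskell
import Data.Char (ord)
import Data.List (sortOn)
import Data.Maybe (fromMaybe)

-- Toy canonical decomposition table (assumption: a real library would
-- generate this from the Unicode Character Database).
decompTable :: [(Char, String)]
decompTable =
  [ ('\x00E9', "e\x0301")       -- e with acute
  , ('\x00C5', "A\x030A")       -- A with ring above
  , ('\x1E69', "\x1E63\x0307")  -- s with dot below and dot above
  , ('\x1E63', "s\x0323")       -- s with dot below
  ]

-- Toy canonical combining classes; anything not listed is a starter (0).
cccTable :: [(Char, Int)]
cccTable =
  [ ('\x0301', 230), ('\x0307', 230), ('\x030A', 230), ('\x0323', 220) ]

combiningClass :: Char -> Int
combiningClass c = fromMaybe 0 (lookup c cccTable)

-- Recursively expand a character into its full canonical decomposition.
decompose :: Char -> String
decompose c = maybe [c] (concatMap decompose) (lookup c decompTable)

-- Stable-sort each maximal run of combining marks by combining class.
reorder :: String -> String
reorder [] = []
reorder s@(c:cs)
  | combiningClass c == 0 = c : reorder cs
  | otherwise =
      let (marks, rest) = span ((> 0) . combiningClass) s
      in sortOn combiningClass marks ++ reorder rest

toyNFD :: String -> String
toyNFD = reorder . concatMap decompose

main :: IO ()
main = do
  let precomposed = "\x1E69"         -- one code point
      decomposed  = "s\x0307\x0323"  -- same text, marks in another order
  print (map ord (toyNFD precomposed))
  print (toyNFD precomposed == toyNFD decomposed)
```

The point of the sketch is that two differently encoded strings compare equal after normalization. The association-list lookups are chosen for brevity only; a usable library would need complete tables and a much faster lookup structure.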

I don’t have a good answer, but I thought I’d mention this project, which looks interesting and which I’m considering using myself:
https://github.com/llelf/prose
-- Rob Leslie rob@mars.org
(Replying to Harendra Kumar's message of Mar 17, 2016, 12:59 AM.)

Ah, I already created a package for Unicode normalization, since I got no
responses to my mail:
https://github.com/harendra-kumar/unicode-transforms
I will take a look at prose as well, since it is native Haskell. It does not
seem to be on Hackage yet.
-harendra
(Replying to Rob Leslie's message of 25 March 2016, 05:08.)

I looked at prose and ran some tests. It builds and works well functionally
(the normalization tests passed), but its normalization performance turns out
to be poor: 171 times slower than text-icu. I believe it can be improved with
some changes to the data structures, though whether the performance matters
depends on your use case.
Here are the results of a quick normalization benchmark that I ran using
text-icu, unicode-transforms (bindings to the utf8proc C library) and prose:
text-icu           =   1 sec (224 MB/s on the test machine)
unicode-transforms =   6 sec (40 MB/s)
prose              = 171 sec (1.3 MB/s)
It looks like icu is the gold standard in performance. Even GNU
libunistring's performance seems to be very similar to that of utf8proc.
-harendra
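
The relative slowdowns quoted above follow directly from the timings, and the arithmetic can be sketched as below. The names `timings`, `slowdowns` and `throughput` are hypothetical, and the 224 MB input size is an assumption inferred from the text-icu line (224 MB/s for 1 sec), not something stated in the message. Because the reported times are whole seconds, throughput derived this way only approximates the MB/s figures given.

```haskell
-- Wall-clock times in seconds, as reported in the message above.
timings :: [(String, Double)]
timings =
  [ ("text-icu", 1)
  , ("unicode-transforms", 6)
  , ("prose", 171)
  ]

-- Slowdown of each entry relative to a baseline time in seconds.
slowdowns :: Double -> [(String, Double)] -> [(String, Double)]
slowdowns base = map (fmap (/ base))

-- Throughput in MB/s for an input of the given size in MB.
throughput :: Double -> Double -> Double
throughput sizeMB secs = sizeMB / secs

main :: IO ()
main = do
  -- text-icu took 1 sec, so it serves as the baseline.
  mapM_ print (slowdowns 1 timings)
  -- Assumed input size: 224 MB (inferred, see the note above).
  mapM_ (\(name, t) -> putStrLn (name ++ ": " ++ show (throughput 224 t) ++ " MB/s"))
        timings
```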
(Replying to Harendra Kumar's message of 25 March 2016, 15:57.)
Participants (2): Harendra Kumar, Rob Leslie