
Hello all, I'm in need of a Unicode normalization function, Utf8 NFC for ByteString in particular. From some quick Googling around it looks like the only available option is to use ICU in some form. The text-icu package has a nice binding to it, but unfortunately that means a lot of redundant conversions (Utf8 ByteString -> Text; Text -> Utf8 ByteString) and an additional rather large non-Haskell dependency[1]. Is ICU really the only available implementation of normalization? The TR15 doesn't really give a complete algorithm and only hints at the "numerous opportunities for optimization" implicit in the complexity of the spec. [1] Which is especially annoying on OSX since OSX does ship with libicu in a public location, but it doesn't provide header files and apparently it's incomplete somehow, meaning you'd have to reinstall it for text-icu to use (and hilarity ensues when your copy gets out of sync with the OS's). -- Live well, ~wren

It may be the only highly optimised implementation with source code
available.
I needed to implement this some years ago in Eiffel. I did it in a fairly
straight-forward way, without worrying too much about optimal code, as my
requirement was merely to conform. The nice thing about implementing
normalization is that the test file provided on unicode.org means you can
automate generation of a comprehensive test suite. So you can implement a
naive algorithm (as I did), and then progressively optimize (which I never
got round to doing), with 100% assurance that you're not introducing bugs.
On 20 April 2011 03:02, wren ng thornton
Hello all,
I'm in need of a Unicode normalization function, Utf8 NFC for ByteString in particular. From some quick Googling around it looks like the only available option is to use ICU in some form. The text-icu package has a nice binding to it, but unfortunately that means a lot of redundant conversions (Utf8 ByteString -> Text; Text -> Utf8 ByteString) and an additional rather large non-Haskell dependency[1].
Is ICU really the only available implementation of normalization? The TR15 doesn't really give a complete algorithm and only hints at the "numerous opportunities for optimization" implicit in the complexity of the spec.
[1] Which is especially annoying on OSX since OSX does ship with libicu in a public location, but it doesn't provide header files and apparently it's incomplete somehow, meaning you'd have to reinstall it for text-icu to use (and hilarity ensues when your copy gets out of sync with the OS's).
-- Live well, ~wren
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Colin Adams Preston, Lancashire, ENGLAND () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org - against proprietary attachments
participants (2)
-
Colin Adams
-
wren ng thornton