
Does there exist a Haskell library or function for getting grapheme lengths of String/Text values? E.g. right now,

    Prelude.length ("ä" :: String) == 2
    Data.Text.length ("ä" :: Text) == 2

But I'd like, either for String or Text, to get 1 instead (there are 2 code points but only 1 grapheme).

Thanks,
Tom

On Fri, Feb 19, 2021 at 06:05:12PM -0700, amindfv--- via Haskell-Cafe wrote:
> Does there exist a Haskell library or function for getting grapheme lengths of String/Text values?
Depends on your definition of "grapheme length" :-) If you're OK with counting NFC code points, then the answer is yes, via the "text-icu" package.

    $ cabal repl -z -v0 \
        --repl-options "-package=text-icu" \
        --repl-options "-package=text" \
        --repl-options -XOverloadedStrings
    λ> import qualified Data.Text as T
    λ> import Data.Text.ICU.Normalize
    λ> length $ T.unpack $ normalize NFC "ä"
    1
    λ> length $ T.unpack $ normalize NFD "ä"
    2
    λ> length $ T.unpack $ normalize NFC $ normalize NFD "ä"
    1

With the "Data.Text.ICU.Char" module, it may be possible to determine grapheme boundaries:

    https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Char...

-- Viktor.

On Fri, Feb 19, 2021 at 09:03:44PM -0500, Viktor Dukhovni wrote:
> On Fri, Feb 19, 2021 at 06:05:12PM -0700, amindfv--- via Haskell-Cafe wrote:
>> Does there exist a Haskell library or function for getting grapheme lengths of String/Text values?
> Depends on your definition of "grapheme length" :-) If you're OK with counting NFC code points, then the answer is yes, via the "text-icu" package.
>
>     $ cabal repl -z -v0 \
>         --repl-options "-package=text-icu" \
>         --repl-options "-package=text" \
>         --repl-options -XOverloadedStrings
>     λ> import qualified Data.Text as T
>     λ> import Data.Text.ICU.Normalize
>     λ> length $ T.unpack $ normalize NFC "ä"
>     1
>     λ> length $ T.unpack $ normalize NFD "ä"
>     2
>     λ> length $ T.unpack $ normalize NFC $ normalize NFD "ä"
>     1
Thanks. Unfortunately this doesn't work well for graphemes which don't have a 1-code-point equivalent, like:

    length (T.unpack $ normalize NFC $ normalize NFD "❤️") == 2
> With the "Data.Text.ICU.Char" module, it may be possible to determine grapheme boundaries:
> https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Char...
I'll look into this and report back.

Tom
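To see why NFC cannot help with the heart example above, it can be useful to list the code points involved. The following is a small sketch using only base; `codePoints` is a hypothetical helper name, and the heart is assumed to be encoded as U+2764 followed by the variation selector U+FE0F (the usual emoji form), a sequence for which no single precomposed code point exists.

```haskell
import Data.Char (ord, toUpper)
import Numeric (showHex)

-- | Render each code point of a String in the usual U+XXXX notation.
-- (Hypothetical helper, for inspection only.)
codePoints :: String -> [String]
codePoints = map fmt
  where
    fmt c = "U+" ++ pad (map toUpper (showHex (ord c) ""))
    -- Left-pad with zeros to at least four hex digits.
    pad s = replicate (4 - length s) '0' ++ s
```

In GHCi, `codePoints "a\x0308"` lists the decomposed form of "ä" (two code points), `codePoints "\xE4"` the precomposed form (one), and `codePoints "\x2764\xFE0F"` the heart under the assumption above (two, with nothing for NFC to compose).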

>> With the "Data.Text.ICU.Char" module, it may be possible to determine grapheme boundaries:
>> https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Char...
> I'll look into this and report back.
I'm quite prepared to believe this is wrong/misguided, but I was able to hack something together that works for my uses so far:

    import qualified Data.Text as T
    import Data.Text.ICU.Char

    len :: T.Text -> Int
    len = length . filter (==Nothing) . map (property GraphemeClusterBreak) . T.unpack

Example:

    len ("🤣h👩🏻elloä❤️❤️👩❤️👩" :: T.Text) == 13

Tom

On Feb 20, 2021, at 3:59 AM, amindfv--- via Haskell-Cafe wrote:
>>> With the "Data.Text.ICU.Char" module, it may be possible to determine grapheme boundaries:
>>> https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Char...
>> I'll look into this and report back.
> I'm quite prepared to believe this is wrong/misguided, but I was able to hack something together that works for my uses so far:
>
>     import Data.Text.ICU.Char
>     len = length . filter (==Nothing) . map (property GraphemeClusterBreak) . T.unpack
>
> Example:
>
>     len ("🤣h👩🏻elloä❤️❤️👩❤️👩" :: Text) == 13
There's unfortunately at least one problem, which requires attention from a text-icu maintainer, but AFAIK, there isn't one just at the moment (see the libraries list archive). The issue is that recent "icu" versions return GraphemeClusterBreak values that are outside the range known to the "Char" module:

    https://github.com/haskell/text-icu/blob/36c2cf236da06cb3b08fa8e5c3981d784d4...

but it blithely calls "toEnum" on whatever the FFI call returns, and triggers an error:

    [Nothing,*** Exception: toEnum{GraphemeClusterBreak}: tag (16) is outside of enumeration's range (0,10)
    CallStack (from HasCallStack):
      error, called at Data/Text/ICU/Char.hsc:865:19 in text-icu-0.7.0.1-08bd532cd2c809ab3173b6766231a799217ecc9a166de7458474e8784471d168:Data.Text.ICU.Char

But in fact, some of the new property values are exactly the ones relevant for detecting grapheme cluster boundaries (your algorithm looks too naïve), see:

    http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

    When citing the Unicode definition of grapheme clusters, it must be clear which of the two alternatives are being specified: extended versus legacy.

    Break at the start and end of text, unless the text is empty.

        GB1     sot ÷ Any
        GB2     Any ÷ eot

    Do not break between a CR and LF. Otherwise, break before and after controls.

        GB3     CR × LF
        GB4     (Control | CR | LF) ÷
        GB5     ÷ (Control | CR | LF)

    Do not break Hangul syllable sequences.

        GB6     L × (L | V | LV | LVT)
        GB7     (LV | V) × (V | T)
        GB8     (LVT | T) × T

    Do not break before extending characters or ZWJ.

        GB9     × (Extend | ZWJ)

    The GB9a and GB9b rules only apply to extended grapheme clusters:
    Do not break before SpacingMarks, or after Prepend characters.

        GB9a    × SpacingMark
        GB9b    Prepend ×

    Do not break within emoji modifier sequences or emoji zwj sequences.

        GB11    \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

    Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.

        GB12    sot (RI RI)* RI × RI
        GB13    [^RI] (RI RI)* RI × RI

    Otherwise, break everywhere.

        GB999   Any ÷ Any

    Notes:

    • Grapheme cluster boundaries can be transformed into simple regular expressions. For more information, see Section 6.3, State Machines.

    • The Grapheme_Base and Grapheme_Extend properties predated the development of the Grapheme_Cluster_Break property. The set of characters with Grapheme_Extend=Yes is used to derive the set of characters with Grapheme_Cluster_Break=Extend. However, the Grapheme_Base property proved to be insufficient for determining grapheme cluster boundaries. Grapheme_Base is no longer used by this specification.

-- Viktor.
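As a rough illustration of how the quoted rules compose, here is a minimal pure-Haskell sketch that counts clusters using only a subset of them (GB3, GB4, GB5, GB9, GB11, GB12/GB13 and GB999). The names `Cls`, `classify`, `step` and `graphemeCount` are hypothetical, and the character classes are hand-approximated from `generalCategory` plus a few hard-coded ranges instead of the real Unicode property data. Hangul (GB6 to GB8), SpacingMark and Prepend (GB9a/GB9b) are not handled at all, so this is a teaching aid under those assumptions and not a substitute for ICU.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)
import Data.List (foldl')

-- Simplified break classes; an ASSUMED approximation of Grapheme_Cluster_Break.
data Cls = CR | LF | Ctl | Ext | ZWJ | RI | Pict | Other
  deriving (Eq, Show)

classify :: Char -> Cls
classify c
  | c == '\r' = CR
  | c == '\n' = LF
  | c == '\x200D' = ZWJ
  | c >= '\x1F3FB' && c <= '\x1F3FF' = Ext -- emoji skin-tone modifiers
  | c >= '\x1F1E6' && c <= '\x1F1FF' = RI  -- regional indicators
  | cat `elem` [NonSpacingMark, EnclosingMark] = Ext
  | cat `elem` [Control, LineSeparator, ParagraphSeparator] = Ctl
  -- Rough stand-in for Extended_Pictographic (assumption, not the real set).
  | (c >= '\x2600' && c <= '\x27BF') || (c >= '\x1F300' && c <= '\x1FAFF') = Pict
  | otherwise = Other
  where
    cat = generalCategory c

data St = St
  { stPrev   :: !Cls  -- class of the previous character
  , stPict   :: !Bool -- text so far ends in: ExtPict Extend*
  , stJoined :: !Bool -- text so far ends in: ExtPict Extend* ZWJ
  , stRiRun  :: !Int  -- length of the run of RIs ending at stPrev
  , stCount  :: !Int  -- clusters counted so far
  }

-- Consume one character class, deciding whether a boundary precedes it.
step :: St -> Cls -> St
step st cur = St cur pict' joined' ri' count'
  where
    prev = stPrev st
    hard x = x `elem` [Ctl, CR, LF]
    noBreak
      | prev == CR && cur == LF = True          -- GB3
      | hard prev || hard cur = False           -- GB4, GB5
      | cur == Ext || cur == ZWJ = True         -- GB9
      | stJoined st && cur == Pict = True       -- GB11
      | prev == RI && cur == RI = odd (stRiRun st) -- GB12, GB13
      | otherwise = False                       -- GB999
    pict' = cur == Pict || (stPict st && cur == Ext)
    joined' = stPict st && cur == ZWJ
    ri' = if cur == RI then stRiRun st + 1 else 0
    count' = if noBreak then stCount st else stCount st + 1

-- | Approximate number of grapheme clusters in a String.
graphemeCount :: String -> Int
graphemeCount s = case map classify s of
  [] -> 0
  (c : cs) ->
    stCount (foldl' step (St c (c == Pict) False (if c == RI then 1 else 0) 1) cs)
```

Under these assumptions, `graphemeCount "a\x0308"` and `graphemeCount "\x2764\xFE0F"` each come out as a single cluster, and a ZWJ-joined emoji sequence is kept together by the GB11 case, which a per-character property filter cannot express because the decision depends on the preceding context.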

On Feb 20, 2021, at 5:56 AM, Viktor Dukhovni wrote:
> But in fact, exactly some of the new code points are relevant for detection of grapheme cluster boundaries (your algorithm looks too naïve) see:
More importantly, the ICU documentation does not recommend working with the underlying low-level properties and rules. Rather, the suggested way to traverse a string one grapheme at a time is to use a BreakIterator:

    https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boun...

Fortunately, these are also supported:

    https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Brea...

So my referral to the "Char" module probably led you astray. Sorry about that...

-- Viktor.

On Sat, Feb 20, 2021 at 06:12:58AM -0200, Viktor Dukhovni wrote:
> Fortunately, these are also supported:
> https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Brea...
A complete example (the NFC normalisation may be overkill):

    {-# LANGUAGE BangPatterns #-}
    module Main (main) where

    import qualified Data.Text as T
    import qualified Data.Text.Lazy as LT
    import qualified Data.Text.Lazy.Builder as LT
    import qualified Data.Text.Lazy.IO as LT
    import qualified Data.Text.Lazy.Builder.Int as LT
    import Data.Text.ICU.Break
    import Data.Text.ICU.Normalize
    import Data.Text.ICU.Types (LocaleName(..))
    import System.Environment

    main :: IO ()
    main = do
        -- One character (grapheme cluster) break iterator, reused for every argument.
        brkIter <- breakCharacter Current T.empty
        getArgs >>= mapM_ (go brkIter . normalize NFC . T.pack)
      where
        go :: BreakIterator () -> T.Text -> IO ()
        go b t = do
            setText b t
            len <- count 0
            LT.putStrLn $ LT.toLazyText $
                LT.fromText t <> LT.fromString " has grapheme length: " <> LT.decimal len
          where
            -- Each successful 'next' steps over one grapheme cluster boundary.
            count :: Int -> IO Int
            count !acc = next b >>= maybe (pure acc) (const $ count $ acc + 1)

-- Viktor.
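For callers who only need the number and not the IO-based iterator, text-icu also exposes a pure interface in "Data.Text.ICU". The following is a minimal sketch assuming the `breaks` and `breakCharacter` functions from that module; `graphemeLength` is a hypothetical name, and whether to normalise first is left to the caller.

```haskell
import qualified Data.Text as T
import Data.Text.ICU (breakCharacter, breaks)
import Data.Text.ICU.Types (LocaleName (Current))

-- | Number of grapheme clusters, as segmented by ICU's character
-- break rules for the current locale. (Hypothetical helper name.)
graphemeLength :: T.Text -> Int
graphemeLength = length . breaks (breakCharacter Current)
```

In GHCi, `graphemeLength (T.pack "a\x0308")` should then report a single cluster for the decomposed "ä" from the original question, without any NFC step.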
participants (2): amindfv@mailbox.org, Viktor Dukhovni