
On Feb 20, 2021, at 3:59 AM, amindfv--- via Haskell-Cafe wrote:

With the "Data.Text.ICU.Char" module, it may be possible to determine grapheme boundaries:
https://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU-Char...
I'll look into this and report back.
I'm quite prepared to believe this is wrong/misguided, but I was able to hack something together that works for my uses so far:
import Data.Text (Text)
import qualified Data.Text as T
import Data.Text.ICU.Char
len = length . filter (== Nothing) . map (property GraphemeClusterBreak) . T.unpack
Example:
len ("🤣h👩🏻elloä❤️❤️👩❤️👩" :: Text) == 13
There's unfortunately at least one problem, which requires attention from a text-icu maintainer, but AFAIK there isn't one at the moment (see the libraries list archive). The issue is that recent "icu" versions return GraphemeClusterBreak values that are outside the range known to the "Char" module:

https://github.com/haskell/text-icu/blob/36c2cf236da06cb3b08fa8e5c3981d784d4...

but it blithely calls "toEnum" on whatever the FFI call returns, and triggers an error:

[Nothing,*** Exception: toEnum{GraphemeClusterBreak}: tag (16) is outside of enumeration's range (0,10)
CallStack (from HasCallStack):
  error, called at Data/Text/ICU/Char.hsc:865:19 in text-icu-0.7.0.1-08bd532cd2c809ab3173b6766231a799217ecc9a166de7458474e8784471d168:Data.Text.ICU.Char

But in fact, some of the new values are exactly the ones relevant for detecting grapheme cluster boundaries (your algorithm looks too naïve), see:

http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules

When citing the Unicode definition of grapheme clusters, it must be clear which of the two alternatives are being specified: extended versus legacy.

Break at the start and end of text, unless the text is empty.

  GB1    sot ÷ Any
  GB2    Any ÷ eot

Do not break between a CR and LF. Otherwise, break before and after controls.

  GB3    CR × LF
  GB4    (Control | CR | LF) ÷
  GB5    ÷ (Control | CR | LF)

Do not break Hangul syllable sequences.

  GB6    L × (L | V | LV | LVT)
  GB7    (LV | V) × (V | T)
  GB8    (LVT | T) × T

Do not break before extending characters or ZWJ.

  GB9    × (Extend | ZWJ)

The GB9a and GB9b rules only apply to extended grapheme clusters:
Do not break before SpacingMarks, or after Prepend characters.

  GB9a   × SpacingMark
  GB9b   Prepend ×

Do not break within emoji modifier sequences or emoji zwj sequences.

  GB11   \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.

  GB12   sot (RI RI)* RI × RI
  GB13   [^RI] (RI RI)* RI × RI

Otherwise, break everywhere.

  GB999  Any ÷ Any

Notes:

• Grapheme cluster boundaries can be transformed into simple regular expressions. For more information, see Section 6.3, State Machines.

• The Grapheme_Base and Grapheme_Extend properties predated the development of the Grapheme_Cluster_Break property. The set of characters with Grapheme_Extend=Yes is used to derive the set of characters with Grapheme_Cluster_Break=Extend. However, the Grapheme_Base property proved to be insufficient for determining grapheme cluster boundaries. Grapheme_Base is no longer used by this specification.

-- Viktor.
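
To illustrate the "toEnum" failure described above, here is a minimal sketch of a range-checked conversion. It is not the text-icu code: the Gcb type is a hypothetical eleven-constructor stand-in for the enumeration in the "Char" module (chosen only to reproduce the (0,10) range from the error message), and checked and safeToEnum are illustrative helper names, not existing library functions.

```haskell
-- Hypothetical stand-in for the enumeration known to the binding.
-- Eleven constructors give the tag range (0,10) seen in the error message;
-- the names and their order are assumptions, not copied from text-icu.
data Gcb
  = Control | CR | Extend | L | LF | LV | LVT | T | V | SpacingMark | Prepend
  deriving (Show, Eq, Enum, Bounded)

-- Convert a tag only if it lies between the tags of the two given bounds.
checked :: Enum a => a -> a -> Int -> Maybe a
checked lo hi n
  | n < fromEnum lo || n > fromEnum hi = Nothing
  | otherwise                          = Just (toEnum n)

-- Range-checked alternative to 'toEnum': returns Nothing for a tag outside
-- the enumeration instead of raising an exception.
safeToEnum :: (Enum a, Bounded a) => Int -> Maybe a
safeToEnum = checked minBound maxBound
```

With this stand-in type, "safeToEnum 16 :: Maybe Gcb" yields Nothing rather than an exception. That only avoids the crash; the newer break classes would still have to be added to the enumeration before rules such as GB11 could be applied.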
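
The boundary rules quoted above can be sketched as a left-to-right scan over code points that have already been classified. This is only an illustration under stated assumptions: Class, St, noBreak, step and clusterCount are hypothetical names; classifying actual characters (which needs the Unicode data tables, e.g. via ICU) is left to the caller; and Extended_Pictographic, which is a separate property in the standard, is folded into the same type for brevity.

```haskell
import Data.List (foldl')

-- Break classes used by the rules. ExtPict stands for
-- \p{Extended_Pictographic}, folded in here for brevity (an assumption).
data Class
  = Other | CR | LF | Control | Extend | ZWJ | RI | Prepend | SpacingMark
  | L | V | T | LV | LVT | ExtPict
  deriving (Eq, Show)

-- Context carried along the scan, describing the text up to the previous
-- code point.
data St = St
  { prev     :: Class  -- class of the previous code point
  , riOdd    :: Bool   -- an odd-length run of RI ends at prev
  , emoji    :: Bool   -- text up to prev ends with ExtPict Extend*
  , emojiZwj :: Bool   -- text up to prev ends with ExtPict Extend* ZWJ
  }

isControl :: Class -> Bool
isControl c = c `elem` [Control, CR, LF]

-- True when there is NO boundary between the previous code point and c.
-- Alternatives are tried in rule order; a failed guard falls through.
noBreak :: St -> Class -> Bool
noBreak st c = case (prev st, c) of
  (CR, LF)                                    -> True   -- GB3
  (p, _) | isControl p                        -> False  -- GB4
  (_, n) | isControl n                        -> False  -- GB5
  (L, n) | n `elem` [L, V, LV, LVT]           -> True   -- GB6
  (p, n) | p `elem` [LV, V], n `elem` [V, T]  -> True   -- GB7
  (p, T) | p `elem` [LVT, T]                  -> True   -- GB8
  (_, n) | n `elem` [Extend, ZWJ]             -> True   -- GB9
  (_, SpacingMark)                            -> True   -- GB9a
  (Prepend, _)                                -> True   -- GB9b
  (ZWJ, ExtPict) | emojiZwj st                -> True   -- GB11
  (RI, RI) | riOdd st                         -> True   -- GB12, GB13
  _                                           -> False  -- GB999

-- Advance the context past code point c.
step :: St -> Class -> St
step st c = St
  { prev     = c
  , riOdd    = c == RI && not (riOdd st)
  , emoji    = c == ExtPict || (c == Extend && emoji st)
  , emojiZwj = c == ZWJ && emoji st
  }

-- Number of extended grapheme clusters in a classified sequence.
clusterCount :: [Class] -> Int
clusterCount []       = 0
clusterCount (c : cs) = snd (foldl' go (start, 1) cs)
  where
    start        = step (St Other False False False) c
    go (st, n) x = (step st x, if noBreak st x then n else n + 1)
```

For instance, "clusterCount [RI, RI, RI, RI]" evaluates to 2 (two flags), and "clusterCount [ExtPict, Extend, ZWJ, ExtPict]" evaluates to 1, whereas "clusterCount [Other, ZWJ, ExtPict]" evaluates to 2 because the ZWJ is not preceded by a pictographic character. GB11 to GB13 are the rules that need context beyond the adjacent pair, which is why a per-character filter cannot express them.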