
On Wed, Oct 20, 2010 at 09:57:04AM -0700, Bryan O'Sullivan wrote:
On Wed, Oct 20, 2010 at 9:52 AM, Johan Tibell wrote:
I think the right thing to do here is to perform normalization first, but I'm not sure.
Hi, friendly neighbourhood Unicode expert here. Yes, in the case Ian cites, you want to perform normalization before doing the replacement. The behaviour he demonstrates is normal, expected, and consistent with the standard.
OK, so that works with the previous example:

    Data.Text Data.Text.IO Data.Text.ICU> let t = pack "z\x0061\x030A\x0061z"
    Data.Text Data.Text.IO Data.Text.ICU> t
    "za\778az"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn t
    zåaz
    Data.Text Data.Text.IO Data.Text.ICU> normalize NFC t
    "z\229az"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (normalize NFC t)
    zåaz
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (replace (pack "a") (pack "y") (normalize NFC t))
    zåyz

but only because now characters and codepoints are 1:1. If we were using a character for which there is no single (precomposed) code point, e.g. (the probably non-existent, but I understand there are real examples) p-ring:

    Data.Text Data.Text.IO Data.Text.ICU> let t = pack "zp\x030Apz"
    Data.Text Data.Text.IO Data.Text.ICU> t
    "zp\778pz"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn t
    zp̊pz
    Data.Text Data.Text.IO Data.Text.ICU> normalize NFC t
    "zp\778pz"
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (normalize NFC t)
    zp̊pz
    Data.Text Data.Text.IO Data.Text.ICU> putStrLn (replace (pack "p") (pack "y") (normalize NFC t))
    zẙyz

then it doesn't work: the first "p" is replaced even though it carries the combining ring.

Johan wrote:
If you process a string code point by code point you might mistakenly confuse a plain "a" (A) with a "å" (A-RING *or* A + COMBINING RING).
But when characters and codepoints are 1:1, you /can/ process code point by code point. Am I missing something?

Thanks
Ian
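
For illustration, here is a minimal sketch of one way to keep a replacement from matching a base character that carries a combining mark, as in the p-ring case where NFC has no precomposed form to fall back on. It groups each base character with the combining marks that follow it and compares whole groups rather than single code points. The names `clusters` and `replaceCluster` are hypothetical, and the grouping is only a rough approximation of Unicode grapheme clusters, not the full segmentation rules. It assumes only `Data.Char.isMark` from base and the `text` package.

```haskell
import Data.Char (isMark)
import qualified Data.Text as T

-- Hypothetical helper: split text into approximate clusters, each made of
-- one leading character plus any combining marks that directly follow it.
clusters :: T.Text -> [T.Text]
clusters t = case T.uncons t of
  Nothing -> []
  Just (c, rest) ->
    let (marks, rest') = T.span isMark rest
    in T.cons c marks : clusters rest'

-- Hypothetical helper: replace every cluster that is exactly equal to
-- the needle. A base character followed by a combining mark forms a
-- different cluster, so it is left alone.
replaceCluster :: T.Text -> T.Text -> T.Text -> T.Text
replaceCluster needle new = T.concat . map swap . clusters
  where
    swap c
      | c == needle = new
      | otherwise   = c

main :: IO ()
main = do
  let t = T.pack "zp\x030Apz"
  -- Only the bare "p" is replaced; "p" + COMBINING RING ABOVE is kept.
  print (replaceCluster (T.pack "p") (T.pack "y") t)
  -- prints "zp\778yz"
```

Unlike `replace`, this sketch only handles needles that are a single cluster, and both needle and haystack should be normalized to the same form first so that equal characters compare equal.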