
Sounds good to me. Thanks for doing this. When you think you are ready, just submit a patch. (As others have noted, maybe isSpace isn't the only function that could benefit from this kind of attention.) Simon | -----Original Message----- | From: libraries-bounces@haskell.org [mailto:libraries-bounces@haskell.org] On | Behalf Of John MacFarlane | Sent: 28 October 2012 19:16 | To: libraries@haskell.org | Subject: Improving Data.Char.isSpace performance | | I think that 'isSpace' from Data.Char (and hence also 'words' from the | Prelude) is not as fast as it could be. Here is the definition | (which is actually found in GHC.Unicode): | | isSpace :: Char -> Bool | isSpace c = | c == ' ' || | c == '\t' || | c == '\n' || | c == '\r' || | c == '\f' || | c == '\v' || | c == '\xa0' || | iswspace (fromIntegral (C.ord c)) /= 0 | | I presume that the point of the disjuncts at the beginning is to | avoid the call to iswspace for the most common space characters. | The problem is that most characters (in most texts) are not space | characters, and for nonspace characters iswspace will always be | called. | | So I investigated a possible optimization that would also check for | the most common nonspace characters before calling iswspace: | | isSpace_Alt :: Char -> Bool | isSpace_Alt c | c > '\x20' && c < '\xa0' = False | | c == ' ' = True | | '\t' <= c && c <= '\r' = True | | c == '\xa0' = True | | otherwise = iswspace (fromIntegral (C.ord c)) /= 0 | | In my benchmarks, this function significantly outperforms isSpace. | I also found that a version of isSpace that does not check | for nonspace characters, but uses case matching instead of | a disjunction of equality tests, outperformed isSpace (but | was usually not as fast as isSpace_Alt, and the difference | mostly disappears with -O2): | | isSpace_Pattern :: Char -> Bool | isSpace_Pattern c | | c == ' ' = True | | '\t' <= c && c <= '\r' = True | | c == '\xa0' = True | | otherwise = iswspace (fromIntegral (C.ord c)) /= 0 | | I benchmarked all three functions against five types of text | (all ascii, all Greek, Haskell code, characters 0..255, and | all spaces), and here are the (normalized) results: | | Compiled with 'ghc --make': | -------------------------------------------------------------- | Input isSpace_DataChar isSpace_Pattern isSpace_Alt | --------------- ---------------- --------------- ----------- | ascii text 1.0 0.54 0.17 | greek text 1.0 0.57 0.71 | haskell code 1.0 0.57 0.24 | chars 0..255 1.0 0.54 0.39 | all space chars 1.0 0.70 0.90 | -------------------------------------------------------------- | | Compiled with 'ghc --make -O2': | -------------------------------------------------------------- | Input isSpace_DataChar isSpace_Pattern isSpace_Alt | --------------- ---------------- --------------- ----------- | ascii text 1.0 0.93 0.40 | greek text 1.0 0.98 0.99 | haskell code 1.0 1.03 0.58 | chars 0..255 1.0 0.92 0.62 | all space chars 1.0 0.88 0.92 | -------------------------------------------------------------- | | My benchmark program can be found here: | https://gist.github.com/3967761 | | I'd like to propose that we consider replacing the definition | of isSpace with isSpace_Alt. | | John | | | _______________________________________________ | Libraries mailing list | Libraries@haskell.org | http://www.haskell.org/mailman/listinfo/libraries