
+1
It might be worth hunting for similar low hanging fruit as well.
-Edward
On Oct 28, 2012, at 3:16 PM, John MacFarlane
I think that 'isSpace' from Data.Char (and hence also 'words' from the Prelude) is not as fast as it could be. Here is the definition (which is actually found in GHC.Unicode):
isSpace :: Char -> Bool isSpace c = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v' || c == '\xa0' || iswspace (fromIntegral (C.ord c)) /= 0
I presume that the point of the disjuncts at the beginning is to avoid the call to iswspace for the most common space characters. The problem is that most characters (in most texts) are not space characters, and for nonspace characters iswspace will always be called.
So I investigated a possible optimization that would also check for the most common nonspace characters before calling iswspace:
isSpace_Alt :: Char -> Bool isSpace_Alt c | c > '\x20' && c < '\xa0' = False | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
In my benchmarks, this function significantly outperforms isSpace. I also found that a version of isSpace that does not check for nonspace characters, but uses case matching instead of a disjunction of equality tests, outperformed isSpace (but was usually not as fast as isSpace_Alt, and the difference mostly disappears with -O2):
isSpace_Pattern :: Char -> Bool isSpace_Pattern c | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
I benchmarked all three functions against five types of text (all ascii, all Greek, Haskell code, characters 0..255, and all spaces), and here are the (normalized) results:
Compiled with 'ghc --make': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.54 0.17 greek text 1.0 0.57 0.71 haskell code 1.0 0.57 0.24 chars 0..255 1.0 0.54 0.39 all space chars 1.0 0.70 0.90 --------------------------------------------------------------
Compiled with 'ghc --make -O2': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.93 0.40 greek text 1.0 0.98 0.99 haskell code 1.0 1.03 0.58 chars 0..255 1.0 0.92 0.62 all space chars 1.0 0.88 0.92 --------------------------------------------------------------
My benchmark program can be found here: https://gist.github.com/3967761
I'd like to propose that we consider replacing the definition of isSpace with isSpace_Alt.
John
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries