
On 30 October 2012 04:52, John MacFarlane
I have experimented with a couple of variants that seem better than the definition I originally proposed.
The most promising is
isSpace_Alt6 :: Char -> Bool {-# INLINE isSpace_Alt6 #-} isSpace_Alt6 ' ' = True isSpace_Alt6 '\n' = True isSpace_Alt6 '\t' = True isSpace_Alt6 '\r' = True isSpace_Alt6 '\x0B' = True isSpace_Alt6 '\x0C' = True isSpace_Alt6 '\xA0' = True isSpace_Alt6 c | c < '\x1680' = False | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
Is there any particular reason you're using a guard rather than a pattern match for the \x1680 case?
Benchmarks can be found here:
the program : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs results: with ghc --make : http://johnmacfarlane.net/isSpace/unoptimized.html with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html
John
+++ John MacFarlane [Oct 28 12 12:16 ]:
I think that 'isSpace' from Data.Char (and hence also 'words' from the Prelude) is not as fast as it could be. Here is the definition (which is actually found in GHC.Unicode):
isSpace :: Char -> Bool isSpace c = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v' || c == '\xa0' || iswspace (fromIntegral (C.ord c)) /= 0
I presume that the point of the disjuncts at the beginning is to avoid the call to iswspace for the most common space characters. The problem is that most characters (in most texts) are not space characters, and for nonspace characters iswspace will always be called.
So I investigated a possible optimization that would also check for the most common nonspace characters before calling iswspace:
isSpace_Alt :: Char -> Bool isSpace_Alt c | c > '\x20' && c < '\xa0' = False | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
In my benchmarks, this function significantly outperforms isSpace. I also found that a version of isSpace that does not check for nonspace characters, but uses case matching instead of a disjunction of equality tests, outperformed isSpace (but was usually not as fast as isSpace_Alt, and the difference mostly disappears with -O2):
isSpace_Pattern :: Char -> Bool isSpace_Pattern c | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
I benchmarked all three functions against five types of text (all ascii, all Greek, Haskell code, characters 0..255, and all spaces), and here are the (normalized) results:
Compiled with 'ghc --make': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.54 0.17 greek text 1.0 0.57 0.71 haskell code 1.0 0.57 0.24 chars 0..255 1.0 0.54 0.39 all space chars 1.0 0.70 0.90 --------------------------------------------------------------
Compiled with 'ghc --make -O2': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.93 0.40 greek text 1.0 0.98 0.99 haskell code 1.0 1.03 0.58 chars 0..255 1.0 0.92 0.62 all space chars 1.0 0.88 0.92 --------------------------------------------------------------
My benchmark program can be found here: https://gist.github.com/3967761
I'd like to propose that we consider replacing the definition of isSpace with isSpace_Alt.
John
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com