Improving Data.Char.isSpace performance

28 Oct 2012

      I think that 'isSpace' from Data.Char (and hence also 'words' from the
Prelude) is not as fast as it could be.  Here is the definition
(which is actually found in GHC.Unicode):

isSpace :: Char -> Bool
isSpace c =
    c == ' '     ||
    c == '\t'    ||
    c == '\n'    ||
    c == '\r'    ||
    c == '\f'    ||
    c == '\v'    ||
    c == '\xa0'  ||
    iswspace (fromIntegral (C.ord c)) /= 0

I presume that the point of the disjuncts at the beginning is to
avoid the call to iswspace for the most common space characters.
The problem is that most characters (in most texts) are not space
characters, and for nonspace characters iswspace will always be
called.

So I investigated a possible optimization that would also check for
the most common nonspace characters before calling iswspace:

isSpace_Alt :: Char -> Bool
isSpace_Alt c | c > '\x20' && c < '\xa0' = False
              | c == ' '                 = True
              | '\t' <= c && c <= '\r'   = True
              | c == '\xa0'              = True
              | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0

In my benchmarks, this function significantly outperforms isSpace.
I also found that a version of isSpace that does not check
for nonspace characters, but uses case matching instead of
a disjunction of equality tests, outperformed isSpace (but
was usually not as fast as isSpace_Alt, and the difference
mostly disappears with -O2):

isSpace_Pattern :: Char -> Bool
isSpace_Pattern c
   | c == ' '                  = True
   | '\t' <= c && c <= '\r'    = True
   | c == '\xa0'               = True
   | otherwise                 = iswspace (fromIntegral (C.ord c)) /= 0

I benchmarked all three functions against five types of text
(all ascii, all Greek, Haskell code, characters 0..255, and
all spaces), and here are the (normalized) results:

Compiled with 'ghc --make':
--------------------------------------------------------------
Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
--------------- ----------------  ---------------  -----------
ascii text      1.0               0.54             0.17
greek text      1.0               0.57             0.71
haskell code    1.0               0.57             0.24
chars 0..255    1.0               0.54             0.39
all space chars 1.0               0.70             0.90
--------------------------------------------------------------

Compiled with 'ghc --make -O2':
--------------------------------------------------------------
Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
--------------- ----------------  ---------------  -----------
ascii text      1.0               0.93             0.40
greek text      1.0               0.98             0.99
haskell code    1.0               1.03             0.58
chars 0..255    1.0               0.92             0.62
all space chars 1.0               0.88             0.92
--------------------------------------------------------------

My benchmark program can be found here:
https://gist.github.com/3967761

I'd like to propose that we consider replacing the definition
of isSpace with isSpace_Alt.

John

John MacFarlane

Johan Tibell

Edward A Kmett

John MacFarlane

Johan Tibell

Simon Peyton-Jones

John MacFarlane

Brandon Allbery

Vincent Hanquez

Ivan Lazar Miljenovic

wren ng thornton

John MacFarlane

Joachim Breitner

Joachim Breitner

Patrick Palka

wren ng thornton

Simon Peyton-Jones

John MacFarlane

John MacFarlane

Johan Tibell

tags

participants (10)