Improving Data.Char.isSpace performance

I think that 'isSpace' from Data.Char (and hence also 'words' from the Prelude) is not as fast as it could be. Here is the definition (which is actually found in GHC.Unicode): isSpace :: Char -> Bool isSpace c = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v' || c == '\xa0' || iswspace (fromIntegral (C.ord c)) /= 0 I presume that the point of the disjuncts at the beginning is to avoid the call to iswspace for the most common space characters. The problem is that most characters (in most texts) are not space characters, and for nonspace characters iswspace will always be called. So I investigated a possible optimization that would also check for the most common nonspace characters before calling iswspace: isSpace_Alt :: Char -> Bool isSpace_Alt c | c > '\x20' && c < '\xa0' = False | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0 In my benchmarks, this function significantly outperforms isSpace. I also found that a version of isSpace that does not check for nonspace characters, but uses case matching instead of a disjunction of equality tests, outperformed isSpace (but was usually not as fast as isSpace_Alt, and the difference mostly disappears with -O2): isSpace_Pattern :: Char -> Bool isSpace_Pattern c | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0 I benchmarked all three functions against five types of text (all ascii, all Greek, Haskell code, characters 0..255, and all spaces), and here are the (normalized) results: Compiled with 'ghc --make': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.54 0.17 greek text 1.0 0.57 0.71 haskell code 1.0 0.57 0.24 chars 0..255 1.0 0.54 0.39 all space chars 1.0 0.70 0.90 -------------------------------------------------------------- Compiled with 'ghc --make -O2': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.93 0.40 greek text 1.0 0.98 0.99 haskell code 1.0 1.03 0.58 chars 0..255 1.0 0.92 0.62 all space chars 1.0 0.88 0.92 -------------------------------------------------------------- My benchmark program can be found here: https://gist.github.com/3967761 I'd like to propose that we consider replacing the definition of isSpace with isSpace_Alt. John

+1
It might be worth hunting for similar low hanging fruit as well.
-Edward
On Oct 28, 2012, at 3:16 PM, John MacFarlane
I think that 'isSpace' from Data.Char (and hence also 'words' from the Prelude) is not as fast as it could be. Here is the definition (which is actually found in GHC.Unicode):
isSpace :: Char -> Bool isSpace c = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v' || c == '\xa0' || iswspace (fromIntegral (C.ord c)) /= 0
I presume that the point of the disjuncts at the beginning is to avoid the call to iswspace for the most common space characters. The problem is that most characters (in most texts) are not space characters, and for nonspace characters iswspace will always be called.
So I investigated a possible optimization that would also check for the most common nonspace characters before calling iswspace:
isSpace_Alt :: Char -> Bool isSpace_Alt c | c > '\x20' && c < '\xa0' = False | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
In my benchmarks, this function significantly outperforms isSpace. I also found that a version of isSpace that does not check for nonspace characters, but uses case matching instead of a disjunction of equality tests, outperformed isSpace (but was usually not as fast as isSpace_Alt, and the difference mostly disappears with -O2):
isSpace_Pattern :: Char -> Bool isSpace_Pattern c | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
I benchmarked all three functions against five types of text (all ascii, all Greek, Haskell code, characters 0..255, and all spaces), and here are the (normalized) results:
Compiled with 'ghc --make': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.54 0.17 greek text 1.0 0.57 0.71 haskell code 1.0 0.57 0.24 chars 0..255 1.0 0.54 0.39 all space chars 1.0 0.70 0.90 --------------------------------------------------------------
Compiled with 'ghc --make -O2': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.93 0.40 greek text 1.0 0.98 0.99 haskell code 1.0 1.03 0.58 chars 0..255 1.0 0.92 0.62 all space chars 1.0 0.88 0.92 --------------------------------------------------------------
My benchmark program can be found here: https://gist.github.com/3967761
I'd like to propose that we consider replacing the definition of isSpace with isSpace_Alt.
John
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

+++ Edward A Kmett [Oct 28 12 17:08 ]:
It might be worth hunting for similar low hanging fruit as well.
No doubt a similar approach would work for isLetter, isDigit, toUpper, toLower, etc. One thing puzzles me, though. Maybe someone here will have an explanation. I added a test case to my benchmark for isSpace *imported* from Data.Char. The imported isSpace benchmarks much faster (up to 5X) than the isSpace_DataChar, even though the latter has the same definition as the former. Wren Thronton noticed this too, and suggests (in http://community.haskell.org/~wren/bytestring-lexing/test/bench/BenchIsSpace...) that the difference "appears to be something special in how the base library was compiled," but I have no idea what that could be. John

On Sun, Oct 28, 2012 at 4:35 PM, John MacFarlane
+++ Edward A Kmett [Oct 28 12 17:08 ]:
It might be worth hunting for similar low hanging fruit as well.
No doubt a similar approach would work for isLetter, isDigit, toUpper, toLower, etc.
One thing puzzles me, though. Maybe someone here will have an explanation. I added a test case to my benchmark for isSpace *imported* from Data.Char. The imported isSpace benchmarks much faster (up to 5X) than the isSpace_DataChar, even though the latter has the same definition as the former. Wren Thronton noticed this too, and suggests (in http://community.haskell.org/~wren/bytestring-lexing/test/bench/BenchIsSpace...) that the difference "appears to be something special in how the base library was compiled," but I have no idea what that could be.
Could be inlining. Compile your benchmark with -ddump-simpl and see if isSpace got inlined or not. -- Johan

| One thing puzzles me, though. Maybe someone here will have an | explanation. I added a test case to my benchmark for isSpace | *imported* from Data.Char. The imported isSpace benchmarks | much faster (up to 5X) than the isSpace_DataChar, even though the latter has | the same definition as the former. Wren Thronton noticed this too, | and suggests (in | http://community.haskell.org/~wren/bytestring- | lexing/test/bench/BenchIsSpace.hs) | that the difference "appears to be something special in how the base | library was compiled," but I have no idea what that could be. A way to reproduce this would be good. (If so, open a ticket.) The only thing I can think of is that libraries may be compiled with -O2, and that might make a difference. Simon | -----Original Message----- | From: libraries-bounces@haskell.org [mailto:libraries-bounces@haskell.org] On | Behalf Of John MacFarlane | Sent: 28 October 2012 23:35 | To: Edward A Kmett | Cc: libraries@haskell.org | Subject: Re: Improving Data.Char.isSpace performance | | +++ Edward A Kmett [Oct 28 12 17:08 ]: | > It might be worth hunting for similar low hanging fruit as well. | | No doubt a similar approach would work for isLetter, isDigit, | toUpper, toLower, etc. | | One thing puzzles me, though. Maybe someone here will have an | explanation. I added a test case to my benchmark for isSpace | *imported* from Data.Char. The imported isSpace benchmarks | much faster (up to 5X) than the isSpace_DataChar, even though the latter has | the same definition as the former. Wren Thronton noticed this too, | and suggests (in | http://community.haskell.org/~wren/bytestring- | lexing/test/bench/BenchIsSpace.hs) | that the difference "appears to be something special in how the base | library was compiled," but I have no idea what that could be. | | John | | | _______________________________________________ | Libraries mailing list | Libraries@haskell.org | http://www.haskell.org/mailman/listinfo/libraries

I have experimented with a couple of variants that seem better than the definition I originally proposed. The most promising is isSpace_Alt6 :: Char -> Bool {-# INLINE isSpace_Alt6 #-} isSpace_Alt6 ' ' = True isSpace_Alt6 '\n' = True isSpace_Alt6 '\t' = True isSpace_Alt6 '\r' = True isSpace_Alt6 '\x0B' = True isSpace_Alt6 '\x0C' = True isSpace_Alt6 '\xA0' = True isSpace_Alt6 c | c < '\x1680' = False | otherwise = iswspace (fromIntegral (C.ord c)) /= 0 Benchmarks can be found here: the program : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs results: with ghc --make : http://johnmacfarlane.net/isSpace/unoptimized.html with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html John +++ John MacFarlane [Oct 28 12 12:16 ]:
I think that 'isSpace' from Data.Char (and hence also 'words' from the Prelude) is not as fast as it could be. Here is the definition (which is actually found in GHC.Unicode):
isSpace :: Char -> Bool isSpace c = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v' || c == '\xa0' || iswspace (fromIntegral (C.ord c)) /= 0
I presume that the point of the disjuncts at the beginning is to avoid the call to iswspace for the most common space characters. The problem is that most characters (in most texts) are not space characters, and for nonspace characters iswspace will always be called.
So I investigated a possible optimization that would also check for the most common nonspace characters before calling iswspace:
isSpace_Alt :: Char -> Bool isSpace_Alt c | c > '\x20' && c < '\xa0' = False | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
In my benchmarks, this function significantly outperforms isSpace. I also found that a version of isSpace that does not check for nonspace characters, but uses case matching instead of a disjunction of equality tests, outperformed isSpace (but was usually not as fast as isSpace_Alt, and the difference mostly disappears with -O2):
isSpace_Pattern :: Char -> Bool isSpace_Pattern c | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
I benchmarked all three functions against five types of text (all ascii, all Greek, Haskell code, characters 0..255, and all spaces), and here are the (normalized) results:
Compiled with 'ghc --make': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.54 0.17 greek text 1.0 0.57 0.71 haskell code 1.0 0.57 0.24 chars 0..255 1.0 0.54 0.39 all space chars 1.0 0.70 0.90 --------------------------------------------------------------
Compiled with 'ghc --make -O2': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.93 0.40 greek text 1.0 0.98 0.99 haskell code 1.0 1.03 0.58 chars 0..255 1.0 0.92 0.62 all space chars 1.0 0.88 0.92 --------------------------------------------------------------
My benchmark program can be found here: https://gist.github.com/3967761
I'd like to propose that we consider replacing the definition of isSpace with isSpace_Alt.
John
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

On Mon, Oct 29, 2012 at 1:52 PM, John MacFarlane
I have experimented with a couple of variants that seem better than the definition I originally proposed.
How about U+0085? -- brandon s allbery kf8nh sine nomine associates allbery.b@gmail.com ballbery@sinenomine.net unix/linux, openafs, kerberos, infrastructure http://sinenomine.net

On 10/29/2012 06:01 PM, Brandon Allbery wrote:
On Mon, Oct 29, 2012 at 1:52 PM, John MacFarlane
mailto:jgm@berkeley.edu> wrote: I have experimented with a couple of variants that seem better than the definition I originally proposed.
How about U+0085?
U+0085 is not defined as a space char according to http://www.fileformat.info/info/unicode/char/85/index.htm. -- Vincent

On 30 October 2012 04:52, John MacFarlane
I have experimented with a couple of variants that seem better than the definition I originally proposed.
The most promising is
isSpace_Alt6 :: Char -> Bool {-# INLINE isSpace_Alt6 #-} isSpace_Alt6 ' ' = True isSpace_Alt6 '\n' = True isSpace_Alt6 '\t' = True isSpace_Alt6 '\r' = True isSpace_Alt6 '\x0B' = True isSpace_Alt6 '\x0C' = True isSpace_Alt6 '\xA0' = True isSpace_Alt6 c | c < '\x1680' = False | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
Is there any particular reason you're using a guard rather than a pattern match for the \x1680 case?
Benchmarks can be found here:
the program : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs results: with ghc --make : http://johnmacfarlane.net/isSpace/unoptimized.html with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html
John
+++ John MacFarlane [Oct 28 12 12:16 ]:
I think that 'isSpace' from Data.Char (and hence also 'words' from the Prelude) is not as fast as it could be. Here is the definition (which is actually found in GHC.Unicode):
isSpace :: Char -> Bool isSpace c = c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f' || c == '\v' || c == '\xa0' || iswspace (fromIntegral (C.ord c)) /= 0
I presume that the point of the disjuncts at the beginning is to avoid the call to iswspace for the most common space characters. The problem is that most characters (in most texts) are not space characters, and for nonspace characters iswspace will always be called.
So I investigated a possible optimization that would also check for the most common nonspace characters before calling iswspace:
isSpace_Alt :: Char -> Bool isSpace_Alt c | c > '\x20' && c < '\xa0' = False | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
In my benchmarks, this function significantly outperforms isSpace. I also found that a version of isSpace that does not check for nonspace characters, but uses case matching instead of a disjunction of equality tests, outperformed isSpace (but was usually not as fast as isSpace_Alt, and the difference mostly disappears with -O2):
isSpace_Pattern :: Char -> Bool isSpace_Pattern c | c == ' ' = True | '\t' <= c && c <= '\r' = True | c == '\xa0' = True | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
I benchmarked all three functions against five types of text (all ascii, all Greek, Haskell code, characters 0..255, and all spaces), and here are the (normalized) results:
Compiled with 'ghc --make': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.54 0.17 greek text 1.0 0.57 0.71 haskell code 1.0 0.57 0.24 chars 0..255 1.0 0.54 0.39 all space chars 1.0 0.70 0.90 --------------------------------------------------------------
Compiled with 'ghc --make -O2': -------------------------------------------------------------- Input isSpace_DataChar isSpace_Pattern isSpace_Alt --------------- ---------------- --------------- ----------- ascii text 1.0 0.93 0.40 greek text 1.0 0.98 0.99 haskell code 1.0 1.03 0.58 chars 0..255 1.0 0.92 0.62 all space chars 1.0 0.88 0.92 --------------------------------------------------------------
My benchmark program can be found here: https://gist.github.com/3967761
I'd like to propose that we consider replacing the definition of isSpace with isSpace_Alt.
John
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries
-- Ivan Lazar Miljenovic Ivan.Miljenovic@gmail.com http://IvanMiljenovic.wordpress.com

On 10/29/12 6:30 PM, Ivan Lazar Miljenovic wrote:
On 30 October 2012 04:52, John MacFarlane
wrote: I have experimented with a couple of variants that seem better than the definition I originally proposed.
The most promising is
isSpace_Alt6 :: Char -> Bool {-# INLINE isSpace_Alt6 #-} isSpace_Alt6 ' ' = True isSpace_Alt6 '\n' = True isSpace_Alt6 '\t' = True isSpace_Alt6 '\r' = True isSpace_Alt6 '\x0B' = True isSpace_Alt6 '\x0C' = True isSpace_Alt6 '\xA0' = True isSpace_Alt6 c | c < '\x1680' = False | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
Is there any particular reason you're using a guard rather than a pattern match for the \x1680 case?
The \x1680 case comes from the fact that (at present) no characters below there are spaces other than the ones listed. Note that it's using (<) not (==). The only other thing we could do here is to replace the otherwise branch with: isSpace_Alt6 c = iswspace (fromIntegral (C.ord c)) /= 0 but I doubt that'd help either performance or clarity. The one thing I worry about using \x1680 as the threshold[1] is that I'm not sure whether every character below \x1680 has been allocated or whether some are still free. If any of them are free, then this will become incorrect in subsequent versions of Unicode so it's a maintenance timebomb. (Whereas if they're all specified then it should be fine.) Can someone verify that using \x1680 is sound in this manner? [1] I discovered \x1680 by simply checking map isSpace ['\x0'..]. However, in my original proposal of this particular optimization I used \xFF as the boundary, since this is guaranteed by the fact that Unicode is interoperable with ISO-8859-1 (Latin-1) -- Live well, ~wren

+++ wren ng thornton [Oct 31 12 22:39 ]:
The one thing I worry about using \x1680 as the threshold[1] is that I'm not sure whether every character below \x1680 has been allocated or whether some are still free. If any of them are free, then this will become incorrect in subsequent versions of Unicode so it's a maintenance timebomb. (Whereas if they're all specified then it should be fine.) Can someone verify that using \x1680 is sound in this manner?
Really good point. This page [1], if I read it correctly, seems to indicate that c < \x860 would be safe. But I'm not a unicode expert and maybe there's a better place to look. [1] http://www.unicode.org/alloc/CurrentAllocation.html One thing I discovered is that if I use c < \xFF, the Greek benchmark goes out the window -- we are twice as slow as the original GHC.Unicode.isChar. Whatever threshold we choose, it seems, the performance gains below the threshold will be balanced by performance losses above the threshold. This makes me disinclined to submit the patch. It just seems wrong to change the library in a way that helps people who use western alphabets at the expense of people who don't.

Hi, have you considered investigating whether the ffi call can be sped up? To me it seems to be a bit overheady to first check in Haskell-Land for common cases and then call code that would (or does?) this check as well. Maybe the overhead for calling the C code can be reduced so that it is no longer worth doing special casing in the Haskell code. Maybe using primitive operations instead of the C calling convention help, as it would effectively inline the call to u_iswspace? Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

Hi, Am Donnerstag, den 01.11.2012, 14:24 +0100 schrieb Joachim Breitner:
have you considered investigating whether the ffi call can be sped up? To me it seems to be a bit overheady to first check in Haskell-Land for common cases and then call code that would (or does?) this check as well.
looking at the interface of GHC.Unicode I see that the special cases are included in the unfolding of isSpace. This is, of course, something that that even a faster ffi variant cannot achieve. Greetings, Joachim -- Joachim "nomeata" Breitner mail@joachim-breitner.de | nomeata@debian.org | GPG: 0x4743206C xmpp: nomeata@joachim-breitner.de | http://www.joachim-breitner.de/

On Wed, Oct 31, 2012 at 10:39 PM, wren ng thornton
The one thing I worry about using \x1680 as the threshold[1] is that I'm not sure whether every character below \x1680 has been allocated or whether some are still free. If any of them are free, then this will become incorrect in subsequent versions of Unicode so it's a maintenance timebomb. (Whereas if they're all specified then it should be fine.) Can someone verify that using \x1680 is sound in this manner?
According to GHCi: Prelude Data.Char> length $ filter ((== NotAssigned) . generalCategory)
['\0'..'\x1680'] 830

On 10/31/12 11:49 PM, Patrick Palka wrote:
On Wed, Oct 31, 2012 at 10:39 PM, wren ng thornton
wrote: The one thing I worry about using \x1680 as the threshold[1] is that I'm not sure whether every character below \x1680 has been allocated or whether some are still free. If any of them are free, then this will become incorrect in subsequent versions of Unicode so it's a maintenance timebomb. (Whereas if they're all specified then it should be fine.) Can someone verify that using \x1680 is sound in this manner?
According to GHCi:
Prelude Data.Char> length $ filter ((== NotAssigned) . generalCategory)
['\0'..'\x1680'] 830
Guess I never looked closely at what Unicode queries Data.Char offers... Looks like the first unassigned character is '\888' -- Live well, ~wren

Sounds good to me. Thanks for doing this. When you think you are ready, just submit a patch. (As others have noted, maybe isSpace isn't the only function that could benefit from this kind of attention.) Simon | -----Original Message----- | From: libraries-bounces@haskell.org [mailto:libraries-bounces@haskell.org] On | Behalf Of John MacFarlane | Sent: 28 October 2012 19:16 | To: libraries@haskell.org | Subject: Improving Data.Char.isSpace performance | | I think that 'isSpace' from Data.Char (and hence also 'words' from the | Prelude) is not as fast as it could be. Here is the definition | (which is actually found in GHC.Unicode): | | isSpace :: Char -> Bool | isSpace c = | c == ' ' || | c == '\t' || | c == '\n' || | c == '\r' || | c == '\f' || | c == '\v' || | c == '\xa0' || | iswspace (fromIntegral (C.ord c)) /= 0 | | I presume that the point of the disjuncts at the beginning is to | avoid the call to iswspace for the most common space characters. | The problem is that most characters (in most texts) are not space | characters, and for nonspace characters iswspace will always be | called. | | So I investigated a possible optimization that would also check for | the most common nonspace characters before calling iswspace: | | isSpace_Alt :: Char -> Bool | isSpace_Alt c | c > '\x20' && c < '\xa0' = False | | c == ' ' = True | | '\t' <= c && c <= '\r' = True | | c == '\xa0' = True | | otherwise = iswspace (fromIntegral (C.ord c)) /= 0 | | In my benchmarks, this function significantly outperforms isSpace. | I also found that a version of isSpace that does not check | for nonspace characters, but uses case matching instead of | a disjunction of equality tests, outperformed isSpace (but | was usually not as fast as isSpace_Alt, and the difference | mostly disappears with -O2): | | isSpace_Pattern :: Char -> Bool | isSpace_Pattern c | | c == ' ' = True | | '\t' <= c && c <= '\r' = True | | c == '\xa0' = True | | otherwise = iswspace (fromIntegral (C.ord c)) /= 0 | | I benchmarked all three functions against five types of text | (all ascii, all Greek, Haskell code, characters 0..255, and | all spaces), and here are the (normalized) results: | | Compiled with 'ghc --make': | -------------------------------------------------------------- | Input isSpace_DataChar isSpace_Pattern isSpace_Alt | --------------- ---------------- --------------- ----------- | ascii text 1.0 0.54 0.17 | greek text 1.0 0.57 0.71 | haskell code 1.0 0.57 0.24 | chars 0..255 1.0 0.54 0.39 | all space chars 1.0 0.70 0.90 | -------------------------------------------------------------- | | Compiled with 'ghc --make -O2': | -------------------------------------------------------------- | Input isSpace_DataChar isSpace_Pattern isSpace_Alt | --------------- ---------------- --------------- ----------- | ascii text 1.0 0.93 0.40 | greek text 1.0 0.98 0.99 | haskell code 1.0 1.03 0.58 | chars 0..255 1.0 0.92 0.62 | all space chars 1.0 0.88 0.92 | -------------------------------------------------------------- | | My benchmark program can be found here: | https://gist.github.com/3967761 | | I'd like to propose that we consider replacing the definition | of isSpace with isSpace_Alt. | | John | | | _______________________________________________ | Libraries mailing list | Libraries@haskell.org | http://www.haskell.org/mailman/listinfo/libraries

+++ Simon Peyton-Jones [Oct 29 12 22:29 ]:
Sounds good to me. Thanks for doing this.
When you think you are ready, just submit a patch. (As others have noted, maybe isSpace isn't the only function that could benefit from this kind of attention.)
Yes, if the general idea is agreeable, I'll do some of the other functions in GHC.Unicode as well, and provide benchmarks for them as well.

I've done some further investigating. The large differences I was seeing on OSX went away when I moved to a linux machine, and also when I compiled the benchmark with -O3. (When I do either of those things, the puzzling difference between isSpace_DataChar and Data.Char.isSpace also goes away.) I don't know enough about core to figure out what is going on here, but I'll assume that the benchmarks I'm getting on the linux box are the sober ones. They look like this (the number is the ratio new code time / old code time for isSpace): Benchmark compiled without optimization: ascii text 0.71 ascii text (short lines) 0.74 ascii text (long lines) 0.72 Greek text 1.08 Haskell code 0.77 chars 0..255 0.70 all spaces 1.02 Benchmark compiled with -O2: ascii text 0.69 ascii text (short lines) 0.72 ascii text (long lines) 0.69 Greek text 1.11 Haskell code 0.77 chars 0..255 0.69 all spaces 0.94 This suggests that we can get a modest improvement for the most common cases if we adopt the new definition of isSpace. However, performance might actually decrease slightly for non-latin text. The changes I tried for other functions in GHC.Unicode did not result in significant improvements. So, the question is whether it's worth submitting the patch for isSpace, given that the gains are more modest than I'd reported before. (Note that 'words' will also be affected by this, as it uses isSpace.) I have attached the proposed patch to this email. John +++ John MacFarlane [Oct 29 12 16:15 ]:
+++ Simon Peyton-Jones [Oct 29 12 22:29 ]:
Sounds good to me. Thanks for doing this.
When you think you are ready, just submit a patch. (As others have noted, maybe isSpace isn't the only function that could benefit from this kind of attention.)
Yes, if the general idea is agreeable, I'll do some of the other functions in GHC.Unicode as well, and provide benchmarks for them as well.
_______________________________________________ Libraries mailing list Libraries@haskell.org http://www.haskell.org/mailman/listinfo/libraries

On Tue, Oct 30, 2012 at 7:33 PM, John MacFarlane
So, the question is whether it's worth submitting the patch for isSpace, given that the gains are more modest than I'd reported before. (Note that 'words' will also be affected by this, as it uses isSpace.) I have attached the proposed patch to this email.
I say go for it. You probably want to attach the patch to a ticket on Trac so it doesn't get lost. -- Johan
participants (10)
-
Brandon Allbery
-
Edward A Kmett
-
Ivan Lazar Miljenovic
-
Joachim Breitner
-
Johan Tibell
-
John MacFarlane
-
Patrick Palka
-
Simon Peyton-Jones
-
Vincent Hanquez
-
wren ng thornton