bug in Prelude.words?

Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?

    Prelude> words "abc def\xA0ghi"
    ["abc","def","ghi"]

I would have expected this to be the obvious behaviour:

    Prelude> words "abc def\xA0ghi"
    ["abc","def\160ghi"]

Regards,
Malcolm

It doesn't seem odd to me.
Consider an HTML page with that "sentence" displayed on it. If you ask the
viewer of the page how many words are in the sentence, then surely you will
get the answer 3?
On 28 March 2011 16:55, malcolm.wallace wrote:
> Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?
>
>     Prelude> words "abc def\xA0ghi"
>     ["abc","def","ghi"]
>
> I would have expected this to be the obvious behaviour:
>
>     Prelude> words "abc def\xA0ghi"
>     ["abc","def\160ghi"]
>
> Regards, Malcolm
--
Colin Adams
Preston, Lancashire, ENGLAND
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

> Consider an HTML page with that "sentence" displayed on it. If you ask the viewer of the page how many words are in the sentence, then surely you will get the answer 3?

But what about the author? Surely there is no reason to use a non-breaking space unless they intend it to mean that the characters before and after it belong to the same logical unit-of-comprehension?

Regards,
Malcolm

On 2011-03-28 16:20 +0000, malcolm.wallace wrote:
> But what about the author? Surely there is no reason to use a non-breaking space unless they intend it to mean that the characters before and after it belong to the same logical unit-of-comprehension?

The "non-breaking" part of non-breaking space refers to breaking text into lines. In other words, if two words are separated by a non-breaking space, then a line break will not be put between those words. A non-breaking space does *not* make two words into one word.

--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

On 28 Mar 2011, at 17:20, malcolm.wallace wrote:
>> Consider an HTML page with that "sentence" displayed on it. If you ask the viewer of the page how many words are in the sentence, then surely you will get the answer 3?
>
> But what about the author? Surely there is no reason to use a non-breaking space unless they intend it to mean that the characters before and after it belong to the same logical unit-of-comprehension?

I'm not sure that a logical unit-of-comprehension is the same as a word though. As an aside – in publishing non-breaking spaces are commonly used for other purposes too, for example forcing a word onto a certain line to stop a space river appearing in a paragraph.

Bob

On 28 March 2011 17:55, malcolm.wallace wrote:
> Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?
>
>     Prelude> words "abc def\xA0ghi"
>     ["abc","def","ghi"]
I think it's predictable, isSpace (which words is based on) is based on generalCategory, which returns the proper Unicode category:

    λ> generalCategory '\xa0'
    Space

So:

    -- | Selects white-space characters in the Latin-1 range.
    -- (In Unicode terms, this includes spaces and some control characters.)
    isSpace :: Char -> Bool
    -- isSpace includes non-breaking space
    -- Done with explicit equalities both for efficiency, and to avoid a tiresome
    -- recursion with GHC.List elem
    isSpace c = c == ' '    ||
                c == '\t'   ||
                c == '\n'   ||
                c == '\r'   ||
                c == '\f'   ||
                c == '\v'   ||
                c == '\xa0' ||
                iswspace (fromIntegral (ord c)) /= 0
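
For readers who want the other behaviour, here is a minimal sketch of a words variant that takes the separator predicate as an argument. The names isBreakingSpace and wordsWith are hypothetical and not part of the Prelude or base; the set of no-break spaces excluded (U+00A0, U+2007, U+202F) is an assumption about what "non-breaking" should cover.

```haskell
import Data.Char (isSpace)

-- | Hypothetical predicate: whitespace at which a break is allowed,
-- i.e. isSpace minus the no-break spaces U+00A0, U+2007 and U+202F.
isBreakingSpace :: Char -> Bool
isBreakingSpace c = isSpace c && c `notElem` "\xA0\x2007\x202F"

-- | Hypothetical helper: like Prelude.words, but the caller chooses
-- which characters separate words.
wordsWith :: (Char -> Bool) -> String -> [String]
wordsWith p s = case dropWhile p s of
  "" -> []
  s' -> let (w, rest) = break p s'
        in  w : wordsWith p rest

main :: IO ()
main = do
  -- Using isSpace as the separator gives the same splitting as Prelude.words.
  print (wordsWith isSpace "abc def\xA0ghi" == words "abc def\xA0ghi")
  -- Keeping the no-break space inside the word:
  print (wordsWith isBreakingSpace "abc def\xA0ghi")  -- ["abc","def\160ghi"]
```

Passing the predicate explicitly leaves Prelude.words untouched and makes the choice visible at the call site.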

On Mar 28, 2011, at 12:05 PM, Christopher Done wrote:
> On 28 March 2011 17:55, malcolm.wallace wrote:
>> Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?
>>
>>     Prelude> words "abc def\xA0ghi"
>>     ["abc","def","ghi"]
>
> I think it's predictable, isSpace (which words is based on) is based on generalCategory, which returns the proper Unicode category:
>
>     λ> generalCategory '\xa0'
>     Space

I agree, and I also agree that it would make sense the other way (not breaking on non-breaking spaces). Perhaps it would be a good idea to add a remark to the documentation which specifies the treatment of non-breaking spaces.

-- James

>> I think it's predictable, isSpace (which words is based on) is based on generalCategory, which returns the proper Unicode category:
>>
>>     λ> generalCategory '\xa0'
>>     Space
>
> I agree, and I also agree that it would make sense the other way (not breaking on non-breaking spaces). Perhaps it would be a good idea to add a remark to the documentation which specifies the treatment of non-breaking spaces.

I note that Java has two distinct properties concerning whitespace:

    Character.isSpaceChar('\xA0')    == True
    Character.isWhitespace('\xA0')   == False

Contrast with

    -- \x20 is ASCII space
    Character.isSpaceChar('\x20')    == True
    Character.isWhitespace('\x20')   == True

    -- \x2060 is the word-joiner (zero-width non-breaking space)
    Character.isSpaceChar('\x2060')  == False
    Character.isWhitespace('\x2060') == False

    -- \x202F is the narrow non-breaking space
    Character.isSpaceChar('\x202F')  == True
    Character.isWhitespace('\x202F') == False

    -- \x2009 is the thin space
    Character.isSpaceChar('\x2009')  == True
    Character.isWhitespace('\x2009') == True
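
The same two-way distinction can be approximated in Haskell on top of generalCategory. This is only a sketch: isSpaceChar and isWhitespace below are hypothetical names modelled on the rules in Java's Character documentation, they do not exist in base, and the results depend on the Unicode tables shipped with the compiler.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- | Hypothetical analogue of Java's Character.isSpaceChar:
-- any Unicode separator (categories Zs, Zl, Zp).
isSpaceChar :: Char -> Bool
isSpaceChar c =
  generalCategory c `elem` [Space, LineSeparator, ParagraphSeparator]

-- | Hypothetical analogue of Java's Character.isWhitespace:
-- separators other than the no-break spaces, plus the control
-- characters that Java lists as whitespace.
isWhitespace :: Char -> Bool
isWhitespace c =
  (isSpaceChar c && c `notElem` noBreak) || c `elem` controls
  where
    noBreak  = "\xA0\x2007\x202F"
    controls = "\t\n\v\f\r\x1C\x1D\x1E\x1F"

main :: IO ()
main = mapM_ report ['\xA0', ' ', '\x2060', '\x202F', '\x2009']
  where
    -- Prints each character with both classifications.
    report c = print (c, isSpaceChar c, isWhitespace c)
```

A words variant built on isWhitespace rather than isSpace would keep "def\xA0ghi" together, since U+00A0 counts as a space character but not as breaking whitespace under these definitions.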
participants (6)
- Christopher Done
- Colin Adams
- James Cook
- malcolm.wallace
- Nick Bowler
- Thomas Davie