bug in Prelude.words?


malcolm.wallace

28 Mar 2011

3:55 p.m.

Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?

Prelude> words "abc def\xA0ghi"
["abc","def","ghi"]

I would have expected this to be the obvious behaviour:

Prelude> words "abc def\xA0ghi"
["abc","def\160ghi"]

Regards,
Malcolm


Colin Adams

28 Mar

4:02 p.m.

It doesn't seem odd to me. Consider an HTML page with that "sentence" displayed on it. If you ask the viewer of the page how many words are in the sentence, then surely you will get the answer 3?

On 28 March 2011 16:55, malcolm.wallace wrote:

Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?

Prelude> words "abc def\xA0ghi"
["abc","def","ghi"]

I would have expected this to be the obvious behaviour:

Prelude> words "abc def\xA0ghi"
["abc","def\160ghi"]

Regards, Malcolm

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

--
Colin Adams
Preston, Lancashire, ENGLAND
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

malcolm.wallace

4:20 p.m.

Consider an HTML page with that "sentence" displayed on it. If you ask the viewer of the page how many words are in the sentence, then surely you will get the answer 3?

But what about the author? Surely there is no reason to use a non-breaking space unless they intend it to mean that the characters before and after it belong to the same logical unit-of-comprehension?

Regards,
Malcolm

Nick Bowler

4:59 p.m.

On 2011-03-28 16:20 +0000, malcolm.wallace wrote:

But what about the author? Surely there is no reason to use a non-breaking space unless they intend it to mean that the characters before and after it belong to the same logical unit-of-comprehension?

The "non-breaking" part of non-breaking space refers to breaking text into lines. In other words, if two words are separated by a non-breaking space, then a line break will not be put between those words.

A non-breaking space does *not* make two words into one word.

--
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

Thomas Davie

5:51 p.m.

On 28 Mar 2011, at 17:20, malcolm.wallace wrote:

Consider an HTML page with that "sentence" displayed on it. If you ask the viewer of the page how many words are in the sentence, then surely you will get the answer 3?

But what about the author? Surely there is no reason to use a non-breaking space unless they intend it to mean that the characters before and after it belong to the same logical unit-of-comprehension?

I'm not sure that a logical unit-of-comprehension is the same as a word though.

As an aside – in publishing non-breaking spaces are commonly used for other purposes too, for example forcing a word onto a certain line to stop a space river appearing in a paragraph.

Bob

Christopher Done

4:05 p.m.

On 28 March 2011 17:55, malcolm.wallace wrote:

Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?

Prelude> words "abc def\xA0ghi"
["abc","def","ghi"]

I think it's predictable, isSpace (which words is based on) is based on generalCategory, which returns the proper Unicode category:

λ> generalCategory '\xa0'
Space

So:

-- | Selects white-space characters in the Latin-1 range.
-- (In Unicode terms, this includes spaces and some control characters.)
isSpace :: Char -> Bool
-- isSpace includes non-breaking space
-- Done with explicit equalities both for efficiency, and to avoid a tiresome
-- recursion with GHC.List elem
isSpace c = c == ' ' ||
            c == '\t' ||
            c == '\n' ||
            c == '\r' ||
            c == '\f' ||
            c == '\v' ||
            c == '\xa0' ||
            iswspace (fromIntegral (ord c)) /= 0
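The behaviour described in this message can be checked with a small program. This is only a minimal sketch using Data.Char from base; the `report` helper and the choice of sample characters are illustrative assumptions and do not come from the original message.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isSpace)

-- Hypothetical helper: pair a character with its Unicode general
-- category and the result of isSpace.
report :: Char -> (Char, GeneralCategory, Bool)
report c = (c, generalCategory c, isSpace c)

main :: IO ()
main = do
  -- ASCII space, tab, and the no-break space U+00A0
  mapM_ (print . report) [' ', '\t', '\xa0']
  -- The call from the start of the thread
  print (words "abc def\xA0ghi")
```

Running it should show that isSpace accepts the no-break space, which is why words splits there.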

James Cook

4:24 p.m.

On Mar 28, 2011, at 12:05 PM, Christopher Done wrote:

On 28 March 2011 17:55, malcolm.wallace wrote:

Does anyone else think it odd that Prelude.words will break a string at a non-breaking space?

Prelude> words "abc def\xA0ghi"
["abc","def","ghi"]

I think it's predictable, isSpace (which words is based on) is based on generalCategory, which returns the proper Unicode category:

λ> generalCategory '\xa0'
Space

I agree, and I also agree that it would make sense the other way (not breaking on non-breaking spaces). Perhaps it would be a good idea to add a remark to the documentation which specifies the treatment of non-breaking spaces.

-- James

malcolm.wallace

4:53 p.m.

I think it's predictable, isSpace (which words is based on) is based on generalCategory, which returns the proper Unicode category:

λ> generalCategory '\xa0'
Space

I agree, and I also agree that it would make sense the other way (not breaking on non-breaking spaces). Perhaps it would be a good idea to add a remark to the documentation which specifies the treatment of non-breaking spaces.

I note that Java has two distinct properties concerning whitespace:

Character.isSpaceChar('\xA0') == True
Character.isWhitespace('\xA0') == False

Contrast with

-- \x20 is ASCII space
Character.isSpaceChar('\x20') == True
Character.isWhitespace('\x20') == True

-- \x2060 is the word-joiner (zero-width non-breaking space)
Character.isSpaceChar('\x2060') == False
Character.isWhitespace('\x2060') == False

-- \x202F is the narrow non-breaking space
Character.isSpaceChar('\x202F') == True
Character.isWhitespace('\x202F') == False

-- \x2009 is the thin space
Character.isSpaceChar('\x2009') == True
Character.isWhitespace('\x2009') == True
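For anyone who wants the other behaviour in Haskell, the distinction Java draws can be approximated with a custom predicate. The following is a rough sketch, not a proposed change to the Prelude: the names `isNoBreakSpace`, `isBreakingSpace`, `wordsBy` and `wordsNB` are hypothetical, and the set of no-break code points (U+00A0, U+2007, U+202F) is taken from the ones Java's Character.isWhitespace documents as excluded.

```haskell
import Data.Char (isSpace)

-- Assumption: the spaces treated as non-breaking are NO-BREAK SPACE,
-- FIGURE SPACE and NARROW NO-BREAK SPACE.
isNoBreakSpace :: Char -> Bool
isNoBreakSpace c = c `elem` ['\x00A0', '\x2007', '\x202F']

-- Hypothetical predicate: whitespace at which a word may be split.
isBreakingSpace :: Char -> Bool
isBreakingSpace c = isSpace c && not (isNoBreakSpace c)

-- Same shape as the Haskell Report definition of words, with the
-- separator predicate passed in as a parameter.
wordsBy :: (Char -> Bool) -> String -> [String]
wordsBy p s = case dropWhile p s of
  "" -> []
  s' -> w : wordsBy p rest
    where (w, rest) = break p s'

-- A variant of words that keeps no-break spaces inside a word.
wordsNB :: String -> [String]
wordsNB = wordsBy isBreakingSpace

main :: IO ()
main = do
  print (words   "abc def\xA0ghi")
  print (wordsNB "abc def\xA0ghi")
```

With this sketch, wordsNB keeps "def\160ghi" together while Prelude.words continues to split it, so the two behaviours can coexist under different names.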
