[tagsoup] Is this the expected behaviour?

Hi,
While experimenting with tagsoup (I'm using GHC 6.8.2 and tagsoup-0.6), I found something that looks like strange behaviour to me: when parsing tag attributes that have spaces around the "=" sign, tagsoup seems to interpret them as attributes with empty names and values.
For instance (notice the spaces around the equals sign):
$ ghcii.sh -package tagsoup
[...]
Loading package tagsoup-0.6 ... linking ... done.
Prelude> :m +Text.HTML.TagSoup
Prelude Text.HTML.TagSoup> parseTags "uh ?</a>"
[TagOpen "a" [("href",""),("","what")],TagText "uh ?",TagClose "a"]
Here, am I wrong to expect
[TagOpen "a" [("href","what")],TagText "uh ?",TagClose "a"]
or is there some HTML rule for parsing attributes that I don't know about?
Sincerely yours,
Fernand

Hi Fernand,
> Experimenting with tagsoup (I'm using GHC 6.8.2 and tagsoup-0.6), I found something which appears to me as strange behaviour : when parsing tag's attributes that have spaces enclosing the "=" sign, tagsoup seems to interpret these as empty attributes' names and values.
Yep, that's the wrong behaviour. I'm just writing a patch now, and will email back once it's in the development version.
Many thanks for reporting the bug - it will go in the regression test and will never be broken again :-)
Thanks
Neil

Hi Fernand,
Using the darcs version:
*Text.HTML.TagSoup> parseTags "uh ?</a>"
[TagOpen "a" [("href","what")],TagText "uh ?",TagClose "a"]
Which you can get from: http://www.cs.york.ac.uk/fp/darcs/tagsoup
This will be bundled up in the next release. If the problem is more
urgent for you, let me know and I'll release a new version. I
appreciate reports of any bugs, or just weird things tagsoup does on
malformed HTML, so I can build up a more comprehensive regression suite.
Thanks
Neil
On Mon, May 19, 2008 at 11:17 AM, Neil Mitchell wrote:
> Hi Fernand,
>> Experimenting with tagsoup (I'm using GHC 6.8.2 and tagsoup-0.6), I found something which appears to me as strange behaviour : when parsing tag's attributes that have spaces enclosing the "=" sign, tagsoup seems to interpret these as empty attributes' names and values.
> Yep, that's the wrong behaviour. I'm just writing a patch now, and will email back once it's in the development version.
> Many thanks for reporting the bug - it will go in the regression test and will never be broken again :-)
> Thanks
> Neil

Fernand wrote:
> Experimenting with tagsoup (I'm using GHC 6.8.2 and tagsoup-0.6), I found something which appears to me as strange behaviour : when parsing tag's attributes that have spaces enclosing the "=" sign, tagsoup seems to interpret these as empty attributes' names and values. For instance (notice the spaces enclosing the equal sign) :
I don't think that is legal XML. The definitions of STag and Attribute from http://www.w3.org/TR/xml11/#NT-STag are:
  [40] STag ::= '<' Name (S Attribute)* S? '>'
  [41] Attribute ::= Name Eq AttValue
And 'S' represents one or more whitespace characters, so it seems clear that they are not allowed between Name, Eq, and AttValue.
Whether this is the right behavior for TagSoup, which is styled as a fast-and-loose XML/HTML processor, is another matter.
-k
--
If I haven't seen further, it is by standing in the footprints of giants

Hi Ketil,
> I don't think that is legal XML. The definitions of STag and Attribute from http://www.w3.org/TR/xml11/#NT-STag are:
>   [40] STag ::= '<' Name (S Attribute)* S? '>'
>   [41] Attribute ::= Name Eq AttValue
> And 'S' represents one or more whitespace characters, so it seems clear that they are not allowed between Name, Eq, and AttValue.
> Whether this is the right behavior for TagSoup, which is styled as a fast-and-loose XML/HTML processor, is another matter.
It seems that both Firefox and IE accept the attribute values with spaces around the equals, so I think that's a sensible choice for tagsoup. I must confess I haven't actually read the XML definition in the last 5 years, but probably should!
Thanks
Neil

Ketil Malde wrote:
> I don't think that is legal XML. The definitions of STag and Attribute from http://www.w3.org/TR/xml11/#NT-STag are:
>   [40] STag ::= '<' Name (S Attribute)* S? '>'
>   [41] Attribute ::= Name Eq AttValue
> And 'S' represents one or more whitespace characters, so it seems clear that they are not allowed between Name, Eq, and AttValue.
Indeed, it's legal because of rule [25] (http://www.w3.org/TR/xml11/#NT-Eq)
  [25] Eq ::= S? '=' S?
and rule [3]
  [3] S ::= (#x20 | #x9 | #xD | #xA)+
Cheers,
Uwe
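
The grammar rules quoted in this thread can be turned into a small parser to see why whitespace around "=" is accepted. What follows is only a minimal sketch, not tagsoup's implementation: it uses Text.ParserCombinators.ReadP from base, the names xmlSpace, eq, name, attValue, attribute and parseAttribute are hypothetical, and Name and AttValue are deliberately simplified (a rough character set for names, no entity references in values).

```haskell
import Data.Char (isAlphaNum)
import Text.ParserCombinators.ReadP

-- Rule [3]: S ::= (#x20 | #x9 | #xD | #xA)+
xmlSpace :: ReadP ()
xmlSpace = munch1 (`elem` " \t\r\n") >> return ()

-- Rule [25]: Eq ::= S? '=' S?
-- The optional whitespace on both sides is what makes `href = "what"` legal.
eq :: ReadP ()
eq = optional xmlSpace >> char '=' >> optional xmlSpace

-- Simplified stand-in for the XML Name production (assumption: the real
-- production allows a more precisely defined character set).
name :: ReadP String
name = munch1 (\c -> isAlphaNum c || c `elem` "_:-.")

-- Simplified stand-in for AttValue: a single- or double-quoted string,
-- without handling references such as &amp;.
attValue :: ReadP String
attValue = quoted '"' +++ quoted '\''
  where
    quoted q = between (char q) (char q) (munch (/= q))

-- Rule [41]: Attribute ::= Name Eq AttValue
attribute :: ReadP (String, String)
attribute = do
  n <- name
  eq
  v <- attValue
  return (n, v)

-- Parse a complete string as exactly one attribute.
parseAttribute :: String -> Maybe (String, String)
parseAttribute s =
  case readP_to_S (attribute >>= \r -> eof >> return r) s of
    [(r, "")] -> Just r
    _         -> Nothing
```

In GHCi, parseAttribute "href = \"what\"" and parseAttribute "href=\"what\"" both give Just ("href","what"), while an input without the "=" sign gives Nothing.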
participants (4)
- Fernand
- Ketil Malde
- Neil Mitchell
- Uwe Schmidt