
On Sunday 13 June 2010 08:00:15, Erik de Castro Lopo wrote:
Hi all,
I've managed to use the Curl bindings to pull down a web page, and I'm using TagSoup to parse it, but when I try to print the text in a TagText I get
hPutChar: invalid argument (Invalid or incomplete multibyte or wide character)
The code looks like:
    parsePage :: String -> IO ()
    parsePage page = do
        let tags = map deTag $ filter isTagText $ parseTags page
        mapM_ putStrLn tags
      where
        deTag (TagText s) = s
        deTag x = error $ "Bad Tag '" ++ show x ++ "' in deTag."
This is with ghc-6.12.1 on Debian Linux.
Any clues appreciated.
Cheers, Erik
Probably the page you tried it on wasn't encoded in your locale's encoding. If the page was in latin1 and your locale is UTF-8, it will likely contain byte sequences that are invalid as UTF-8.

For a locally stored page, the code above worked fine with tagsoup-0.6 and tagsoup-0.10 when the page was UTF-8-encoded. If it was latin1-encoded (and contained non-ASCII characters), it raised an

invalid argument (Invalid or incomplete multibyte or wide character)

error, though on hGetContents rather than hPutChar. I suppose that's because I used readFile and not the Curl bindings.
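
If the encoding of the page is known, one way to sidestep the locale is to set the handle encoding explicitly. The following is only a minimal sketch for the locally stored case, assuming the file is latin1 and that UTF-8 output is what the terminal expects. readFileAs and printUtf8 are hypothetical helper names, not part of TagSoup or the Curl bindings; hSetEncoding, latin1 and utf8 come from System.IO in the base library shipped with ghc-6.12.

```haskell
import System.IO

-- Hypothetical helper: read a whole file, decoding its bytes with the
-- given encoding instead of the locale's default encoding.
readFileAs :: TextEncoding -> FilePath -> IO String
readFileAs enc path = do
    h <- openFile path ReadMode
    hSetEncoding h enc
    s <- hGetContents h
    -- Force the whole string so the file is fully read before closing.
    length s `seq` hClose h
    return s

-- Hypothetical helper: print lines to stdout as UTF-8 regardless of
-- the locale (an assumption about what the terminal expects).
printUtf8 :: [String] -> IO ()
printUtf8 ls = do
    hSetEncoding stdout utf8
    mapM_ putStrLn ls
```

With these, something like readFileAs latin1 "page.html" >>= parsePage would decode the file as latin1 before TagSoup sees it ("page.html" is just a placeholder name). Whether the same approach helps with the String returned by the Curl bindings depends on how they decode the response, which I haven't checked.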