HXT and xhtml page encoded in cp1251

Greetings,

I'm writing a small web crawler. I usually use tagsoup for such tasks, but this time I decided to give HXT a try. Unfortunately, I ran into trouble with character encodings. The site I'm targeting uses cp1251, one of the most popular encodings among Russian-language sites. The pages contain the following meta tag:

  <meta http-equiv="content-type" content="text/html; charset=windows-1251" />

The readDocument arrow fails with the following message:

  fatal error: encoding scheme not supported: "WINDOWS-1251"

Can someone suggest a workaround for my use case?

Best regards,
Dmitry

Since the document claims it is HTML, you should be parsing it with an HTML parser. Try hxt-tagsoup -- specifically, the "parseHtmlTagSoup" arrow.
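For what it's worth, a minimal sketch of that approach, assuming the hxt-tagsoup package is installed; in the HXT 9 API the tagsoup parser is typically enabled through the withTagSoup system config rather than called as an arrow directly, and "page.html" is a hypothetical local file:

  import Text.XML.HXT.Core
  import Text.XML.HXT.TagSoup (withTagSoup)

  main :: IO ()
  main = do
    -- withTagSoup swaps in the lazy, error-tolerant tagsoup parser,
    -- which copes with real-world HTML much better than the XML parser
    titles <- runX $ readDocument [withTagSoup, withWarnings no] "page.html"
                     >>> deep (hasName "title")
                     >>> getChildren >>> getText
    mapM_ putStrLn titles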

On 11-04-18 05:06 PM, Dmitry V'yal wrote:
> The readDocument arrow fails with the following message:
> fatal error: encoding scheme not supported: "WINDOWS-1251"
> Can someone suggest a workaround for my use case?
If you have a Handle (from a file or Network, for example):

  import System.IO (hGetContents, hSetEncoding, mkTextEncoding)
  import Text.XML.HXT.Core

  do e <- mkTextEncoding "WINDOWS-1251"   -- or "CP1251" depending on OS
     hSetEncoding your'handle e
     s <- hGetContents your'handle
     t <- runX (readString [...] s >>> ...)
     ...

If you don't have a Handle but a ByteString (from Network.HTTP, for example), dump it into a file first, then use the above.
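Put together, a complete version of that recipe might look like the sketch below. The file name "page.html" and the particular readString options are assumptions; withParseHTML yes tells HXT to run its HTML parser over the already-decoded String:

  import System.IO (IOMode (ReadMode), hGetContents, hSetEncoding,
                    mkTextEncoding, openFile)
  import Text.XML.HXT.Core

  main :: IO ()
  main = do
    h   <- openFile "page.html" ReadMode
    enc <- mkTextEncoding "WINDOWS-1251"  -- or "CP1251", depending on the OS
    hSetEncoding h enc                    -- decode CP1251 at the Handle level
    s   <- hGetContents h                 -- s is now an ordinary Unicode String
    titles <- runX $ readString [withParseHTML yes, withWarnings no] s
                     >>> deep (hasName "title")
                     >>> getChildren >>> getText
    mapM_ putStrLn titles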
participants (3)
- Albert Y. C. Lai
- Dmitry V'yal
- John Millikin