character encoding problems with hxt (I think)

Dear Haskellers, I'm trying to write some code that grabs countries and provinces from the iso_3166 files on Linux systems. I seem to be running into some kind of character encoding problem. file says iso_3166_2.xml is a utf8 file, and isutf8 agrees, but when I run the following code, it crashes. uft8Copy makes a byte for byte copy as expected. noCrash read and writes the document without crashing, but the accented characters in the strings show up garbled. Just search for "DE" and you'll see what I mean. crash (on my system, (Debian testing)) produces the error message below. Can anyone enlighten me on what is going on? Thanks in advance. Henry Laxen ------------------------------------------------------------------------ {-# LANGUAGE Arrows #-} import Text.XML.HXT.Core import Data.List import qualified System.IO.UTF8 as U isoFile = "/usr/share/xml/iso-codes/iso_3166_2.xml" countZerosInLines = length . filter (\x -> x == '0') . concat utf8Copy = do xml <- U.readFile isoFile let doc = readString [withValidate no] xml U.writeFile "iso_written.xml" xml putStrLn "Ran utf8Copy" noCrash :: IO () noCrash = do lines <- runX (readDocument [withValidate no] isoFile >>> writeDocumentToString [withShowTree yes] ) -- mapM_ putStrLn lines print $ countZerosInLines lines putStrLn "Ran noCrash" crash :: IO () crash = do xml <- U.readFile isoFile let doc = readString [withValidate no] xml src <- runX $ doc //> writeDocumentToString [withShowTree yes] print $ countZerosInLines src putStrLn "Ran crash (didn't)" {- Ran utf8Copy 1236 Ran noCrash error: UTF-8 encoding error at input position 2640: InvalidLaterByte 1 error: UTF-8 encoding error at input position 2646: InvalidLaterByte 1 *** Exception: Enum.toEnum{Word8}: tag (363) is outside of bounds (0,255) -} main = do utf8Copy noCrash crash ------------------------------------------------------------------------
participants (1)
-
nadine.and.henry@pobox.com