Parse CSV / TSV file in Haskell - Unicode Characters

Hi, I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell. However, I get problems if there are "strange" (Unicode) characters in my CSV file. I'll get a parse error (endOfInput) then. According to the command-line tool "file", my file has a "UTF-8 Unicode text" decoding. My Haskell code looks like this: {-# LANGUAGE ScopedTypeVariables #-} {-# LANGUAGE OverloadedStrings #-} import qualified Data.ByteString as C import qualified System.IO.UTF8 as U import qualified Data.ByteString.UTF8 as UB import qualified Data.ByteString.Lazy.Char8 as DL import qualified Codec.Binary.UTF8.String as US import qualified Data.Text.Lazy.Encoding as EL import qualified Data.ByteString.Lazy as L import Data.Text.Encoding as E -- Handle CSV / TSV files with ... import Data.Csv import qualified Data.Vector as V import Data.Char -- ord csvFile :: FilePath csvFile = "myFile.txt" -- Set delimiter to \t (tabulator) myOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') } main :: IO () main = do csvData <- L.readFile csvFile case EL.decodeUtf8' csvData of Left err -> print err Right dat -> case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of Left err -> putStrLn err Right v -> V.forM_ v $ \ (category :: String , user :: String , date :: String, time :: String, message :: String) -> do print message I tried using decodingUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However the endOfFile error persists. My CSV-file looks like this: a - - - RT USE " Kenny" ⢠Hahahahahahahahaha. #Emmen #Brandstapel a - - - Uhm .. wat dan ook ????!!!! đ Or more literally: a\t-\t-\t-\tRT USE " Kenny" ⢠Hahahahahahahahaha. #Emmen #Brandstapel a\t-\t-\t-\tUhm .. wat dan ook ????!!!! đ The problem chars are the đ and ⢠(and in my complete file, there are many more of similar characters). What can I do, so that cassava / Data.Csv can read my file properly? I've also posted this question at StackOverflow a few days ago: http://stackoverflow.com/questions/26499831/parse-csv-tsv-file-in-haskell-un... Best, Volker

I just replied to your question on SO.
On Fri, Oct 24, 2014 at 3:35 AM, Volker Strobel
Hi,
I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell. However, I get problems if there are "strange" (Unicode) characters in my CSV file. I'll get a parse error (endOfInput) then.
According to the command-line tool "file", my file has a "UTF-8 Unicode text" decoding. My Haskell code looks like this:
{-# LANGUAGE ScopedTypeVariables #-} {-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as C import qualified System.IO.UTF8 as U import qualified Data.ByteString.UTF8 as UB import qualified Data.ByteString.Lazy.Char8 as DL import qualified Codec.Binary.UTF8.String as US import qualified Data.Text.Lazy.Encoding as EL import qualified Data.ByteString.Lazy as L
import Data.Text.Encoding as E
-- Handle CSV / TSV files with ... import Data.Csv import qualified Data.Vector as V
import Data.Char -- ord
csvFile :: FilePath csvFile = "myFile.txt"
-- Set delimiter to \t (tabulator) myOptions = defaultDecodeOptions { decDelimiter = fromIntegral (ord '\t') }
main :: IO () main = do csvData <- L.readFile csvFile case EL.decodeUtf8' csvData of Left err -> print err Right dat -> case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of Left err -> putStrLn err Right v -> V.forM_ v $ \ (category :: String , user :: String , date :: String, time :: String, message :: String) -> do print message
I tried using decodingUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However the endOfFile error persists.
My CSV-file looks like this:
a - - - RT USE " Kenny" ⢠Hahahahahahahahaha. #Emmen #Brandstapel a - - - Uhm .. wat dan ook ????!!!! đ
Or more literally:
a\t-\t-\t-\tRT USE " Kenny" ⢠Hahahahahahahahaha. #Emmen #Brandstapel a\t-\t-\t-\tUhm .. wat dan ook ????!!!! đ
The problem chars are the đ and ⢠(and in my complete file, there are many more of similar characters). What can I do, so that cassava / Data.Csv can read my file properly?
I've also posted this question at StackOverflow a few days ago: http://stackoverflow.com/questions/26499831/parse-csv-tsv-file-in-haskell-un...
Best, Volker
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
participants (2)
-
Johan Tibell
-
Volker Strobel