
Hi. I'm trying to print out UTF-8 strings (pulled from a MySQL database through HDBC.ODBC). If I just print them out, the extended characters are printed as question marks. If I use show on them (which is what I want to do in this case), they are printed as \NNNNN... (presumably the Unicode number). Google brought up hundreds of topics for "Haskell utf-8", but it's like looking for a single tree in a forest.

I took great care to ensure my Gentoo Linux system is UTF-8 ready, and I can type or echo UTF-8 characters on the command line just fine.

# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

--
frigidcode.com
theologia.indicium.us
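For reference, these are two separate effects: show escapes every non-ASCII character as a numeric escape by design, so putStrLn (rather than print) is what produces readable output, and the encoding GHC uses for stdout can be forced explicitly rather than inferred from the locale. A minimal sketch, assuming GHC 6.12 or later, where System.IO exposes hSetEncoding and utf8:

import System.IO (hSetEncoding, stdout, utf8)

main :: IO ()
main = do
  -- Force UTF-8 on stdout instead of relying on what GHC inferred from the locale.
  hSetEncoding stdout utf8
  let s = "\228\230"   -- the characters ä and æ, written as numeric escapes
  print s              -- show-style output: "\228\230"
  putStrLn s           -- rendered output: äæ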

I usually pull such values as ByteStrings. The utf8-string package provides printing functions for such data. Also, if you'd like to work on it as Text, there's the Data.Text.Encoding module, whose decodeUtf8 and decodeASCII turn ByteStrings into Text. (A sketch of the utf8-string route follows the quoted message below.)

On Thu, 2011-05-26 at 20:54 -0800, Christopher Howard wrote:
Hi. I'm trying to print out UTF-8 strings (pulled from a MySQL database through HDBC.ODBC). If I just print them out, the extended characters are printed out as question marks. If I use show on them (which is what I want to do in this case) they are printed as \NNNNN... (presumably the Unicode number).
Google brought up hundreds of topics for "Haskell utf-8", but it's like looking for a single tree in a forest.
I took great care to ensure my Gentoo Linux system is UTF-8 ready, and I can type or echo utf-8 characters on the command line just fine.
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
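To make the first suggestion above concrete, a minimal sketch of the utf8-string route, assuming the query result arrives as a strict ByteString:

import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as UTF8   -- from the utf8-string package

-- Decode a UTF-8 encoded ByteString into an ordinary String and print it.
printUtf8Field :: B.ByteString -> IO ()
printUtf8Field bs = putStrLn (UTF8.toString bs)

As far as I recall, toString substitutes the Unicode replacement character for malformed bytes rather than throwing, which can be handy while debugging.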

Elvio Toccalino wrote:
I usually pull such values as ByteStrings. The utf8-string package provides printing functions for such data. Also, if you'd like to work on it as Text, there's the Data.Text.Encoding module, whose decodeUtf8 and decodeASCII turn ByteStrings into Text.
I would recommend Data.Text because that is designed for text encodings, while ByteString is designed for handling arrays of bytes (e.g. things like data read from a network socket).

Cheers,
Erik

--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
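A minimal sketch of the Text route, assuming the field really is valid UTF-8 (decodeUtf8 throws on anything else):

import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE
import qualified Data.Text.IO as TIO

-- Decode a UTF-8 ByteString to Text and print it without going through String.
printField :: B.ByteString -> IO ()
printField = TIO.putStrLn . TE.decodeUtf8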

On 05/26/2011 11:38 PM, Erik de Castro Lopo wrote:
Elvio Toccalino wrote:
I usually pull such values as ByteStrings. The utf8-string package provides printing functions for such data. Also, if you'd like to work on it as Text, there's the Data.Text.Encoding module, whose decodeUtf8 and decodeASCII turn ByteStrings into Text.
I would recommend Data.Text because that is designed for text encodings, while ByteString is designed for handling arrays of bytes (e.g. things like data read from a network socket).
Cheers, Erik
Thank you everyone for your responses, but I'm still having trouble with this issue. This is what I've learned so far:

- The problem is not my terminal: commands like putStrLn "\228" produce the correct character.
- The problem doesn't seem to be my MySQL database. The character encoding on both the database and the table is set to UTF-8, and I can view the characters just fine in the MySQL client.
- I looked again at the character numbers of the special characters being output by my program: they are not the correct Unicode numbers. Apparently the special characters are somehow getting mangled when I pull them from the database.
- The choice to put the query results into ByteStrings was not mine; that is just what is in the results list that quickQuery gives me. I tried converting to Text and then to String by using "unpack (decodeUtf8 byte_str)"; however, when the program goes to print its first line with an extended character, it throws the exception: "*** Exception: Cannot decode byte '\xe6': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream".

--
frigidcode.com
theologia.indicium.us
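A side note on that exception: decodeUtf8 is deliberately strict and rejects anything that is not well-formed UTF-8. While tracking down where the bytes get mangled, the text package's lenient decoder at least lets you look at the rest of the data. A small sketch, assuming a text version that exports decodeUtf8With and lenientDecode:

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Like decodeUtf8, but substitutes U+FFFD for bytes that are not valid UTF-8
-- instead of throwing an exception.
decodeLeniently :: B.ByteString -> T.Text
decodeLeniently = decodeUtf8With lenientDecode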

On 05/27/2011 08:01 PM, Christopher Howard wrote:
On 05/26/2011 11:38 PM, Erik de Castro Lopo wrote:
Elvio Toccalino wrote:
I usually pull such values as ByteStrings. The utf8-string package provides printing functions for such data. Also, if you'd like to work on it as Text, there's the Data.Text.Encoding module, whose decodeUtf8 and decodeASCII turn ByteStrings into Text.
I would recommend Data.Text because that is designed for text encodings, while ByteString is designed for handling arrays of bytes (e.g. things like data read from a network socket).
Cheers, Erik
Thank you everyone for your responses, but I'm still having trouble with this issue. This is what I've learned so far:
- The problem is not my terminal: commands like putStrLn "\228" produce the correct character.
- The problem doesn't seem to be my MySQL database. The character encoding on both the database and the table is set to UTF-8, and I can view the characters just fine in the MySQL client.
- I looked again at the character numbers of the special characters being output by my program: they are not the correct Unicode numbers. Apparently the special characters are somehow getting mangled when I pull them from the database.
- The choice to put the query results into ByteStrings was not mine; that is just what is in the results list that quickQuery gives me. I tried converting to Text and then to String by using "unpack (decodeUtf8 byte_str)"; however, when the program goes to print its first line with an extended character, it throws the exception: "*** Exception: Cannot decode byte '\xe6': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream".
Update: I think this may be a problem with the myodbc5 driver. If I connect with isql and then select the data, the results with extended characters have the text either mangled or replaced with "***ERROR***". So I am thinking that either this MySQL driver for unixODBC is not able to handle UTF-8, or (more likely) I don't have the correct options in my .odbc.ini config file.

--
frigidcode.com
theologia.indicium.us
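For what it's worth, the character set can usually be requested in the ODBC connection string as well. A hedged sketch using HDBC's connectODBC; whether your Connector/ODBC build honours a CHARSET option, and the DSN, table, and column names, are assumptions to check against your driver's documentation:

import Database.HDBC       (disconnect, quickQuery')
import Database.HDBC.ODBC  (connectODBC)

main :: IO ()
main = do
  -- "mydsn" and the CHARSET option are illustrative; check your driver's docs.
  conn <- connectODBC "DSN=mydsn;CHARSET=utf8"
  rows <- quickQuery' conn "SELECT title FROM books" []   -- hypothetical table
  print rows
  disconnect conn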

In case your ini file is not the problem, check the output of the conversion mentioned in the other mail (Data.Text.IO.putStrLn $ Data.Text.Encoding.decodeUtf8 $ bytestringResult) with decodeASCII instead, just to see what comes out. (A byte-level check is sketched after the quoted messages below.)

BTW, the \xe6 is the UTF8 code for the small letter AE, which I'm guessing is not something you'd use in a query string :P Unless you're storing some very cosmopolitan data in your database, I'd say someone is messing with your codes. So far, this hasn't happened to me using the Haskell libraries recommended here.

On Fri, 2011-05-27 at 20:35 -0800, Christopher Howard wrote:
On 05/27/2011 08:01 PM, Christopher Howard wrote:
On 05/26/2011 11:38 PM, Erik de Castro Lopo wrote:
Elvio Toccalino wrote:
I usually pull such values as ByteStrings. The utf8-string package provides printing functions for such data. Also, if you'd like to work on it as Text, there's the Data.Text.Encoding module, whose decodeUtf8 and decodeASCII turn ByteStrings into Text.
I would recommend Data.Text because that is designed for text encodings, while ByteString is designed for handling arrays of bytes (e.g. things like data read from a network socket).
Cheers, Erik
Thank you everyone for your responses, but I'm still having trouble with this issue. This is what I've learned so far:
- The problem is not my terminal: commands like putStrLn "\228" produce the correct character.
- The problem doesn't seem to be my MySQL database. The character encoding on both the database and the table is set to UTF-8, and I can view the characters just fine in the MySQL client.
- I looked again at the character numbers of the special characters being output by my program: they are not the correct Unicode numbers. Apparently the special characters are somehow getting mangled when I pull them from the database.
- The choice to put the query results into ByteStrings was not mine; that is just what is in the results list that quickQuery gives me. I tried converting to Text and then to String by using "unpack (decodeUtf8 byte_str)"; however, when the program goes to print its first line with an extended character, it throws the exception: "*** Exception: Cannot decode byte '\xe6': Data.Text.Encoding.decodeUtf8: Invalid UTF-8 stream".
Update: I think this may be a problem with the myodbc5 driver. If I connect with isql and then select the data, the results with extended characters have the text either mangled or replaced with "***ERROR***". So I am thinking that either this MySQL driver for unixODBC is not able to handle UTF-8, or (more likely) I don't have the correct options in my .odbc.ini config file.
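One concrete way to act on that suggestion is to dump the raw bytes of one offending field and compare them with what UTF-8 should look like. A small sketch, with illustrative names only:

import qualified Data.ByteString as B
import Text.Printf (printf)

-- Print each byte of a field in hex: UTF-8 'æ' shows up as the pair "c3 a6",
-- while a Latin-1 (or otherwise mangled) stream shows a bare "e6".
dumpBytes :: B.ByteString -> IO ()
dumpBytes bs = do
  mapM_ (\w -> printf "%02x " (fromIntegral w :: Int)) (B.unpack bs)
  putStrLn ""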

On 05/28/2011 08:56 AM, Elvio Toccalino wrote:
In case your ini file is not the problem, check the output of the conversion mentioned in the other mail (Data.Text.IO.putStrLn $ Data.Text.Encoding.decodeUtf8 $ bytestringResult) with decodeASCII instead, just to see what comes out.
BTW, the \xe6 is the UTF8 code for the small letter AE, which I'm guessing is not something you'd use in a query string :P Unless you're storing some very cosmopolitan data in your database, I'd say someone is messing with your codes. So far, this hasn't happened to me using the Haskell libraries recommended here.
Actually, e6 was the correct UTF-8 hex value (the ligature æ), used as part of a book title. To be honest, I finally gave up on Database.HDBC.ODBC and tried the experimental Database.HDBC.MySQL. So far it appears to be getting the Unicode right.

--
frigidcode.com
theologia.indicium.us
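For anyone following along, the switch looks roughly like the sketch below. The connection-info field names are those I remember from the hdbc-mysql package, and the credentials, database, and table names are placeholders, so treat the details as assumptions to check against the package documentation:

import Database.HDBC
import Database.HDBC.MySQL

main :: IO ()
main = do
  conn <- connectMySQL defaultMySQLConnectInfo
            { mysqlHost     = "127.0.0.1"
            , mysqlUser     = "someuser"     -- placeholder credentials
            , mysqlPassword = "somepass"
            , mysqlDatabase = "library"      -- placeholder database name
            }
  rows <- quickQuery' conn "SELECT title FROM books" []   -- hypothetical table
  print rows
  disconnect conn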

On 05/28/2011 08:56 AM, Elvio Toccalino wrote:
BTW, the \xe6 is the UTF8 code for the small letter AE, which I'm guessing is not something you'd use in a query string :P
On Sat, 2011-05-28 at 09:58 -0800, Christopher Howard wrote:
Actually, e6 was the correct UTF-8 hex value (the ligature æ), used as part of a book title.
Aggressive inference can be a bitch sometimes. Keep reporting on your adventures, Chris. And if you keep having problems with a library and suspect it's a bug, please let the right people know about it.

On 27.05.2011 06:54, Christopher Howard wrote:
Hi. I'm trying to print out UTF-8 strings (pulled from a MySQL database through HDBC.ODBC). If I just print them out, the extended characters are printed out as question marks.
In this case the character encoding of your display device (terminal) is still wrong.

ghci
Prelude> putStrLn "\228"
ä

should work.

Christian
If I use show on them (which is what I want to do in this case) they are printed as \NNNNN... (presumably the Unicode number).
Google brought up hundreds of topics for "Haskell utf-8", but it's like looking for a single tree in a forest.
I took great care to ensure my Gentoo Linux system is UTF-8 ready, and I can type or echo utf-8 characters on the command line just fine.
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
participants (4)
- Christian Maeder
- Christopher Howard
- Elvio Toccalino
- Erik de Castro Lopo