HDBC, character encoding

Hi, I wrote a CGI program to access a Postgres database using HDBC. The database stores books, and I want to display those by a certain author. Everything works fine, except when I search for someone with an umlaut in his name, Böll for example. I have a function like this:
bookByAuthor :: Connection -> AutorName -> IO [[String]]
bookByAuthor c aName = do
  writeFile "./err.log" (show aName ++ " " ++ show (toSql aName))
  rows <- quickQuery c
            "SELECT * FROM buecher WHERE lower(autor_name) LIKE ? ORDER BY autor_name, buch_name"
            [toSql $ map toLower $ '%' : aName ++ "%"]
  return $ map (map fromSql) rows
It returns a SqlError. However, doing the same in ghci works perfectly, and I can't understand why. err.log contains
"b\195\182ll" SqlString "b\195\182ll"
which I think is OK, since

  quickQuery c "SELECT * FROM buecher WHERE lower(autor_name) LIKE ? ORDER BY autor_name, buch_name" [toSql "%b\195\182%"]

works in ghci. I have tried "b\246ll", but that doesn't even work in ghci, although the database encoding is UTF-8. This is all really annoying...

Adrian
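Those bytes are indeed the UTF-8 encoding of "böll". A minimal sketch (base only, BMP code points only; encodeUtf8 is a made-up name, and the utf8-string package's Codec.Binary.UTF8.String.encodeString does the same job properly) that reproduces the transformation:

```haskell
-- Hand-rolled UTF-8 encoding: each Char becomes the Chars of its
-- UTF-8 byte sequence, all in the 0-255 range. Sketch only; it
-- covers code points up to U+FFFF and ignores the rest.
import Data.Char (ord, chr)
import Data.Bits (shiftR, (.&.), (.|.))

encodeUtf8 :: String -> String
encodeUtf8 = concatMap enc
  where
    enc c
      | n < 0x80  = [c]
      | n < 0x800 = [ chr (0xC0 .|. (n `shiftR` 6))
                    , chr (0x80 .|. (n .&. 0x3F)) ]
      | otherwise = [ chr (0xE0 .|. (n `shiftR` 12))
                    , chr (0x80 .|. ((n `shiftR` 6) .&. 0x3F))
                    , chr (0x80 .|. (n .&. 0x3F)) ]
      where n = ord c

main :: IO ()
main = print (encodeUtf8 "b\246ll")  -- prints "b\195\182ll"
```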

aneumann:
Hi,
I wrote a CGI program to access a Postgres database using HDBC. [...]
Are you using the utf8-string package for the necessary IO?

-- Don

2008/3/26 Adrian Neumann
Hi,
I wrote a CGI program to access a Postgres database using HDBC. [...]
I think that Peter Gammie (copied) has some code to deal with this. /Björn

On 31/03/2008, at 11:42 PM, Bjorn Bringert wrote:
2008/3/26 Adrian Neumann
Hi,
I wrote a CGI program to access a Postgres database using HDBC. [...]
I think that Peter Gammie (copied) has some code to deal with this.
I added some stuff to HSQL (not HDBC) and HaskellDB so that UTF8 can be used to talk to the database. It's not very pretty though, so I haven't tried to get it merged.

cheers,
peter

On 2008-04-01, Peter Gammie wrote:
I added some stuff to HSQL (not HDBC) and HaskellDB so that UTF8 can be used to talk to the database. It's not very pretty though, so I haven't tried to get it merged.
Do you have a diff somewhere? I'd be interested in your approach, to see if the same thing would make sense in HDBC. I had generally been of the opinion that HDBC should just pass data through, however it is obtained, and let the programmer set the encoding as desired. If this approach isn't working for people, I'd like to fix it, but want to make sure it's done right.

On 03/04/2008, at 9:10 PM, John Goerzen wrote:
On 2008-04-01, Peter Gammie wrote:
I added some stuff to HSQL (not HDBC) and HaskellDB so that UTF8 can be used to talk to the database. It's not very pretty though, so I haven't tried to get it merged.
Do you have a diff somewhere? I'd be interested in your approach, to see if the same thing would make sense in HDBC.
My approach is simple: when creating a connection to the database, ask for two functions (a codec pair) and stash them with the rest of the DB state. The types involved are [Word8] and String. I started out trying to use ByteString instead of [Word8] but it didn't bring much to the table in my case.
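In outline, that codec-pair idea could look like this (a hypothetical sketch, not the actual HSQL/HaskellDB patch; DBState, connectWith and latin1 are invented names, and the Latin-1 codec is just a stand-in for a real UTF-8 pair):

```haskell
-- A connection that stashes a codec pair with the rest of the DB
-- state; every String crossing the driver boundary goes through one
-- of the two functions.
import Data.Word (Word8)
import Data.Char (ord, chr)

type Encoder = String -> [Word8]
type Decoder = [Word8] -> String

data DBState = DBState
  { dbEncode :: Encoder
  , dbDecode :: Decoder
    -- ... plus the real driver handle, omitted here
  }

-- Hypothetical connect that asks for the codec pair up front.
connectWith :: Encoder -> Decoder -> DBState
connectWith enc dec = DBState { dbEncode = enc, dbDecode = dec }

-- Trivial Latin-1 codec as a placeholder (lossy above U+00FF).
latin1 :: (Encoder, Decoder)
latin1 = (map (fromIntegral . ord), map (chr . fromIntegral))
```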
I had generally been of the opinion that HDBC should just pass data through, however it is obtained, and let the programmer set the encoding as desired. If this approach isn't working for people, I'd like to fix it, but want to make sure it's done right.
Assuming you're talking about the FFI's mandated behaviour, I'm not very comfortable with the encoding depending on the C locale, for two main reasons:
- the program I'm hacking runs on some web host's server that is an unknown quantity.
- GHC has a few quirks in how it implements the FFI; better to avoid the part that interprets characters.
If you want to be general, I guess you could abandon String (at the HDBC level) and use [Word8]/ByteString everywhere, then provide some convenience wrappers. Then you can uniformly handle binary and textual data. I am not familiar with your project though.

cheers,
peter
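The bytes-at-the-core shape with convenience wrappers might be sketched as follows (queryRaw and queryWith are invented names; queryRaw just echoes its input so the sketch runs, standing in for a real driver call):

```haskell
-- A byte-oriented core with a thin textual wrapper on top. Callers
-- with binary data use queryRaw directly; callers with text pick a
-- codec and use queryWith.
import Data.Word (Word8)

-- Stand-in for a real driver call: echoes the query bytes back.
queryRaw :: [Word8] -> IO [[Word8]]
queryRaw bs = return [bs]

-- The wrapper threads an encoder and decoder around the byte core.
queryWith :: (String -> [Word8]) -> ([Word8] -> String)
          -> String -> IO [String]
queryWith enc dec q = fmap (map dec) (queryRaw (enc q))
```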

On 2008-04-04, Peter Gammie wrote:
If you want to be general, I guess you could abandon String (at the HDBC level) and use [Word8]/ByteString everywhere, then provide some convenience wrappers. Then you can uniformly handle binary and textual data. I am not familiar with your project though.
I've looked at the Data.ByteString.Internal API, and it looks like that ought to work. Oddly, the Data.ByteString.Lazy.Internal API does not seem to export enough to work with it via the FFI. I can see this being a performance and ease-of-use win in some situations. I don't think it's an actual feature difference, though: if you can represent it as a [Word8], you can represent it as a [Char], and it will be converted to the same underlying data for the trip through the FFI.

-- John
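The round trip John describes can be seen in a pair of conversions (bytesToChars and charsToBytes are illustrative names); note that the Char-to-Word8 direction silently truncates code points above 255, which is the sort of mix-up distinct types guard against:

```haskell
-- Bytes and Chars in the 0-255 range carry the same data losslessly.
import Data.Word (Word8)
import Data.Char (ord, chr)

bytesToChars :: [Word8] -> [Char]
bytesToChars = map (chr . fromIntegral)

-- Caution: fromIntegral . ord wraps modulo 256, so any code point
-- above U+00FF is silently mangled on the way to bytes.
charsToBytes :: [Char] -> [Word8]
charsToBytes = map (fromIntegral . ord)
```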

John Goerzen wrote:
I've looked at the Data.ByteString.Internal API, and it looks like that ought to work. Oddly, the Data.ByteString.Lazy.Internal API does not seem to export enough to work with it in FFI.
It doesn't usually make sense to use lazy ByteStrings directly with the FFI. Most often, you'll work with the strict chunks, and wrap them up. One exception would be for C functions that let you do scatter/gather I/O, such as writev. You'd still be dealing with strict chunks when calling the function, but the wrapper API might expose a lazy interface.
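That chunk-at-a-time pattern might look like this (feedChunks and countBytes are invented names; a real consumer would be an FFI call reached via something like Data.ByteString.useAsCStringLen on each strict chunk):

```haskell
-- Walk a lazy ByteString's strict chunks, handing each one to a
-- consumer; the FFI boundary only ever sees strict chunks.
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.IORef (newIORef, modifyIORef, readIORef)

feedChunks :: L.ByteString -> (S.ByteString -> IO ()) -> IO ()
feedChunks lbs consume = mapM_ consume (L.toChunks lbs)

-- Example consumer: total up the bytes seen, chunk by chunk.
countBytes :: L.ByteString -> IO Int
countBytes lbs = do
  ref <- newIORef 0
  feedChunks lbs (\c -> modifyIORef ref (+ S.length c))
  readIORef ref
```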

On 04/04/2008, at 9:27 PM, John Goerzen wrote:
I can see this being a performance and ease-of-use win in some situations. I don't think it's an actual feature difference, though. If you can represent it as a [Word8], you can represent it as a [Char], and it will be converted to the same underlying data for the trip through FFI.
Sure. I prefer to keep things separate: Char is for Unicode characters, [Word8]/ByteString is for concrete representations. It's more about type safety than efficiency.

cheers,
peter
participants (6)
- Adrian Neumann
- Bjorn Bringert
- Bryan O'Sullivan
- Don Stewart
- John Goerzen
- Peter Gammie