Text.JSON and utf8

hi,

tl;dr: i propose this patch to Text/JSON/String.hs and would like to know why it is needed:

    @@ -375,7 +375,7 @@
      where
      go s1 =
        case s1 of
    -     (x :xs) | x < '\x20' || x > '\x7e' -> '\\' : encControl x (go xs)
    +     (x :xs) | x < '\x20' -> '\\' : encControl x (go xs)
          ('"' :xs) -> '\\' : '"' : go xs
          ('\\':xs) -> '\\' : '\\' : go xs
          (x :xs) -> x : go xs

i recently stumbled upon CouchDB telling me i'm sending invalid json.

i basically read lines from a utf8 file with german umlauts and send them to CouchDB using Text.JSON and Database.CouchDB.

    $ file lines.txt
    lines.txt: UTF-8 Unicode text

let's take 'ö' as an example. i use LANG=de_DE.utf8

ghci tells me:

    > 'ö'
    '\246'
    > putChar '\246'
    ö
    > putChar 'ö'
    ö
    > :m + Text.JSON Database.CouchDB
    > runCouchDB' $ newNamedDoc (db "foo") (doc "bar") (showJSON $ toJSObject [("test","ö")])
    *** Exception: HTTP/1.1 400 Bad Request
    Server: CouchDB/1.2.1 (Erlang OTP/R15B03)
    Date: Mon, 11 Feb 2013 13:24:49 GMT
    Content-Type: text/plain; charset=utf-8
    Content-Length: 48
    Cache-Control: must-revalidate

the couchdb log says:

    Invalid JSON: {{error,{10,"lexical error: invalid bytes in UTF8 string.\n"}},<<"{\"test\":\"<F6>\"}">>}

this is indeed hex ö:

    > :m + Numeric
    > putChar $ toEnum $ fst $ head $ readHex "f6"
    ö

if i apply the above patch and reinstall JSON and CouchDB the doc creation works:

    > runCouchDB' $ newNamedDoc (db "db") (doc "foo") (showJSON $ toJSObject [("test", "ö")])
    Right someRev

but i don't get back the ö i expected:

    > Just (_,_,x) <- runCouchDB' $ getDoc (db "foo") (doc "bar") :: IO (Maybe (Doc,Rev,JSObject String))
    > let Ok y = valFromObj "test" =<< readJSON x :: Result String
    > y
    "\195\188"
    > putStrLn y
    ü

apparently with curl everything works fine:

    $ curl localhost:5984/db/foo -XPUT -d '{"test": "ö"}'
    {"ok":true,"id":"foo","rev":"someOtherRev"}
    $ curl localhost:5984/db/foo
    {"_id":"bars","_rev":"someOtherRev","test":"ö"}

so how can i get my precious ö back? what am i doing wrong, or does Text.JSON need another patch?

another question: why does encControl in Text/JSON/String.hs handle the cases x < '\x100' and x < '\x1000' even though they can never be reached with the old predicate in encJSString (x < '\x20')?

finally: is '\x7e' the right literal for the job?

thanks for reading

have fun
martin
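
To make the two predicates in the patch concrete, here is a minimal sketch of a JSON string escaper that uses the `x < '\x20' || x > '\x7e'` test. It is not the json package's code: the names `unitEscape`, `escapeChar` and `escapeJSONString` are made up for illustration. It also shows one plausible reason for guards such as x < '\x100' and x < '\x1000': a \uXXXX escape always needs four hex digits, so shorter hex values have to be zero-padded, and those cases only matter once characters above '\x7e' are escaped too.

```haskell
import Data.Char (ord)
import Numeric (showHex)

-- Hypothetical helper: a four-digit, zero-padded \uXXXX escape
-- for a code unit below 0x10000.
unitEscape :: Int -> String
unitEscape n = "\\u" ++ replicate (4 - length hex) '0' ++ hex
  where
    hex = showHex n ""

-- Escape one character. Quotes and backslashes get a backslash;
-- anything outside printable ASCII (below 0x20 or above 0x7e)
-- becomes a \uXXXX escape, so the output is pure ASCII.
escapeChar :: Char -> String
escapeChar '"'  = "\\\""
escapeChar '\\' = "\\\\"
escapeChar c
  | n < 0x20 || n > 0x7e =
      if n < 0x10000 then unitEscape n else surrogates
  | otherwise = [c]
  where
    n = ord c
    m = n - 0x10000
    -- Code points above U+FFFF are written as a UTF-16 surrogate pair.
    surrogates =
      unitEscape (0xD800 + m `div` 0x400) ++ unitEscape (0xDC00 + m `mod` 0x400)

-- Wrap the escaped characters in double quotes.
escapeJSONString :: String -> String
escapeJSONString s = "\"" ++ concatMap escapeChar s ++ "\""
```

With this sketch, `putStrLn (escapeJSONString "ö")` prints `"\u00f6"`. Because the result contains only ASCII, it stays valid no matter how the String is later turned into bytes, whereas passing 'ö' through unescaped leaves the encoding to whatever writes the bytes.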

Don't use the json package, use aeson instead. (It's much faster and
handles encoding issues correctly).
G
On Mon, Feb 11, 2013 at 2:56 PM, Martin Hilbig wrote:
> [...]
--
Gregory Collins

Hello Martin,
the change that you propose seems to already be in json-0.7. Perhaps you
just need to 'cabal update' and install the most recent version?
About your other question: I have not used CouchDB, but a common mistake is
to mix up strings and bytes. Perhaps the `getDoc` function does not do
UTF-8 decoding, and so it is giving you back a list of bytes (as a String)?
In general, the JSON package only converts between JSON and String, and is
agnostic to what encoding is used to represent the strings. There are
other packages that convert Strings into bytes (e.g.,
http://hackage.haskell.org/package/utf8-string), so typically you want to
encode the string to bytes before you export it (say to CouchDB), and
decode it back into a string just after you've imported it.
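
As an illustration of the string/bytes distinction, here is a hand-rolled sketch using only base. All function names are made up for this example, and the decoder assumes well-formed UTF-8 and does no validation; in real code a library such as utf8-string is the better choice.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode a String of Unicode code points as UTF-8 bytes.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . go . ord)
  where
    go :: Int -> [Int]
    go n
      | n < 0x80    = [n]
      | n < 0x800   = [0xC0 .|. (n `shiftR` 6), cont n]
      | n < 0x10000 = [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
      | otherwise   = [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                      , cont (n `shiftR` 6), cont n ]
    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)

-- Decode UTF-8 bytes. ASSUMPTION: the input is well-formed;
-- malformed input is not detected.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (b : bs)
  | b < 0x80  = chr (fromIntegral b) : decodeUtf8 bs
  | b < 0xE0  = multi 1 (b .&. 0x1F)
  | b < 0xF0  = multi 2 (b .&. 0x0F)
  | otherwise = multi 3 (b .&. 0x07)
  where
    -- Combine the lead byte's payload with k continuation bytes.
    multi :: Int -> Word8 -> String
    multi k lead =
      let (cs, rest) = splitAt k bs
          step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
          n = foldl step (fromIntegral lead) cs
      in chr n : decodeUtf8 rest

-- A String whose Chars are really raw bytes (each below 256),
-- reinterpreted as UTF-8 and decoded into proper code points.
decodeByteString :: String -> String
decodeByteString = decodeUtf8 . map (fromIntegral . ord)
```

Under these assumptions, `encodeUtf8 "ü"` is `[195,188]`, which are the two values in the "\195\188" string from the ghci session, and `decodeByteString "\195\188"` gives back the single character 'ü' ('\252'). That is consistent with the guess that the String coming back from `getDoc` holds undecoded bytes.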
-Iavor
On Mon, Feb 11, 2013 at 5:56 AM, Martin Hilbig wrote:
> [...]
participants (3): Gregory Collins, Iavor Diatchki, Martin Hilbig