Confused about ByteString, UTF8, Data.Text and sockets, still.

Hello all,

After reading the module docs and some other discussions, I'm still not sure what's the best choice of tools for my problem. I'm looking at the scion server code base.

At the moment, it's reading and writing on sockets using lazy ByteStrings, then converting them to Haskell Strings using utf8-string. The Haskell Strings are then parsed as JSON using the json package. The response is in JSON, translated back with utf8-string to ByteStrings. This is efficient for small strings, but as I'm extending the API I have calls with much more data, and performance degrades significantly. Timings seem to point to the encoding of the String to UTF8.

I have replaced json with AttoJson (there was also JSONb, which seems quite similar), which allows me to work solely with ByteStrings, bypassing the calls to utf8-string completely. Performance has improved noticeably. I'm worried that I've lost full UTF8 compatibility, though, haven't I? No double-byte characters will work in that setup?

Is Data.Text an alternative? Can I use it everywhere, including for dealing with sockets (the API only mentions Handle)? Should I use Data.ByteString.UTF8 everywhere, rewriting the JSON parser to deal with this instead of the Word8 ByteStrings?

In short, what's the fastest way to implement receiving/sending UTF8 text across sockets?

Thanks for any pointer,

--
JP Moresmau
http://jpmoresmau.blogspot.com/
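To make the bottleneck concrete, here is a rough sketch of the round-trip described above, with the JSON step elided and the names my own (a minimal sketch, assuming utf8-string's Data.ByteString.Lazy.UTF8 module):

```haskell
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Lazy.UTF8 as LUTF8

-- The step the timings point at: utf8-string walks every byte of the
-- payload to build a linked-list String, and later walks the String
-- again to re-encode the response.
roundTrip :: L.ByteString -> L.ByteString
roundTrip payload =
  let s        = LUTF8.toString payload  -- ByteString -> String (allocation-heavy)
      response = s                       -- stand-in for the JSON parse/respond step
  in  LUTF8.fromString response          -- String -> ByteString
```

For large payloads both conversions are O(n) with a boxed cons cell per character, which is consistent with the degradation I'm seeing.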

On Friday 03 September 2010 14:04:26, JP Moresmau wrote:
> [...] I'm worried that I've lost full UTF8 compatibility, though, haven't I? No double byte characters will work in that setup?
That depends. I'm not familiar with JSON, but IIRC all its delimiters are ASCII characters, so it could just work.
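A quick way to convince yourself: every byte of a multi-byte UTF8 sequence has the high bit set, so it can never collide with an ASCII delimiter. A small check (a sketch, using the utf8-string package):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as UTF8
import Data.Word (Word8)

-- JSON's structural characters, as bytes.
delimiters :: [Word8]
delimiters = map (fromIntegral . fromEnum) "\"\\{}[],:"

main :: IO ()
main = do
  let bs = UTF8.fromString "日本語"      -- three 3-byte characters
  print (B.length bs)                    -- 9 bytes
  print (B.all (>= 0x80) bs)             -- True: every byte is non-ASCII
  print (B.any (`elem` delimiters) bs)   -- False: no delimiter collisions
```

So a parser that only inspects ASCII delimiters will pass multi-byte sequences through unharmed, as long as it never splits them.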
> Is Data.Text an alternative? Can I use that everywhere, including for dealing with sockets (the API only mentions Handle). Should I use Data.ByteString.UTF8 everywhere, rewriting the JSON parser to deal with this instead of the Word8 ByteStrings?
Data.ByteString.UTF8 uses the ordinary Word8 ByteStrings, it just offers some functions to deal with UTF8 encoding.
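For instance (a minimal sketch, assuming the utf8-string package):

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as UTF8

main :: IO ()
main = do
  let bs = UTF8.fromString "héllo"  -- an ordinary Word8 ByteString underneath
  print (B.length bs)               -- 6: 'é' occupies two bytes
  print (UTF8.length bs)            -- 5: counted as UTF8 characters
  print (UTF8.toString bs)          -- back to a Haskell String
```

So there is no separate UTF8 string type to convert to; the module just interprets the same bytes character-wise.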
> In short, what's the fastest way to implement receiving/sending UTF8 text across sockets?
The fastest way of receiving/sending UTF8 text across sockets is, I strongly believe, ByteString. After all, UTF8 text is just a sequence of bytes (with special properties). It's what you do between receiving and sending where other methods might prove better.

If you use Data.Text, you have to de/encode between UTF8 and UTF16 on receiving/sending. That won't be much faster than de/encoding between UTF8 and String, but Data.Text offers a better API for manipulating text than ByteString, so overall it could be better. It depends on what your needs are; you'll have to try it out.
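The shape of that pipeline would be roughly this (a sketch, assuming the text package's Data.Text.Encoding; note that decodeUtf8 throws an exception on malformed input, so untrusted bytes need a guard):

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8, encodeUtf8)

-- Decode once at the socket boundary, manipulate as Text,
-- re-encode once before sending. toUpper stands in for whatever
-- text processing the server actually does.
process :: B.ByteString -> B.ByteString
process = encodeUtf8 . T.toUpper . decodeUtf8

main :: IO ()
main = print (process (encodeUtf8 (T.pack "résumé")))
```

Only the boundary conversions pay the UTF8/UTF16 cost; everything in between works on the packed Text representation.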

On Fri, Sep 3, 2010 at 05:04, JP Moresmau wrote:
> [...] I'm worried that I've lost full UTF8 compatibility, though, haven't I? No double byte characters will work in that setup?
It should be easy enough to test; generate a file with non-ASCII characters in it and see if it's parsed correctly. I assume it will be, though you won't be able to perform String operations on the resulting decoded data unless you manually decode it.

Slightly more worrisome is that AttoJson doesn't look like it works with non-UTF8 JSON -- you might have compatibility problems unless you implement manual decoding.

I've written a binding to YAJL (a C-based JSON parser) which might be faster for you, if the input is very large -- though it still suffers from the "assume UTF8" problem. http://hackage.haskell.org/package/yajl
> Is Data.Text an alternative? Can I use that everywhere, including for dealing with sockets (the API only mentions Handle).
Use 'Network.Socket.socketToHandle' to convert sockets to handles: http://hackage.haskell.org/packages/archive/network/2.2.1.7/doc/html/Network...
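For example (a sketch; socketPair builds a connected pair of AF_UNIX sockets in-process, standing in for a real accepted connection, so this only runs on Unix):

```haskell
import Network.Socket (Family(AF_UNIX), SocketType(Stream),
                       socketPair, socketToHandle)
import System.IO (IOMode(ReadWriteMode), hFlush, hClose)
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = do
  (client, server) <- socketPair AF_UNIX Stream 0
  -- Once converted, the ordinary Handle-based ByteString I/O works.
  hc <- socketToHandle client ReadWriteMode
  hs <- socketToHandle server ReadWriteMode
  B.hPut hc (B.pack "{\"method\":\"ping\"}")
  hFlush hc
  msg <- B.hGet hs 17          -- the message is 17 bytes long
  B.putStrLn msg               -- prints {"method":"ping"}
  hClose hc >> hClose hs
```

That way the existing Handle-oriented code in the server needs no changes; only the accept loop converts the socket.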
participants (3)

- Daniel Fischer
- John Millikin
- JP Moresmau