Sending wide characters over the network socket

Hi, I have tried to send a string of Unicode characters over a socket (or to write it into a file handle). The result is strange: it looks like characters are truncated down to their least significant bytes. Here is my program (I am new in Haskell, so the code may look not so good, but it illustrates what I am trying to achieve) =========================================== module Main where import IO import Char import Control.Concurrent import Network.Socket -- A string of wide characters wide = [chr 0x1234, chr 0x5678, chr 0x4321, chr 0x8765, chr 0x102345] -- Function to "narrow" characters to their least significant byte narrow s = map (\c -> chr $ (ord c) `mod` 256) s -- Decode string printing all of its characters' order numbers strtodec str = foldr1 (++) ( map (\c -> " "++show(ord c)++" ") str) -- Server: receive a string and print it decoded, -- narrow it and print it again. -- It does not actually loop, though it could server=do sock <- socket AF_INET Stream 6 ia <- inet_addr "127.0.0.1" bindSocket sock $ SockAddrInet 2323 ia listen sock 15 srvloop sock where srvloop sock = do (acsock,from) <- accept sock instr <- recv acsock 128 putStrLn $ "Received: " ++(strtodec instr) putStrLn $ "Lower Bytes: " ++(strtodec $ narrow instr) sClose acsock sClose sock -- Client: send a string of wide characters client=do threadDelay 100 sock <- socket AF_INET Stream 6 ia <- inet_addr "127.0.0.1" connect sock $ SockAddrInet 2323 ia cnt <- send sock wide putStrLn $ "Sent "++(show cnt)++" bytes" putStrLn $ "Source was: "++(strtodec wide) sClose sock threadDelay 100 main=do forkIO (server) client =========================================== And here is its output =========================================== Sent 5 bytes Source was: 4660 22136 17185 34661 1057605 Received: 52 120 33 101 69 Lower Bytes: 52 120 33 101 69 =========================================== Honestly, I expected that 20 bytes were sent (or something smaller if they were sent in UTF), and "Received" be identical to "Source was". The last string of output is just to check whether those are indeed lower bytes shown, not some garbage. I am using a binary distribution of GHC 6.0 on Linux - are there any special conditions I have to enable for the source distribution to be able to send/receive Unicode characters? To be more general: how would I send arbitrary binary data (stream of octets) over a socket or a file handle? Should I always assume that only lower bytes would be sent, and this will be forever in ghc? Or is it a bug? The problem is, Handle/Socket functions require a String to be the type of data to exchange; not a, say [Int8]. Therefore, I need to be able to coerce my binary data buffer to a String. PS Of course, I could write my own socket functions, but I am looking for a more or less "pure" Haskell solution. PPS I tried this only with GHC. -- Dmitry M. Golubovsky South Lyon, MI

Dimitry Golubovsky wrote:
I have tried to send a string of Unicode characters over a socket (or to write it into a file handle). The result is strange: it looks like characters are truncated down to their least significant bytes.
Yep.
Honestly, I expected that 20 bytes were sent (or something smaller if they were sent in UTF), and "Received" be identical to "Source was". The last string of output is just to check whether those are indeed lower bytes shown, not some garbage.
I am using a binary distribution of GHC 6.0 on Linux - are there any special conditions I have to enable for the source distribution to be able to send/receive Unicode characters?
No, it just isn't supported. All of the Haskell I/O functions take the bottom octet and discard the top bits.
To be more general: how would I send arbitrary binary data (stream of octets) over a socket or a file handle? Should I always assume that only lower bytes would be sent, and this will be forever in ghc?
Yes. Well, maybe not forever, but for the forseeable future.
Or is it a bug?
No. It's just a fundamental design flaw in Haskell. Presumably someone thought that wide-character support was just a question of defining Char, and forgot about a minor issue called "I/O".
The problem is, Handle/Socket functions require a String to be the type of data to exchange; not a, say [Int8]. Therefore, I need to be able to coerce my binary data buffer to a String.
Correct. IOW, lots of messing around with ord and chr and either
mod/div or the Bits library.
--
Glynn Clements

Glynn Clements wrote:
To be more general: how would I send arbitrary binary data (stream of octets) over a socket or a file handle? Should I always assume that only lower bytes would be sent, and this will be forever in ghc?
Yes. Well, maybe not forever, but for the forseeable future.
Well, if it were there forever then one might rely upon it and just construct appropriate String containing only bottom octets in proper sequence.
Or is it a bug?
No. It's just a fundamental design flaw in Haskell. Presumably someone thought that wide-character support was just a question of defining Char, and forgot about a minor issue called "I/O".
Perhaps not Haskell itself; this is just a language (i. e. a tool to describe some program logic). I think the problem is: I/O functions expose the underlying OS (Unix, Windows) to a program, and that OS has absolutely no knowledge of what String is in Haskell (or Java for example). So data chunks must be presented to those functions in the way that the OS understands best. In other words, array of octets fits better than String. In Java runtime, this gap is filled with a bulky layer of various XXXOutputStream's and a bunch of encodings. I guess, Haskell implementation could just enhance the IO subsystem by adding functions taking arrays of octets as their arguments instead of String's. This would not break compatibility with existing programs using String's. What would be really nice to have is some sort of object input/output over network. Then, by sending a String, I would expect it to be recreated at the other end. And if the other end expects, say [Int] then an error would be detected. I haven't looked through the whole GHC runtime, and maybe something like this is already there, any clue is welcome. -- Dmitry M. Golubovsky South Lyon, MI

G'day all. On Sat, Jul 05, 2003 at 11:44:54PM -0400, Dimitry Golubovsky wrote:
What would be really nice to have is some sort of object input/output over network. Then, by sending a String, I would expect it to be recreated at the other end. And if the other end expects, say [Int] then an error would be detected.
Well there is already a binding for COM (which isn't portable, of course), a beta binding for CORBA, several XML libaries and a half-finished binding for ASN.1. Any of these would get you at least part of the way there. I think what you are asking for, though, is some equivalent of Java's serializable objects. I'm not convinced that producing yet another RPC protocol would be a useful thing, even if it was optimised for Haskell. It seems to me that adequately supporting the large number of existing standards (off the top of my head: DCOM, CORBA, XML-RPC, SOAP, .NET, ASN.1 and Sun RPC; no doubt I've missed a few) would be a far better use of effort. One thing which would be good was a more extensible "derives" keyword which lets you automatically derive code to marshall and demarshall for whatever RPC protocol you want. I'd love to be able to write: data SomeComplexType = Stuff MoreStuff EvenMoreStuff | AndSomeMore AdNauseam deriving( Asn1.Ber, Asn1.Xer, XmlRpc, DCom ) and have it automatically generate any interface definitions (IDL, ASN.1 specifications, whatever) suitable to hold the data type.
I haven't looked through the whole GHC runtime, and maybe something like this is already there, any clue is welcome.
The only built-in thing is Read/Show, which is functional, but slow, as its representation is textual. Cheers, Andrew Bromage
participants (3)
-
Andrew J Bromage
-
Dimitry Golubovsky
-
Glynn Clements