
Hi,
I have recently been reading the ssh package on Hackage (http://hackage.haskell.org/package/ssh). When I got to the part that deals with strings, I got confused. I am not sure whether this is a bug in the package or whether I am just misunderstanding.
An ASCII char fits in a Word8, so this works (LBS stands for Data.ByteString.Lazy):
    toLBS :: String -> LBS.ByteString
    toLBS = LBS.pack . map (fromIntegral . fromEnum)
But a UTF-8 char takes an Int (Word32). Then I think the above code would break the data, right?
If so, OK, then I think I could make a packInt which first turns an Int into 4 Word8. Thus in all situations (ASCII, UTF-8, or even UTF-32), my program would always send 4 bytes through the network. Is that OK?
-- 竹密岂妨流水过 山高哪阻野云飞
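The suspected data loss can be checked directly. The following is a minimal sketch, not code from the ssh package: toLBS is copied from the message above, while fitsInOneByte is a hypothetical helper added only for illustration. It relies on fromIntegral wrapping modulo 256 when narrowing an Int to a Word8.

```haskell
import qualified Data.ByteString.Lazy as LBS
import Data.Char (ord)

-- The conversion quoted above: each Char is narrowed to a single Word8,
-- so only the low 8 bits of the code point survive.
toLBS :: String -> LBS.ByteString
toLBS = LBS.pack . map (fromIntegral . fromEnum)

-- Hypothetical helper (not part of the ssh package): True when every
-- code point is at most 0xFF, so the narrowing in toLBS loses nothing.
fitsInOneByte :: String -> Bool
fitsInOneByte = all ((<= 0xFF) . ord)
```

For example, LBS.unpack (toLBS "a") is [97], but LBS.unpack (toLBS "\x3BB") (the letter lambda, U+03BB) is [187]: only the low byte 0xBB is kept, so the original character cannot be recovered.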

On Dec 22, 2010, at 9:29 PM, Magicloud Magiclouds wrote:
Thus in all situations (ASCII, UTF-8, or even UTF-32), my program always sends 4 bytes through the network. Is that OK?
Generally, no.
Haskell strings are sequences of Unicode characters. Each character has an integral code point value, from 0 to 0x10FFFF, but technically the code point itself is just a number, not a pattern of bits to be exchanged. The pattern of bits is determined by an encoding.
In any protocol you need to know the encoding before you exchange characters as bytes or words. In some protocols it is implicit, in others it is explicit in a header or metadata, and in yet others (IRC comes to mind) it is undefined (which causes problems for the user).
The UTF-8 encoding uses a variable number of bytes (one to four) to represent each character, depending on the code point, not a Word32 as you suggested.
Converting from Haskell's String to various encodings can be done with either the "text" package or the "utf8-string" package.
- Mark
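As a rough illustration of the variable-length point, here is a sketch that assumes the "text" and "bytestring" packages are available. The names toUtf8 and fromUtf8 are made up for this example; encodeUtf8 and decodeUtf8' are the functions from Data.Text.Encoding.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode a String as UTF-8 bytes via the "text" package.
toUtf8 :: String -> BS.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

-- Decode UTF-8 bytes back to a String. decodeUtf8' reports malformed
-- input as a Left value instead of throwing an exception.
fromUtf8 :: BS.ByteString -> Either String String
fromUtf8 = either (Left . show) (Right . T.unpack) . TE.decodeUtf8'
```

With these definitions, BS.length (toUtf8 "a") is 1, BS.length (toUtf8 "\x3BB") is 2, BS.length (toUtf8 "\x20AC") is 3, and BS.length (toUtf8 "\x1F600") is 4, so the byte count depends on the code point rather than being a fixed 4.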

On Thu, Dec 23, 2010 at 2:01 PM, Mark Lentczner wrote:
In any protocol you need to know the encoding before you exchange characters as bytes or words.
I see. I just realized that, in this case (ssh), I could use CString to avoid all the encoding problems.

On Thu, 2010-12-23 at 14:15 +0800, Magicloud Magiclouds wrote:
I see. I just realized that, in this case (ssh), I could use CString to avoid all the encoding problems.
By using CString you may avoid the problems only by pushing them onto users. CString is a char *, and the Foreign marshalling functions just use ASCII. As a user of computer programs who does not speak only English, I ask for Unicode support (for example UTF-8), unless you mean only commands, not data, in which case you should probably check the details of the protocol. In any case, I don't think CString is the correct approach to network data; you should probably use ByteString in place of CString.
Regards

On 23 December 2010 05:29, Magicloud Magiclouds wrote:
If so, OK, then I think I could make a packInt which first turns an Int into 4 Word8. Thus in all situations (ASCII, UTF-8, or even UTF-32), my program always sends 4 bytes through the network. Is that OK?
I think you are describing the UTF-32 encoding (under the assumption that fromEnum on Char returns the Unicode code point of that character, which I think is true). UTF-32 is capable of representing every Unicode code point, so this is indeed non-lossy. UTF-32 is a reasonable wire transfer format (if a bit inefficient!).
Don't roll your own encoding logic though; System.IO provides a TextEncoding for UTF-32 that you can use to do the job more reliably.
Cheers, Max
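The TextEncoding values in System.IO apply to Handles. For an in-memory ByteString, a sketch could instead lean on the "text" package, which is a substitution made here for illustration and not what the message above recommends. toUtf32BE and fromUtf32BE are hypothetical names; encodeUtf32BE and decodeUtf32BE come from Data.Text.Encoding.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode a String as big-endian UTF-32: four bytes per code point,
-- with no byte order mark.
toUtf32BE :: String -> BS.ByteString
toUtf32BE = TE.encodeUtf32BE . T.pack

-- Decode big-endian UTF-32 bytes back to a String.
-- Note: decodeUtf32BE throws an exception on malformed input.
fromUtf32BE :: BS.ByteString -> String
fromUtf32BE = T.unpack . TE.decodeUtf32BE
```

Here BS.unpack (toUtf32BE "a\x3BB") is [0,0,0,97,0,0,3,187], which is the fixed four-bytes-per-character layout that a hand-written packInt would produce, provided both ends agree on the byte order.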

Thanks for the ideas.
In this case, ssh is a transport-layer protocol, which means it does not convert anything. For example, if the server and the client both use ASCII, all is well. If the client uses UTF-8 instead, the user might get a broken display, but ssh itself would not care.
My reason for considering CString is that in C this is easy: "I" do not have to pay attention to which encoding the given string uses.
But I am not sure how CString works. If it just converts things into ASCII, then it is bad.
On Thu, Dec 23, 2010 at 7:18 PM, Max Bolingbroke wrote:
Don't roll your own encoding logic though, System.IO provides a TextEncoding for UTF-32 you can use to do the job more reliably.

Hi,
What would you be using the CString for? A CString is really a lot less useful than a ByteString for almost all purposes. If I already had a ByteString, the only reason I would want to convert it to a CString is to call a C function.
Take care,
Antoine
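The "only to call a C function" case might look like the sketch below. It assumes the C library's strlen is available for linking, which is normally the case; c_strlen and cLength are names chosen for this example. useAsCString is the function from Data.ByteString.

```haskell
import qualified Data.ByteString as BS
import Foreign.C.String (CString)
import Foreign.C.Types (CSize (..))

-- C's strlen, imported through the FFI.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Hand a ByteString to C only for the duration of the call.
-- useAsCString copies the bytes and appends a NUL terminator.
cLength :: BS.ByteString -> IO Int
cLength bs = BS.useAsCString bs (fmap fromIntegral . c_strlen)
```

The bytes are passed through untouched, so no encoding is applied in either direction. One caveat of the C view: cLength (BS.pack [97, 0, 98]) yields 1, because strlen stops at the first NUL byte, whereas BS.length reports 3.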
On Sun, Dec 26, 2010 at 7:40 PM, Magicloud Magiclouds wrote:
But I am not sure how CString works. If it just converts things into ASCII, then it is bad.

Sorry, I just noticed that I had misunderstood something here.
With the encode and bytestring packages, I think my requirements should be covered.
On Mon, Dec 27, 2010 at 10:32 AM, Antoine Latter wrote:
What would you be using the CString for? A CString is really a lot less useful than a ByteString for almost all purposes.
participants (5)
- Antoine Latter
- Maciej Piechotka
- Magicloud Magiclouds
- Mark Lentczner
- Max Bolingbroke