UTF-8 in Haskell.


Magicloud Magiclouds

23 Dec 2010

5:29 a.m.

Hi,

Recently I have been reading the ssh package on Hackage (http://hackage.haskell.org/package/ssh). At the part that deals with strings, I got confused. I am not sure whether this is a bug in the package or I am just misunderstanding.

An ASCII char takes a Word8. So this works (LBS stands for Data.ByteString.Lazy):

    toLBS :: String -> LBS.ByteString
    toLBS = LBS.pack . map (fromIntegral . fromEnum)

But a UTF-8 char takes an Int (Word32). Then I think the above code would break the data, right?

If so, then I think I could write a packInt which first turns an Int into 4 Word8s. Thus in every situation (ASCII, UTF-8, or even UTF-32), my program would always send 4 bytes per character through the network. Is that OK?

-- 竹密岂妨流水过山高哪阻野云飞
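Editorial aside: a minimal sketch of what the conversion in the question does to characters that do not fit in one byte. `toLBS` is copied from the message; `fitsInOneByte` is a hypothetical helper added only for illustration, and nothing beyond the bytestring package is assumed.

```haskell
import qualified Data.ByteString.Lazy as LBS
import Data.Char (ord)

-- Conversion from the question: fromIntegral wraps each code point
-- modulo 256, so only the low 8 bits of every character survive.
toLBS :: String -> LBS.ByteString
toLBS = LBS.pack . map (fromIntegral . fromEnum)

-- Hypothetical helper: True when no character would be truncated.
fitsInOneByte :: String -> Bool
fitsInOneByte = all ((< 256) . ord)
```

In GHCi, `LBS.unpack (toLBS "abc")` gives `[97,98,99]`, while `LBS.unpack (toLBS "\x03BB")` (Greek lambda, code point 955) gives `[187]`, because 955 mod 256 is 187. Even when `fitsInOneByte` holds, code points 128 to 255 come out as single Latin-1 style bytes, which is not UTF-8.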


Mark Lentczner

23 Dec

6:01 a.m.

On Dec 22, 2010, at 9:29 PM, Magicloud Magiclouds wrote:

...

Thus under all situation (ascii, UTF-8, or even UTF-32), my program always send 4 bytes through the network. Is that OK?

Generally, no.

Haskell strings are sequences of Unicode characters. Each character has an integral code point value, from 0 to 0x10ffff, but technically the code point itself is just a number, not a pattern of bits to be exchanged. That pattern of bits is an encoding.

In any protocol you need to know the encoding before you exchange characters as bytes or words. In some protocols it is implicit, in others it is explicit in a header or metadata, and in yet others (IRC comes to mind) it is undefined, which causes problems for the user.

The UTF-8 encoding uses a variable number of bytes (one to four) to represent each character, depending on the code point, not a Word32 as you suggested.

Converting from Haskell's String to various encodings can be done with either the "text" package or the "utf8-string" package.

- Mark
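Editorial aside: a small sketch of the variable-width behaviour described above, assuming the "text" and "bytestring" packages. The names `toUtf8` and `utf8Width` are hypothetical, chosen for this illustration only.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encode a Haskell String as UTF-8 bytes for the wire.
toUtf8 :: String -> BS.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

-- Number of bytes UTF-8 needs for a single character.
utf8Width :: Char -> Int
utf8Width = BS.length . TE.encodeUtf8 . T.singleton
```

For example, `map utf8Width "a\xE9\x7AF9\x1F600"` evaluates to `[1,2,3,4]`: the width grows with the code point, so no fixed-size word per character appears on the wire.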

Magicloud Magiclouds

6:15 a.m.

I see. I just realized that in this case (ssh) I could use CString to avoid all the encoding problems.

-- 竹密岂妨流水过山高哪阻野云飞

Maciej Piechotka

7:56 a.m.

On Thu, 2010-12-23 at 14:15 +0800, Magicloud Magiclouds wrote:

I see. I just realize that, in this case (ssh), I could use CString to avoid all problems about encoding.

By using CString you may avoid the problems only by pushing them onto users. CString is char *, and the Foreign marshalling functions just use ASCII. As a user of computer programs who does not speak only English, I ask for Unicode support (for example UTF-8). Unless you mean only commands, not data, in which case you should probably check the details of the protocol.

In any case, I don't think CString is the correct approach for network data; you should probably use ByteString instead of CString.

Regards
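Editorial aside: one way to follow this advice is to keep received data as a ByteString and decode it only where text is actually needed. The sketch below assumes the peer sends UTF-8 and uses the "text" package; `decodeWire` is a hypothetical name, not part of the ssh package.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode bytes received from the network, assuming the peer sent UTF-8.
-- Invalid input yields Left instead of a runtime exception.
decodeWire :: BS.ByteString -> Either String T.Text
decodeWire bytes =
  case TE.decodeUtf8' bytes of
    Left err   -> Left (show err)
    Right text -> Right text
```

With this, `decodeWire (BS.pack [0xC3, 0xA9])` is a `Right` holding the single character U+00E9, while `decodeWire (BS.pack [0xFF])` is a `Left`, since 0xFF never occurs in valid UTF-8.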

Max Bolingbroke

11:18 a.m.

On 23 December 2010 05:29, Magicloud Magiclouds wrote:

...

If so, OK, then I think I could make a packInt which turns an Int into 4 Word8 first. Thus under all situation (ascii, UTF-8, or even UTF-32), my program always send 4 bytes through the network. Is that OK?

I think you are describing the UTF-32 encoding (under the assumption that fromEnum on Char returns the Unicode code point of that character, which I think is true). UTF-32 is capable of describing every Unicode code point, so this is indeed non-lossy. UTF-32 is a reasonable wire transfer format (if a bit inefficient!).

Don't roll your own encoding logic, though: System.IO provides a TextEncoding for UTF-32 that you can use to do the job more reliably.

Cheers, Max
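Editorial aside: a minimal sketch of the System.IO route, using the `utf32be` TextEncoding on a file handle. The helper names `writeUtf32BE` and `readRaw` are hypothetical; a file is used here only so the bytes can be inspected, and the same `hSetEncoding` call should apply to any other Handle.

```haskell
import qualified Data.ByteString as BS
import System.IO

-- Write a String to a file as big-endian UTF-32 using the TextEncoding
-- that System.IO exports. utf32be writes no byte order mark.
writeUtf32BE :: FilePath -> String -> IO ()
writeUtf32BE path s =
  withFile path WriteMode $ \h -> do
    hSetEncoding h utf32be
    hPutStr h s

-- Read the raw bytes back to inspect what was written.
readRaw :: FilePath -> IO [Int]
readRaw path = map fromIntegral . BS.unpack <$> BS.readFile path
```

Writing `"A"` this way and reading it back with `readRaw` should give `[0,0,0,65]`: every character occupies exactly four bytes, which is the fixed-width scheme the question proposed, obtained without hand-written packing code.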

Magicloud Magiclouds

27 Dec

1:40 a.m.

Thanks for the ideas.

In this case, ssh is a transport-layer protocol, which means it does not convert anything. For example, if the server and the client both use ASCII, all is well. If the client uses UTF-8 instead, the user might get a broken display, but ssh itself would not care.

My idea of using CString came from C, where this is easy: "I" do not have to pay attention to which encoding the given string uses. But I am not sure how CString works. If it just converts things into ASCII, then it is bad.

-- 竹密岂妨流水过山高哪阻野云飞

Antoine Latter

2:32 a.m.

Hi,

What would you be using the CString for? A CString is really a lot less useful than a ByteString for almost all purposes. If I already had a ByteString, the only reason I would want to convert it to a CString is to call a C function.

Take care, Antoine
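Editorial aside: a sketch of the one case mentioned here, converting a ByteString to a CString only at the boundary of a C call. C's `strlen` stands in for an arbitrary C function; `cStrlen` is a hypothetical wrapper name.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString as BS
import Foreign.C.String (CString)
import Foreign.C.Types (CSize (..))

-- C's strlen, imported only to have some C function to call.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Keep the data as a ByteString; useAsCString makes a temporary
-- NUL-terminated copy that is valid only inside the callback.
cStrlen :: BS.ByteString -> IO Int
cStrlen bs = fromIntegral <$> BS.useAsCString bs c_strlen
```

The bytes are passed through untouched, so `cStrlen (BS.pack [0xC3, 0xA9])` returns 2: C sees two bytes, not one character. A ByteString containing a zero byte would be cut short by `strlen`, which is one more reason to stay with ByteString until a C function really needs the pointer.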

_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Magicloud Magiclouds

6:48 a.m.

Sorry, I just noticed that I had a misunderstanding here. With the encode and bytestring packages, I think my requirement can be met.

-- 竹密岂妨流水过山高哪阻野云飞
