converting prefixes of CString <-> String

I have been reading Foreign.C.String but it does not seem to provide the functionality I was looking for.

Let 'c2h' convert CStrings to Haskell Strings, and 'h2c' convert Haskell Strings to CStrings. (If I understand correctly, c2h . h2c === id, but h2c . c2h is not the identity on all inputs; or perhaps c2h is not defined for all CStrings. Probably this is all locale dependent.)

I have an infinite Haskell String transferred byte-wise over a network; I would like to convert some prefix of the bytes received into a prefix of the String I started with. However, if I understand correctly, if "s" is a Haskell String it is not necessarily true that "c2h (take n (h2c s))" is a prefix of s for all n.

So I have two questions:

Given a CString of the form "cs = take n (h2c s)", how do I know whether "c2h cs" is a prefix of s or not? Is there a way to recognize whether a CString is "valid" as opposed to truncated in the middle of a code point, or is this impossible? Better yet, given a CString "cs = take n (h2c s)", is there a way to find the maximal prefix cs' of cs such that c2h cs' is a prefix of s?

If s == s1 ++ s2, is it necessarily true that s == (c2h (h2c s1)) ++ (c2h (h2c s2))? If so, then I can perform my conversion a bit at a time; otherwise I'd need to start from the beginning of the CString each time I receive additional data.

In practice, I think my solution will come down to restricting my program to only using the lower 128 characters, but I'd like to know how to handle this problem in full generality.

Thanks,
Eric
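On the "valid versus truncated" question: if the bytes on the wire are UTF-8 (an assumption; the question does not fix an encoding), the answer can be read off the last few bytes alone, because lead bytes and continuation bytes have distinct bit patterns. The sketch below uses only base, and the names danglingBytes and completePrefix are made up for illustration. It also assumes the buffer is a prefix of a well-formed UTF-8 stream, so it does not try to validate anything.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Continuation bytes have the bit pattern 10xxxxxx.
isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Total length of the sequence a lead byte introduces (0 if not a lead byte).
expectedLen :: Word8 -> Int
expectedLen b
  | b < 0x80           = 1
  | b .&. 0xE0 == 0xC0 = 2
  | b .&. 0xF0 == 0xE0 = 3
  | b .&. 0xF8 == 0xF0 = 4
  | otherwise          = 0

-- Number of bytes at the end of the buffer that belong to a code point
-- whose remaining bytes have not arrived yet. Only the last four bytes
-- are inspected.
danglingBytes :: [Word8] -> Int
danglingBytes bs =
  let (conts, rest) = span isCont (take 4 (reverse bs))
      have          = length conts + 1
  in case rest of
       (lead : _) | expectedLen lead > have -> have
       _                                    -> 0

-- Longest prefix of the buffer that ends on a code point boundary.
completePrefix :: [Word8] -> [Word8]
completePrefix bs = take (length bs - danglingBytes bs) bs
```

For example, with the four bytes 0x61 0xE2 0x82 0xAC ("a" followed by the euro sign), danglingBytes on the prefixes of length 0 to 4 gives 0, 0, 1, 2, 0. On a plain list the reverse costs time linear in the buffer; with an array-like byte buffer the same check would only touch the tail.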

On 25 Apr 2011, at 08:16, Eric Stansifer wrote:
> Let 'c2h' convert CStrings to Haskell Strings, and 'h2c' convert Haskell Strings to CStrings. (If I understand correctly, c2h . h2c === id, but h2c . c2h is not the identity on all inputs;

That is correct. CStrings are made of 8-bit chars, and Haskell Strings of 32-bit Chars. Converting from Haskell to C loses information, unless you use a multi-byte encoding on the C side (for instance, UTF-8).

> or perhaps c2h is not defined for all CStrings.

Rather, h2c is not necessarily well-defined for all Haskell Strings. In particular, the marshalling functions in Foreign.C.String simply truncate any character wider than one byte to its lowest byte.

I suggest you look at the utf8-string package, for instance Codec.Binary.UTF8.String.{encode,decode}, which convert Haskell Strings to/from a list of Word8, which can then be transferred via the FFI to wherever you like.

Regards,
Malcolm

>> Let 'c2h' convert CStrings to Haskell Strings, and 'h2c' convert Haskell Strings to CStrings. (If I understand correctly, c2h . h2c === id, but h2c . c2h is not the identity on all inputs;

> That is correct. CStrings are 8-bits, and Haskell Strings are 32-bits. Converting from Haskell to C loses information, unless you use a multi-byte encoding on the C side (for instance, UTF8).

So actually I am incorrect, and h2c . c2h is the identity but c2h . h2c is not?

> I suggest you look at the utf8-string package, for instance Codec.Binary.UTF8.String.{encode,decode}, which convert Haskell strings to/from a list of Word8, which can then be transferred via the FFI to wherever you like.

This package was very helpful; I looked at the source to see how the UTF-8 encoding was done. It looks as if the functionality I want is technically feasible but not implemented yet; it shouldn't be too much trouble to implement it myself, by imitating the existing 'decode' function but changing its behavior when it runs out of input in the middle of a UTF-8 character. Also key is that the property

    s1 ++ s2 == decode (encode s1) ++ decode (encode s2)

holds.

Thanks,
Eric
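A decoder along the lines described here, one that stops cleanly when the input ends in the middle of a UTF-8 character, might look roughly like the sketch below. It is hand-rolled on top of base only and is not the utf8-string implementation; encodeUtf8, decodePrefix, seqLen and decodeOne are hypothetical names. It assumes the bytes are a prefix of a well-formed UTF-8 stream, so overlong forms and malformed continuation bytes are not checked.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode a String as UTF-8 bytes, one code point at a time.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap (map fromIntegral . go . ord)
  where
    go c
      | c < 0x80    = [c]
      | c < 0x800   = [0xC0 .|. (c `shiftR` 6), cont c]
      | c < 0x10000 = [0xE0 .|. (c `shiftR` 12), cont (c `shiftR` 6), cont c]
      | otherwise   = [ 0xF0 .|. (c `shiftR` 18), cont (c `shiftR` 12)
                      , cont (c `shiftR` 6), cont c ]
    cont x = 0x80 .|. (x .&. 0x3F)

-- Length of the sequence a lead byte introduces, if it is a lead byte.
seqLen :: Word8 -> Maybe Int
seqLen b
  | b < 0x80           = Just 1
  | b .&. 0xE0 == 0xC0 = Just 2
  | b .&. 0xF0 == 0xE0 = Just 3
  | b .&. 0xF8 == 0xF0 = Just 4
  | otherwise          = Nothing

-- Decode one complete sequence (lead byte followed by continuation bytes).
decodeOne :: [Word8] -> Char
decodeOne []       = '\xFFFD'
decodeOne (b : cs)
  | v > 0x10FFFF = '\xFFFD'
  | otherwise    = chr v
  where
    mask = case length cs of
      0 -> 0x7F
      1 -> 0x1F
      2 -> 0x0F
      _ -> 0x07
    v = foldl step (fromIntegral b .&. mask) cs
    step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)

-- Decode every complete code point; return the bytes of a trailing,
-- incomplete sequence untouched so they can be prepended to the next chunk.
decodePrefix :: [Word8] -> (String, [Word8])
decodePrefix [] = ([], [])
decodePrefix bs@(b : rest) =
  case seqLen b of
    Nothing ->
      -- Stray byte that cannot start a sequence: emit U+FFFD and move on.
      let (s, r) = decodePrefix rest in ('\xFFFD' : s, r)
    Just n
      | length sq < n -> ([], bs)
      | otherwise ->
          let (s, r) = decodePrefix more in (decodeOne sq : s, r)
      where (sq, more) = splitAt n bs
```

For instance, decodePrefix [0x61, 0xE2, 0x82] evaluates to ("a",[226,130]): the "a" is decoded and the first two bytes of the euro sign are handed back. Feeding the leftover bytes plus the next chunk into decodePrefix again continues where the previous call stopped, so nothing has to be re-decoded from the start. The concatenation property holds for this encoder because each code point is encoded independently of its neighbours.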

On 26 Apr 2011, at 13:31, Eric Stansifer wrote:
>>> Let 'c2h' convert CStrings to Haskell Strings, and 'h2c' convert Haskell Strings to CStrings. (If I understand correctly, c2h . h2c === id, but h2c . c2h is not the identity on all inputs;

>> That is correct. CStrings are 8-bits, and Haskell Strings are 32-bits. Converting from Haskell to C loses information, unless you use a multi-byte encoding on the C side (for instance, UTF8).

> So actually I am incorrect, and h2c . c2h is the identity but c2h . h2c is not?

Ah, my bad. In reading the composition from right to left, I inadvertently read h2c and c2h from right to left as well! So, starting from C, converting to Haskell, and back to C is the identity, yes. Starting from Haskell, no.

Regards,
Malcolm
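The direction of the round trip can be checked with the per-character casts exported by Foreign.C.String. In the sketch below, h2c and c2h are stand-ins that work on lists of CChar rather than on real NUL-terminated C strings, which is a simplification. As far as I know, newCString and peekCString in recent GHC releases go through the current locale's encoding instead of truncating, which is why the sketch uses castCharToCChar and castCCharToChar, which do work byte by byte.

```haskell
import Foreign.C.String (castCCharToChar, castCharToCChar)
import Foreign.C.Types (CChar)

-- Haskell to C: keeps only the lowest 8 bits of each Char.
h2c :: String -> [CChar]
h2c = map castCharToCChar

-- C to Haskell: every byte becomes a Char in the range '\0'..'\255'.
c2h :: [CChar] -> String
c2h = map castCCharToChar

-- Every possible byte value, for checking the C -> Haskell -> C direction.
allBytes :: [CChar]
allBytes = [minBound .. maxBound]
```

In GHCi, h2c (c2h allBytes) == allBytes is True, while c2h (h2c "λ") gives "\187": U+03BB has been cut down to its low byte 0xBB. Strings that stay within the first 256 code points survive the trip from Haskell and back unchanged.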

participants (2)
- Eric Stansifer
- Malcolm Wallace