
In article
[Crossposted to Haskell and Libraries. Replies to Libraries.]
There's a Haskell Internationalisation mailing list too. Also check out the project on SF: http://sourceforge.net/projects/haskell-i18n/ There's a bunch of my code for Unicode properties, plus a couple of UTF-8 implementations.
module System.TextIOFirstDraft (...) where
This could be put in the Text.* hierarchy.
type BlockRecoder from to = Ptr from -> BlockLength -> Ptr to -> BlockLength -> IO (BlockLength,BlockLength)
UArray and MArray would be slightly cleaner if you're doing the IO thing. But actually my biggest problem is that this is in the IO monad. Given your code, I should be able to write these without resorting to unsafePerformIO:

    encodeUTF8 :: String -> [Word8]
    decodeUTF8 :: [Word8] -> Maybe String -- Nothing if not valid

Actually, if one makes certain assumptions about encodings, you could get away with something like this:

    type Encoder base t = t -> [base]
    type Decoder base t = forall m. (Monad m) => m base -> m t

Is this any less efficient? Probably not if you're writing your BlockRecoders in Haskell.
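To make the two pure signatures above concrete, here is a minimal sketch of what such functions could look like when written directly in Haskell, with no IO and no unsafePerformIO. It is not the code from the proposal or from the haskell-i18n project; the helper names (encodeChar, multi, isCont) are made up for illustration, and the decoder assumes the usual well-formedness rules (no overlong forms, no surrogates, nothing above 0x10FFFF).

```haskell
import Control.Monad (guard)
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode every Char as one to four octets.
-- Assumption: surrogate code points are passed through unchecked,
-- so they will not round-trip through decodeUTF8 below.
encodeUTF8 :: String -> [Word8]
encodeUTF8 = concatMap encodeChar

encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Payload bits that go into the lead octet.
    top s = fromIntegral (n `shiftR` s)
    -- A continuation octet carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Decode a complete octet sequence; Nothing if it is not valid.
decodeUTF8 :: [Word8] -> Maybe String
decodeUTF8 [] = Just []
decodeUTF8 (b : bs)
  | b < 0x80              = (chr (fromIntegral b) :) <$> decodeUTF8 bs
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise             = Nothing
  where
    -- k continuation octets follow; lo is the smallest code point
    -- that legitimately needs this sequence length.
    multi k lead lo = do
      let (cs, rest) = splitAt k bs
      guard (length cs == k && all isCont cs)
      let n = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F)) lead cs
      guard (n >= lo && n <= 0x10FFFF && not (n >= 0xD800 && n <= 0xDFFF))
      (chr n :) <$> decodeUTF8 rest
    isCont x = x .&. 0xC0 == 0x80
```

Loaded into GHCi, `encodeUTF8 "\xE9"` evaluates to `[195,169]` and `decodeUTF8 [0xC0,0x80]` to `Nothing`. Because the result is a Maybe String, the whole input has to be validated before any character is produced, which is one cost of this particular signature.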
type TextEncoder = BlockRecoder Word32 Octet
type TextDecoder = BlockRecoder Octet Word32
On GHC, Char has exactly the range 0 to 0x10FFFF, as per Unicode codepoints. If this becomes standardised as part of an internationalisation effort, you might want to use Char rather than Word32.

-- Ashley Yakeley, Seattle WA
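As a small illustration of the Char versus Word32 point, the sketch below converts between the two, assuming GHC's Char range of 0 to 0x10FFFF. The function names word32ToChar and charToWord32 are hypothetical and not part of any proposed interface.

```haskell
import Data.Char (chr)
import Data.Word (Word32)

-- Convert a raw 32-bit code point to Char, returning Nothing for values
-- above maxBound :: Char instead of letting chr raise an error.
word32ToChar :: Word32 -> Maybe Char
word32ToChar w
  | w <= charToWord32 maxBound = Just (chr (fromIntegral w))
  | otherwise                  = Nothing

-- Every Char fits in a Word32, so this direction cannot fail.
charToWord32 :: Char -> Word32
charToWord32 = fromIntegral . fromEnum
```

With Char as the element type the range check moves into the type, so a decoder producing Char cannot hand back an out-of-range value in the first place.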

In article <200308050848.BAA15295@mail4.halcyon.com>, I wrote:
Actually, if one makes certain assumptions about encodings, you could get away with something like this:
type Encoder base t = t -> [base]
type Decoder base t = forall m. (Monad m) => m base -> m t
Is this any less efficient? Probably not if you're writing your BlockRecoders in Haskell.
OK, for something a bit faster for coders written in C, how about:

    type Encoder base t = UArray Int t -> UArray Int base
    data Decoder base t = forall s. MkDecoder (UArray Int base -> Maybe s -> (s,UArray Int t))

-- Ashley Yakeley, Seattle WA
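The existential Decoder type above hides the decoder's state type, so a caller can only thread the state back in, never inspect it. The following is a rough sketch of how such a decoder might be defined and driven over a list of chunks. The example decoder word16BE (big-endian octet pairs to 16-bit units) and the driver runDecoder are invented for illustration; the sketch also assumes that passing Nothing as the state means "start of input", which the message does not spell out, and it relies on the array package that ships with GHC.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
{-# LANGUAGE FlexibleContexts #-}

import Data.Array.Unboxed (IArray, UArray, elems, listArray)
import Data.Bits (shiftL, (.|.))
import Data.Word (Word16, Word8)

data Decoder base t =
  forall s. MkDecoder (UArray Int base -> Maybe s -> (s, UArray Int t))

-- Hypothetical decoder: the hidden state is the odd octet left over
-- from the previous chunk, if there was one.
word16BE :: Decoder Word8 Word16
word16BE = MkDecoder step
  where
    step :: UArray Int Word8 -> Maybe (Maybe Word8) -> (Maybe Word8, UArray Int Word16)
    step chunk st = (leftover, listArray (0, length units - 1) units)
      where
        pending = case st of
          Just (Just b) -> [b]
          _             -> []
        (units, leftover) = pairUp (pending ++ elems chunk)

-- Combine octets two at a time; report an unpaired final octet.
pairUp :: [Word8] -> ([Word16], Maybe Word8)
pairUp (hi : lo : rest) =
  let (us, l) = pairUp rest
  in ((fromIntegral hi `shiftL` 8) .|. fromIntegral lo : us, l)
pairUp [b] = ([], Just b)
pairUp []  = ([], Nothing)

-- Feed chunks through a decoder, threading the opaque state along.
-- Any state left after the last chunk is silently dropped here.
runDecoder :: IArray UArray t => Decoder base t -> [UArray Int base] -> [t]
runDecoder (MkDecoder step) = go Nothing
  where
    go _ [] = []
    go st (c : cs) = case step c st of
      (st', out) -> elems out ++ go (Just st') cs
```

Note that a value split across two chunks is still decoded correctly, since the leftover octet travels in the state. A real driver would probably also want to report leftover state at end of input as an error.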

Some general points about recoders:

1. We don't really need to work out a single recoder interface that's best for every purpose. All the types people have proposed can happily coexist, and in many cases it's easy to write functions which convert from one to another.

2. There are regional encodings in use which are not stateless (JIS is the only one I know of personally), so I think that a currentLocaleEncoding function will have to return a stateful interface.

3. I expect that most conversion functions will be implemented in C code, for three reasons: (a) performance; (b) the C code already exists, is well-tested, knows about correct locale handling on different systems, etc.; and (c) every Haskell executable will become hundreds of kilobytes larger if it has to include translation tables for every locale in which it might be run. We have little choice but to use the C library's tables, and the only interface to them is the C conversion functions.

4. Existing C conversion libraries (whose interfaces we can't change) store their state in opaque data structures and provide no way to copy that state. So all use has to be single-threaded, and if we want to enforce that statically we can't use an explicitly threaded interface. ST and IO are the only options.

-- Ben
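As one illustration of point 1, here is a sketch of converting between two of the encoder shapes proposed earlier in the thread: the list-based one (t -> [base]) and the UArray-based one. The synonym names ListEncoder and ArrayEncoder and the functions liftEncoder and lowerEncoder are made up for this example (both shapes were simply called Encoder in the thread), and the array package shipped with GHC is assumed.

```haskell
{-# LANGUAGE FlexibleContexts #-}

import Data.Array.Unboxed (IArray, UArray, elems, listArray)

-- The two stateless encoder shapes, renamed here to tell them apart.
type ListEncoder base t  = t -> [base]
type ArrayEncoder base t = UArray Int t -> UArray Int base

-- Run a per-element encoder over a whole block.
liftEncoder :: (IArray UArray t, IArray UArray base)
            => ListEncoder base t -> ArrayEncoder base t
liftEncoder enc input = listArray (0, length out - 1) out
  where
    out = concatMap enc (elems input)

-- Go the other way by wrapping each element in a one-element block.
-- This is only meaningful for stateless encodings.
lowerEncoder :: (IArray UArray t, IArray UArray base)
             => ArrayEncoder base t -> ListEncoder base t
lowerEncoder enc x = elems (enc (listArray (0, 0) [x]))
```

Conversions like these only cover the stateless case; a stateful encoding such as the one mentioned in point 2 would need the state threaded through, so it cannot be squeezed into either of these two types.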
participants (2)
- Ashley Yakeley
- Ben Rudiak-Gould