
[Crossposted to Haskell and Libraries. Replies to Libraries.]

{-
Good things about this text library design:

  * Efficient implementation should be straightforward

  * Character coder interface is public, so users can supply their own
    encodings, or write coder transformers (there are some in the
    proposal)

Bad things:

  * There's no way to implement fgetpos/fsetpos type functionality,
    because coders don't expose their internal state. (In fact, there
    would need to be a way to explicitly copy the state, since it may
    well include IORefs, Ptrs, etc.) Is this a serious problem?
-}

module System.TextIOFirstDraft (...) where

-- A BlockRecoder takes source and destination buffers and does some sort
-- of translation between them. It returns the number of values (not
-- bytes!) consumed and the number of values produced. It does not have to
-- empty the input buffer or fill the output buffer on each call, but it
-- must do something (i.e. it's not acceptable to return (0,0)). Coders
-- will in general have internal state which is updated on each call.

type BlockRecoder from to =
  Ptr from -> BlockLength -> Ptr to -> BlockLength -> IO (BlockLength,BlockLength)

type TextEncoder = BlockRecoder Word32 Octet
type TextDecoder = BlockRecoder Octet Word32

-- IO TextEncoder and IO TextDecoder below denote "coder factories" which
-- produce a new coder in its initial state each time they're called.

compatibilityEncoder :: IO TextEncoder   -- "mod 256"
currentLocaleEncoder :: IO TextEncoder
iso88591Encoder      :: IO TextEncoder
latin1Encoder        = iso88591Encoder
utf8Encoder, utf16BEEncoder, utf16LEEncoder, utf32BEEncoder, utf32LEEncoder :: IO TextEncoder

compatibilityDecoder :: IO TextDecoder
currentLocaleDecoder :: IO TextDecoder
iso88591Decoder      :: IO TextDecoder
latin1Decoder        = iso88591Decoder
utf8Decoder, utf16BEDecoder, utf16LEDecoder, utf32BEDecoder, utf32LEDecoder :: IO TextDecoder

-- An attempt at supporting setlocale-type locale strings.

lookupEncoder :: String -> Maybe (IO TextEncoder)
lookupDecoder :: String -> Maybe (IO TextDecoder)

-- prependBOM takes an existing UTF encoder and causes it to prepend
-- a BOM (byte-order mark) to its output. autodetectUTF takes an existing
-- decoder and modifies it to check for a BOM, switching to the
-- appropriate type of UTF decoding if one is found.

prependBOM    :: IO TextEncoder -> IO TextEncoder
autodetectUTF :: IO TextDecoder -> IO TextDecoder

-- Attaches a TextInputChannel to the supplied InputChannel. After
-- this operation the InputChannel should be considered owned by the
-- TextInputChannel; any attempt to use it directly will cause
-- unpredictable results. This takes a decoder factory rather than a
-- decoder to prevent the error of attaching the same decoder to more than
-- one channel.

icAttachTextDecoder :: InputChannel -> IO TextDecoder -> IO TextInputChannel

ticGetChar         :: TextInputChannel -> IO Char
ticGetLine         :: TextInputChannel -> IO String
ticLazyGetContents :: TextInputChannel -> IO String

ocAttachTextEncoder :: OutputChannel -> IO TextEncoder -> IO TextOutputChannel

tocPutChar  :: TextOutputChannel -> Char -> IO ()
tocPutStr   :: TextOutputChannel -> String -> IO ()
tocPutStrLn :: TextOutputChannel -> String -> IO ()

-- ... etc ...

-- Ben
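To make the BlockRecoder contract above concrete, here is a minimal sketch of what a stateless ISO 8859-1 encoder factory might look like. Several things are assumptions that the proposal does not specify: BlockLength is taken to be Int, Octet to be Word8, unencodable code points are replaced with '?', and encodeAll is a hypothetical one-shot helper used only for demonstration. Note that this sketch returns (0,0) when either buffer is empty, since nothing can be done in that case.

```haskell
import Data.Word (Word8, Word32)
import Foreign.Marshal.Array (allocaArray, peekArray, withArrayLen)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- Assumed definitions; the proposal leaves both types unspecified.
type BlockLength = Int
type Octet = Word8

type BlockRecoder from to =
  Ptr from -> BlockLength -> Ptr to -> BlockLength -> IO (BlockLength, BlockLength)

type TextEncoder = BlockRecoder Word32 Octet

-- A coder factory for ISO 8859-1. The coder has no internal state, so the
-- factory simply returns the same function every time.
iso88591Encoder :: IO TextEncoder
iso88591Encoder = return encode
  where
    encode :: TextEncoder
    encode src srcLen dst dstLen = do
      -- One input value yields exactly one output value, so the amount of
      -- work is bounded by the smaller of the two buffers.
      let n = min srcLen dstLen
          step i = do
            c <- peekElemOff src i
            -- Assumption: code points above 0xFF become '?' (0x3F).
            pokeElemOff dst i (if c <= 0xFF then fromIntegral c else 0x3F)
      mapM_ step [0 .. n - 1]
      return (n, n)  -- (values consumed, values produced)

-- Hypothetical helper: encode a whole list with a single coder call.
encodeAll :: IO TextEncoder -> [Word32] -> IO [Octet]
encodeAll factory input = do
  enc <- factory
  withArrayLen input $ \len src ->
    allocaArray len $ \dst -> do
      (_, produced) <- enc src len dst len
      peekArray produced dst
```

A real channel implementation would call the coder repeatedly, advancing both pointers by the returned counts, rather than sizing the output buffer to the input as encodeAll does.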

I'm not subscribed to libraries@haskell.org but, FWIW,

On Thu, 31 Jul 2003, Ben Rudiak-Gould wrote:
(snip)
iso88591Encoder :: IO TextEncoder
(snip)

it might be nice to give iso88591Encoder a name that still looks okay and doesn't run the two different number groups together without any separation (assuming it pertains to ISO 8859-1). A rather minor issue, perhaps, given that the near future isn't likely to make the already-suggested scheme unduly ambiguous. (-:

-- Mark

Ben Rudiak-Gould wrote:
module System.TextIOFirstDraft (...) where
-- A BlockRecoder takes source and destination buffers and does some sort
-- of translation between them. It returns the number of values (not
-- bytes!) consumed and the number of values produced. It does not have to
-- empty the input buffer or fill the output buffer on each call, but it
-- must do something (i.e. it's not acceptable to return (0,0)). Coders
-- will in general have internal state which is updated on each call.
It would be preferable if this wasn't all within the IO monad. It
shouldn't be necessary, even for stateful encodings.
--
Glynn Clements

On Fri, 1 Aug 2003, Glynn Clements wrote:
Ben Rudiak-Gould wrote:
-- A BlockRecoder takes source and destination buffers and does some sort
-- of translation between them. It returns the number of values (not
-- bytes!) consumed and the number of values produced. It does not have to
-- empty the input buffer or fill the output buffer on each call, but it
-- must do something (i.e. it's not acceptable to return (0,0)). Coders
-- will in general have internal state which is updated on each call.
It would be preferable if this wasn't all within the IO monad. It shouldn't be necessary, even for stateful encodings.
The problem is that all the Ptr access functions are in the IO monad. Switching to a different monad would require the use of unsafePerformIO or unsafeIOToST, and I couldn't see any obvious way to guarantee safety. It seems better for the library user to make the unsafe conversions explicitly if they're to be made at all.

The other possibility would be to drop the Ptrs and use STUArrays or immutable data structures, but I think that would inevitably be less efficient (especially given that many/most text coders are likely to be implemented using libc functions), and the standard text I/O library needs to be efficient.

-- Ben
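The "explicit unsafe conversion by the library user" mentioned above could be sketched as follows. This is only an illustration under assumptions: recodeST and widen are hypothetical names, BlockLength and Octet are assumed to be Int and Word8, and unsafeIOToST is imported from Control.Monad.ST.Unsafe as in current versions of base. Nothing here makes the call safe; the caller still has to guarantee that no one else touches the buffers.

```haskell
import Control.Monad.ST (ST, stToIO)
import Control.Monad.ST.Unsafe (unsafeIOToST)
import Data.Word (Word8, Word32)
import Foreign.Marshal.Array (allocaArray, peekArray, withArrayLen)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekElemOff, pokeElemOff)

-- Assumed definitions; the proposal leaves both types unspecified.
type BlockLength = Int
type Octet = Word8

type BlockRecoder from to =
  Ptr from -> BlockLength -> Ptr to -> BlockLength -> IO (BlockLength, BlockLength)

-- Hypothetical helper: run one coder step inside an ST computation.
-- The unsafe conversion is visible at the call site, which is the point:
-- the application, not the library, vouches for the buffers.
recodeST :: BlockRecoder from to
         -> Ptr from -> BlockLength -> Ptr to -> BlockLength
         -> ST s (BlockLength, BlockLength)
recodeST coder src srcLen dst dstLen =
  unsafeIOToST (coder src srcLen dst dstLen)

-- Stand-in coder for the demo: widens each octet to a Word32, which is
-- what a Latin-1 decoder would do.
widen :: BlockRecoder Octet Word32
widen src srcLen dst dstLen = do
  let n = min srcLen dstLen
  mapM_ (\i -> peekElemOff src i >>= pokeElemOff dst i . fromIntegral) [0 .. n - 1]
  return (n, n)

-- Allocation still happens in IO; stToIO runs the ST action there.
demo :: [Octet] -> IO [Word32]
demo input =
  withArrayLen input $ \len src ->
    allocaArray len $ \dst -> do
      (_, produced) <- stToIO (recodeST widen src len dst len)
      peekArray produced dst
```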

This can all be done safely in the state monad (whether or not external C calls are used). I vaguely remember a binding to iconv floating around which was based on the state monad. There was also an implementation (based on IO, but rather advanced) in the qforeign distribution which might be useful to look at (see Conv*.hs).

John

On Thu, Jul 31, 2003 at 11:19:48PM -0700, Ben Rudiak-Gould wrote:
On Fri, 1 Aug 2003, Glynn Clements wrote:
Ben Rudiak-Gould wrote:
-- A BlockRecoder takes source and destination buffers and does some sort
-- of translation between them. It returns the number of values (not
-- bytes!) consumed and the number of values produced. It does not have to
-- empty the input buffer or fill the output buffer on each call, but it
-- must do something (i.e. it's not acceptable to return (0,0)). Coders
-- will in general have internal state which is updated on each call.
It would be preferable if this wasn't all within the IO monad. It shouldn't be necessary, even for stateful encodings.
The problem is that all the Ptr access functions are in the IO monad. Switching to a different monad would require the use of unsafePerformIO or unsafeIOToST, and I couldn't see any obvious way to guarantee safety. It seems better for the library user to make the unsafe conversions explicitly if they're to be made at all.
The other possibility would be to drop the Ptrs and use STUArrays or immutable data structures, but I think that would inevitably be less efficient (especially given that many/most text coders are likely to be implemented using libc functions), and the standard text I/O library needs to be efficient.
-- Ben
--
---------------------------------------------------------------------------
John Meacham - California Institute of Technology, Alum. - john@foo.net
---------------------------------------------------------------------------

On Thu, 31 Jul 2003, John Meacham wrote:
On Thu, Jul 31, 2003 at 11:19:48PM -0700, Ben Rudiak-Gould wrote:
On Fri, 1 Aug 2003, Glynn Clements wrote:
Ben Rudiak-Gould wrote:
A BlockRecoder takes source and destination buffers and does some sort of translation between them...
It would be preferable if this wasn't all within the IO monad. It shouldn't be necessary, even for stateful encodings.
The problem is that all the Ptr access functions are in the IO monad...
this can all be done safely in the state monad (when using external c calls or not). I vaguely remember a binding to iconv floating around which was based on the state monad. there was also an implementation (based on IO, but rather advanced) in the qforeign distribution which might be useful to look at. (See Conv*.hs)
Certainly the coder's internal state can be safely encapsulated in ST, even if it contains Ptrs internally. The problem is the Ptrs passed to the block-conversion function. Since they aren't parameterized by the state variable, you have no idea where they've been. I suppose you could create an STPtr type, but you'd have to impose such draconian restrictions on it to ensure safety that it hardly seems worth it. (STPtrs couldn't be passed to C code, for example.)

If an application wants to use these converters in an ST thread then it's the application's responsibility to ensure safety, and so it should also be the application's responsibility to use unsafeIOToST.

I suspect that the iconv binding that you mentioned uses an ST-safe data type like a list or an STUArray. I don't think that's efficient enough to be used here (but I'd love to be proven wrong!).

-- Ben
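For comparison, the kind of coder that needs no IO at all, using "an ST-safe data type like a list", might look like the sketch below: a UTF-8 decoder whose state is threaded explicitly from one chunk to the next. Everything here is an assumption made for illustration (the names Utf8State, initialState and decodeChunk, Octet as Word8, and the choice to emit U+FFFD for malformed input). It does not reject overlong forms or surrogates, and it says nothing about whether this approach is fast enough, which is the open question above.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8, Word32)

-- Assumed definition; the proposal leaves Octet unspecified.
type Octet = Word8

-- Decoder state carried between chunks: the code point bits gathered so
-- far and the number of continuation bytes still expected.
data Utf8State = Utf8State { acc :: !Word32, pending :: !Int }
  deriving (Eq, Show)

initialState :: Utf8State
initialState = Utf8State 0 0

-- Decode one chunk of octets. Returns the code points completed within
-- this chunk and the state to pass to the next call, so a multi-byte
-- sequence may be split across chunk boundaries.
decodeChunk :: Utf8State -> [Octet] -> ([Word32], Utf8State)
decodeChunk st [] = ([], st)
decodeChunk (Utf8State a n) (b : bs)
  | n > 0 && isCont =
      let a' = (a `shiftL` 6) .|. fromIntegral (b .&. 0x3F)
      in if n == 1
           then emit a' initialState bs
           else decodeChunk (Utf8State a' (n - 1)) bs
  -- Sequence cut short: emit U+FFFD (an assumption) and reprocess b.
  | n > 0              = emit 0xFFFD initialState (b : bs)
  | b < 0x80           = emit (fromIntegral b) initialState bs
  | b .&. 0xE0 == 0xC0 = decodeChunk (Utf8State (fromIntegral (b .&. 0x1F)) 1) bs
  | b .&. 0xF0 == 0xE0 = decodeChunk (Utf8State (fromIntegral (b .&. 0x0F)) 2) bs
  | b .&. 0xF8 == 0xF0 = decodeChunk (Utf8State (fromIntegral (b .&. 0x07)) 3) bs
  -- Stray continuation byte or invalid lead byte.
  | otherwise          = emit 0xFFFD initialState bs
  where
    isCont = b .&. 0xC0 == 0x80
    emit c st' rest =
      let (cs, end) = decodeChunk st' rest
      in (c : cs, end)
```

For example, feeding [0x48, 0xC3] and then [0xA9] to decodeChunk, passing the state returned by the first call into the second, yields 0x48 from the first call and 0xE9 from the second.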

On Fri, 1 August 2003 00:35, Ben Rudiak-Gould wrote:
type BlockRecoder from to = Ptr from -> BlockLength -> Ptr to -> BlockLength -> IO (BlockLength,BlockLength)
What should decoders do on malformed data?
type TextEncoder = BlockRecoder Word32 Octet
type TextDecoder = BlockRecoder Octet Word32
I would use Char instead of Word32.

--
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/
participants (5)
- Ben Rudiak-Gould
- Glynn Clements
- John Meacham
- Marcin 'Qrczak' Kowalczyk
- Mark Carroll