
All, I finally got around to putting a "hexdump" function into a module. Any suggestions as to where it should live in the hierarchy? Dominic. module Hex( ppHex ) where import Data.List import Codec.Utils import Numeric import Text.PrettyPrint split :: Int -> [a] -> [[a]] split n xs = unfoldr (g n) xs g :: Int -> [a] -> Maybe ([a],[a]) g n [] = Nothing g n y = Just (splitAt n y) sh x | x < 16 = '0':(showHex x "") | otherwise = showHex x "" type OctetsPerLine = Int ppHex :: OctetsPerLine -> [Octet] -> Doc ppHex n = vcat . map hcat . map (intersperse colon) . map (map (text . sh)) . split n

Dominic Steinitz
I finally got around to putting a "hexdump" function into a module. Any suggestions as to where it should live in the hierarchy?
How about Codec.Hexdump Codec.Text.Hexdump Text.Codec.Hexdump I was also thinking it would be nice to have pure Haskell implementations of the various Unicode encodings. Here is my attempt at the UTF-8 codec. Like Dominic, I am wondering where such a module should live in the hierarchy? Text.Codec.UTF8 Codec.Text.UTF8 Text.UTF8 Codec.UTF8 Regards, Malcolm

Oops, I wrote: fromUTF8 (w:ws) | w < 0x80 {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws | w >= 0xc0 {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws | w >= 0xe0 {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws | w >= 0xf0 {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws | w >= 0xf8 {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws | w >= 0xfc {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws which should of course have been fromUTF8 (w:ws) | w < 0x80 {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws | w >= 0xfc {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws | w >= 0xf8 {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws | w >= 0xf0 {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws | w >= 0xe0 {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws | w >= 0xc0 {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws Regards, Malcolm

Malcolm Wallace wrote:
Oops, I wrote:
fromUTF8 (w:ws) | w < 0x80 {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws | w >= 0xc0 {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws | w >= 0xe0 {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws | w >= 0xf0 {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws | w >= 0xf8 {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws | w >= 0xfc {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws
which should of course have been
fromUTF8 (w:ws) | w < 0x80 {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws | w >= 0xfc {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws | w >= 0xf8 {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws | w >= 0xf0 {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws | w >= 0xe0 {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws | w >= 0xc0 {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws
Getting a UTF-8 decoder right is quite non-trivial. Take a look at this: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt I made a half-hearted attempt to get most of this right in GHC's UTF-8 decoder, but by no means all of it is implemented. I do think it would be nice if the Haskell implementation was correct, for some value of correct, though. Cheers, Simon

Simon Marlow wrote:
Getting a UTF-8 decoder right is quite non-trivial. Take a look at this:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
I made a half-hearted attempt to get most of this right in GHC's UTF-8 decoder, but by no means all of it is implemented. I do think it would be nice if the Haskell implementation was correct, for some value of correct, though.
There are a couple of implementations here: http://cvs.sourceforge.net/viewcvs.py/haskell-i18n/Source/Text/Encoding/ SF project: http://sourceforge.net/projects/haskell-i18n It hasn't been touched for a couple of years. -- Ashley Yakeley, Seattle WA WWED? http://www.cs.utexas.edu/users/EWD/

Malcolm Wallace wrote:
Dominic Steinitz
wrote: I finally got around to putting a "hexdump" function into a module. Any suggestions as to where it should live in the hierarchy?
How about Codec.Hexdump Codec.Text.Hexdump Text.Codec.Hexdump
I was also thinking it would be nice to have pure Haskell implementations of the various Unicode encodings. Here is my attempt at the UTF-8 codec. Like Dominic, I am wondering where such a module should live in the hierarchy?
Text.Codec.UTF8 Codec.Text.UTF8 Text.UTF8 Codec.UTF8
FWIW, our current draft hierarchy has this in Codec.Text.UTF8. (see http://darcs.haskell.org/ghc/libraries/doc/lib-hierarchy.html). Cheers, Simon

Hello Malcolm, Tuesday, March 21, 2006, 7:07:53 PM, you wrote:
I was also thinking it would be nice to have pure Haskell implementations of the various Unicode encodings. Here is my attempt at the UTF-8 codec.
UTF-8 codecs are migrating from app to app, you can find such code in the ghc, jhc, darcs... all these codecs use the ([Char] <-> [Word8]) conversion that is both slow (because lists are lazy) and can't be used in non-list environment (how, for example, we can read enough bytes to decode just one Char?). in my Streams library, i used higher-order monadic functions to implement encodings. In my model, encoder is just a higher-order function that accepts as parameter function (putByte :: (Monad m) => Word8 -> m ()) and uses it to implement (putChar :: (Monad m) => Char -> m ()) operation, so each encoder has type: utf8Encode :: (Monad m) => (Word8 -> m ()) -> Char -> m () In the same fashion, each decoder accepts parameter of functional type (getByte :: (Monad m) => m Word8), and uses it to implement (getChar :: (Monad m) => m Char) operation, so the whole decoder has type: utf8Decode :: (Monad m) => m Word8 -> m Char Using these higher-order functions allows me to implement both UTF8 (and any other) encoding for text streams and UTF8 encoding for serializing strings/chars in binary i/o module. i attached this module. -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (5)
-
Ashley Yakeley
-
Bulat Ziganshin
-
Dominic Steinitz
-
Malcolm Wallace
-
Simon Marlow