Parsing binary 'hierarchical' objects for lazy developers

Hi Haskellers,

I'm currently serializing / unserializing a bunch of bytestrings which are somehow related to each other and I'm wondering if there is a way in Haskell to ease my pain.

The first thing I'm looking for is to be able to automatically derive "Serializable" objects, for example:

---------------------------------------------------------------
import Data.Serialize -- using cereal as an example

data MyFlag = One | Two | Three

instance Serialize [MyFlag] where
    put = putWord16le . marshalFlags
    get = unmarshal `fmap` getWord16le

data ObjectA = ObjectA { attribute0 :: Word8
                       , attribute1 :: Word16le
                       , attribute2 :: [MyFlag]
                       }
    deriving (Serialize) -- magic goes here!
---------------------------------------------------------------

Unfortunately ghci complains that 'Serialize' is not a derivable class. Yet, deriving the Serialize instance for ObjectA should be simple, since all three attributes are already serializable themselves...

Second issue, I would like to find a way to dispatch parsers. I'm not very good at expressing my problem in English, so I will use another code example:

---------------------------------------------------------------
-- let's say we have two objects with almost the same structure:
data ObjectA = ObjectA { objLength :: Int
                       , objType :: TypeId
                       , attribute2a :: [MyFlag]
                       }

data ObjectB = ObjectB { objLength :: Int
                       , objType :: TypeId
                       , attribute2b :: Word32le
                       }
---------------------------------------------------------------

When we begin to deserialize these objects, we don't know their final type, we just know how to read their length and their typeId. Only then can we determine if what we are parsing is an ObjectA or an ObjectB. Once we know the object type, we can resume the parsing and return either an ObjectA or ObjectB.

Oki, so I may have read too much of Peter Seibel's chapter on binary-data parsing in Common Lisp or spent too much time working on object-oriented code, but currently, I have no idea how to write this 'simply' in Haskell :(

any help would be welcome
/john

On Wed, 2011-04-27 at 20:16 +0200, John Obbele wrote:
Oki, so I may have read too much of Peter Seibel's chapter on binary-data parsing in Common Lisp or spent too much time working on object-oriented code, but currently, I have no idea on how to write this 'simply' in Haskell :(
I believe the following should work:

    -- needs Control.Applicative for (<$>), (<*>), (<|>) and empty;
    -- put is omitted, only the parsing side is shown
    instance Serialize ObjectA where
        get = check =<< (ObjectA <$> get <*> get <*> get)
          where check obj@(ObjectA len id attr)
                  | len < 10 && id == 0 = return obj
                  | otherwise = empty

    instance Serialize ObjectB where
        get = check =<< (ObjectB <$> get <*> get <*> get)
          where check obj@(ObjectB len id attr)
                  | len > 10 && id == 1 = return obj
                  | otherwise = empty

    -- try to parse an ObjectA first; if its check fails, try an ObjectB
    parseEitherAB :: Get (Either ObjectA ObjectB)
    parseEitherAB = (Left <$> get) <|> (Right <$> get)

Regards

On Wed, Apr 27, 2011 at 09:46:08PM +0200, Maciej Marcin Piechotka wrote:
I believe following should work
Hum, that's still not automatic, but using Control.Applicative to write one-liners seems a good-enough solution. I will try it.

thanks ;)
/john

John Meacham's DrIFT tool used to get extended faster than GHC for things that "should" be automatic. I'm not sure of its current status, though:

http://repetae.net/computer/haskell/DrIFT/

For your second problem, something like this:

    getAB :: Get (Either A B)
    getAB = do
        len <- getWord16be
        tag <- getWord16be
        if tag == 0x00
            then do { a <- getA len; return (Left a) }
            else do { b <- getB len; return (Right b) }

    -- length already consumed so sent as an argument...
    getA :: Word16 -> Get A
    getB :: Word16 -> Get B
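The tag dispatch suggested in this reply can be made runnable roughly as follows. This is only a sketch under stated assumptions: it uses the binary package (Data.Binary.Get, which ships with GHC) in place of cereal, the payload types A and B and their fields are invented, and the tag value 0x01 for B is hypothetical. Unknown tags are rejected here rather than treated as B.

```haskell
import Data.Binary.Get (Get, getWord16be, getWord32be, runGet)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16, Word32)

-- Hypothetical payload types, invented for this sketch.
newtype A = A Word16 deriving (Eq, Show)
newtype B = B Word32 deriving (Eq, Show)

-- Read the shared header (length, then tag) and pick the body parser.
getAB :: Get (Either A B)
getAB = do
  len <- getWord16be
  tag <- getWord16be
  case tag of
    0x00 -> Left  <$> getA len
    0x01 -> Right <$> getB len   -- assumed tag value for B
    _    -> fail ("unknown tag: " ++ show tag)

-- The length is already consumed, so it is passed in as an argument.
-- This sketch ignores it; a real parser might use it to bound the read.
getA :: Word16 -> Get A
getA _len = A <$> getWord16be

getB :: Word16 -> Get B
getB _len = B <$> getWord32be

-- Sample input: length 2, tag 0x00, then a two-byte body.
sampleA :: BL.ByteString
sampleA = BL.pack [0x00, 0x02, 0x00, 0x00, 0x00, 0x07]
```

With these definitions, `runGet getAB sampleA` evaluates to `Left (A 7)`. No backtracking is involved: the tag alone decides which body parser runs.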

On Wed, Apr 27, 2011 at 11:16 AM, John Obbele wrote:
The first thing I'm looking for, is to be able to automatically derive "Serializable" objects, for example:
Happstack has a "Serialize" type class, and uses Template Haskell to automate deriving instances. I don't know whether they are binary compatible with cereal (i.e., whether you could serialize with one and deserialize with the other, or vice versa).
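Another route to automatic instances, not mentioned in the thread, is GHC.Generics. The sketch below uses the binary package (shipped with GHC) rather than cereal or Happstack; I believe later cereal releases offer comparable Generic-based defaults for Serialize, but that should be checked against its documentation. Note the assumption that the library's own wire format is acceptable: the derived encoding is not the little-endian flag word from the original post, and Word16 stands in for the post's Word16le.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Binary (Binary, decode, encode)
import Data.Word (Word16, Word8)
import GHC.Generics (Generic)

data MyFlag = One | Two | Three
  deriving (Eq, Show, Generic)

data ObjectA = ObjectA
  { attribute0 :: Word8
  , attribute1 :: Word16
  , attribute2 :: [MyFlag]
  } deriving (Eq, Show, Generic)

-- Empty instance bodies: put and get are filled in by the Generic defaults.
instance Binary MyFlag
instance Binary ObjectA

-- Encode a value and decode it again; the result should equal the input.
roundTrip :: ObjectA -> ObjectA
roundTrip = decode . encode
```

If the byte layout is dictated by an existing format, hand-written put and get (or the Applicative one-liners shown earlier in the thread) remain necessary, since derived instances choose their own layout.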
Second issue, I would like to find a way to dispatch parsers. I'm not very good at expressing my problem in English, so I will use another code example:
This sounds very hard in the general case. Others have shown you how to dispatch on two types. But there is no general data type which combines all (or even arbitrarily many) types. Somehow, "Read" is able to do this, but I don't know what kind of magic it uses.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 27 April 2011 21:28, Alexander Solla wrote:
On Wed, Apr 27, 2011 at 11:16 AM, John Obbele wrote:
Second issue, I would like to find a way to dispatch parsers. I'm not very good at expressing my problem in English, so I will use another code example:
This sounds very hard in the general case. Others have shown you how to dispatch on two types. But there is no general data type which combines all (or even arbitrarily many) types. Somehow, "Read" is able to do this, but I don't know what kind of magic it uses.
Read always "demands its type" so it doesn't use any magic - if the input string doesn't conform it will throw an error. Any sensible binary format will have a scheme such as tag byte prefixes to control choice in parsing (binary parsing generally avoids all backtracking). If your binary data doesn't have a proper scheme it will be hard to parse for any language (or cast-to in the case of C), so the most sensible answer is to revise the format.
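A small illustration of the point that Read "demands its type": the same input string parses or fails depending only on the type requested by the caller. The sketch uses readMaybe from Text.Read, which returns Nothing on a mismatch where plain read would throw an error; the names asInt, asBool and asList are made up for the example.

```haskell
import Text.Read (readMaybe)

-- The requested result type selects the parser; the input does not.
asInt :: Maybe Int
asInt = readMaybe "42"        -- Just 42

asBool :: Maybe Bool
asBool = readMaybe "42"       -- Nothing: "42" is not a Bool

asList :: Maybe [Int]
asList = readMaybe "[1,2,3]"  -- Just [1,2,3]
```

So Read never discovers the type from the data, which is why a binary format needs an explicit tag if the parser is to choose between alternatives.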

On Wed, Apr 27, 2011 at 10:18:47PM +0100, Stephen Tetley wrote:
Read always "demands its type" so it doesn't use any magic - if the input string doesn't conform it will throw an error.
Any sensible binary format will have a scheme such as tag byte prefixes to control choice in parsing (binary parsing generally avoids all backtracking). If your binary data doesn't have a proper scheme it will be hard to parse for any language (or cast-to in the case of C), so the most sensible answer is to revise the format.
Oki, so far the use of the Control.Applicative magic, the syntax sugar for monadic operations and manually written if/then/else or 'case of' branching statements have helped me considerably in the parsing task. I have not tried DrIFT since I prefer to avoid pre-processors for now.

So the only quirk that is still upsetting me is the 'deriving' issue: if I know that what I am parsing can only result in ObjectA or ObjectB, everything is simple. But when someone decides to add an extension to the binary format, let's say a new tag identifier and a new ObjectC with a different size and new attributes, I will have to rewrite part of my Haskell parser.

I think I will just have to rewrite the abstract type to 'data AbstractObject = ObjectA | ObjectB | ObjectC', let my parser keep the type signature 'parser :: B.ByteString -> AbstractObject', and modify the branching inside to add the C tag identifier: 'case tag identifier of A -> ... B -> ... C -> ...'.

It's not straightforward but it should be manageable.

regards,
/john
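The plan in this last message could look roughly like the sketch below. Everything specific is an assumption made for illustration: the constructor names, the payload fields, the one-byte tag and its values, and the use of the binary package instead of cereal. The parser also returns Either String rather than a bare AbstractObject so that an unknown tag is reported instead of raising an error.

```haskell
import Data.Binary.Get (Get, getWord16le, getWord32le, getWord8, runGetOrFail)
import qualified Data.ByteString.Lazy as BL
import Data.Word (Word16, Word32, Word8)

-- One constructor per object kind; the payloads are invented.
data AbstractObject
  = ObjA Word16
  | ObjB Word32
  | ObjC Word8 Word8   -- added later, when the format is extended
  deriving (Eq, Show)

-- Read the tag, then branch to the matching body parser.
getObject :: Get AbstractObject
getObject = do
  tag <- getWord8
  case tag of
    0 -> ObjA <$> getWord16le
    1 -> ObjB <$> getWord32le
    2 -> ObjC <$> getWord8 <*> getWord8   -- the only new branch for ObjectC
    _ -> fail ("unknown tag " ++ show tag)

-- Total wrapper: Left carries the error message instead of throwing.
parser :: BL.ByteString -> Either String AbstractObject
parser bytes = case runGetOrFail getObject bytes of
  Left  (_, _, err) -> Left err
  Right (_, _, obj) -> Right obj
```

Supporting a new object kind then touches two places, the data declaration and one case branch, and the signature of parser stays the same.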
participants (4)
- Alexander Solla
- John Obbele
- Maciej Marcin Piechotka
- Stephen Tetley