
Hi, I'm relatively new to Haskell so please bear with me. I'm trying to parse Java class files with Data.Binary and I'm having a few problems: (The class file format is described here: http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html and the bytecode instructions are described here: http://java.sun.com/docs/books/jvms/second_edition/html/Instructions.doc.htm... ) 1. The class file format contains a number of tables. The table definitions start with the length of the list and carry on with that many table entries. Lists would be a good representation for them in Haskell, because there is not need to index them directly (except with the constants table). I've created my own list type so that I can redefine the serialisation functions for it so that the serialisation matches the format defined in the class file format: newtype MyList e = MkList ([e]) deriving Show instance (Binary e) => Binary (MyList e) where put (MkList es) = do put (fromIntegral (length es) :: Word16) mapM_ put es get = do n <- get :: Get Word16 xs <- replicateM (fromIntegral n) get return (MkList xs) The problem is that one of the tables, namely the attribute_info structures, use a u32 length field whereas all the other tables use a u16 length field. My implementation uses u16, but it would be nice to be able to use the same data type for both types of tables. I think I can do it by adding a lenght field to MyList and specifying the type when I use MyList in some other data structure, but that would also mean that I have to keep track of the length of the list manually? I'm basically copy-pasting the same code just to use a u32 length field in the serialised form: data Info = MkInfo [Word8] deriving Show instance Binary Info where put (MkInfo xs) = do put (fromIntegral (length xs) :: Word32) mapM_ put xs get = do n <- get :: Get Word32 xs <- replicateM (fromIntegral n) get return (MkInfo xs) 2. This is the bigger problem. The Java class file contains a constants table in the beginning of the file. The other fields later on in the class file contain indexes that reference entries in that constants table. So in order to be able to replace an index in a data structure with the actual string, I need to be able to look up the string from the constants table while I'm deserialising the field. My guess is that I should be able to put the constants table into a state monad. On the other hand Data.Binary already uses the state monad for holding onto the binary data being deserialised. So it's not clear to me if I can use StateT with Data.Binary.Get? And if not, can I implement my own state monad and do it that way? I'm not very comfortable with Monads yet, so I might be missing something very obvious. This is what the get function looks like in my top-level Data.Binary instance: get = do magic <- get :: Get Word32 case magic == magicNumber of -- class files start with 0xCAFEBABE True -> do min <- get -- minor version number u16 maj <- get -- major version number u16 c <- get -- constants table a <- get -- access flags (public, abstract, ...) u16 t <- get -- a u16 index pointing to the name of this class in the constants table s <- get -- a u16 index pointing to the name of the super class in the constants table i <- get -- a table of interfaces f <- get -- a table of fields m <- get -- a table of methods attrs <- get -- a table of class level attributes return (ClassFile (min, maj) c a t s i f m attrs) False -> error "Invalid magic number" Thanks for all the help! -- ! Lauri

On Sun, 2008-01-20 at 18:18 +0000, Lauri Pesonen wrote:
Hi,
I'm relatively new to Haskell so please bear with me. I'm trying to parse Java class files with Data.Binary and I'm having a few problems:
(The class file format is described here: http://java.sun.com/docs/books/jvms/second_edition/html/ClassFile.doc.html and the bytecode instructions are described here: http://java.sun.com/docs/books/jvms/second_edition/html/Instructions.doc.htm... )
1. The class file format contains a number of tables. The table definitions start with the length of the list and carry on with that many table entries. Lists would be a good representation for them in Haskell, because there is not need to index them directly (except with the constants table). I've created my own list type so that I can redefine the serialisation functions for it so that the serialisation matches the format defined in the class file format:
newtype MyList e = MkList ([e]) deriving Show
instance (Binary e) => Binary (MyList e) where put (MkList es) = do put (fromIntegral (length es) :: Word16) mapM_ put es
get = do n <- get :: Get Word16 xs <- replicateM (fromIntegral n) get return (MkList xs)
The problem is that one of the tables, namely the attribute_info structures, use a u32 length field whereas all the other tables use a u16 length field. My implementation uses u16, but it would be nice to be able to use the same data type for both types of tables. I think I can do it by adding a lenght field to MyList and specifying the type when I use MyList in some other data structure, but that would also mean that I have to keep track of the length of the list manually?
You may want to consider using the other side of Data.Binary rather than the Binary class. The -class- Binary is intended for de/serialization when you don't care about the format. From the documentation: "For parsing and generating simple external binary formats (e.g. C structures), Binary may be used, but in general is not suitable for complex protocols. Instead use the Put and Get primitives directly." Nevertheless, one way to solve your problem is with a phantom type. Change MyList to, newtype MyList t e = MkList [e] deriving Show getLengthType :: MyList t e -> t getLengthType = undefined instance (Binary e) => Binary (MyList t e) where put l@(MkList es) = do put (fromIntegral (length es) `asTypeOf` getLengthType l) mapM_ put es get = do n <- get xs <- replicateM (fromIntegral (n `asTypeOf` getLengthType t)) get return (MkList xs `asTypeOf` t) where t = undefined The asTypeOfs are just to propagate the type information around. GHC's extension for scoped type variables would make this code simpler and more direct. At any rate, now the code will use the Binary instance for whatever type t is to serialize the length.
2. This is the bigger problem. The Java class file contains a constants table in the beginning of the file. The other fields later on in the class file contain indexes that reference entries in that constants table. So in order to be able to replace an index in a data structure with the actual string, I need to be able to look up the string from the constants table while I'm deserialising the field.
My guess is that I should be able to put the constants table into a state monad. On the other hand Data.Binary already uses the state monad for holding onto the binary data being deserialised. So it's not clear to me if I can use StateT with Data.Binary.Get? And if not, can I implement my own state monad and do it that way? I'm not very comfortable with Monads yet, so I might be missing something very obvious.
If you mean that you there references to the constant table in e.g. the fields table then the problem here is that you need to the class methods to use that monad transformer (in this case, ReaderT is all you should need and not even that), but you can't change their type. These are the kind of issues that make the Binary class unsuitable for this type of work. If that is the case, the only way to use this is to explicitly write out the deserialization code rather than relying on get, i.e. you'll have to write a function 'getTable constantTable' that will deserialize the table.

Hi Derek,
Thanks for the reply.
On 20/01/2008, Derek Elkins
You may want to consider using the other side of Data.Binary rather than the Binary class. The -class- Binary is intended for de/serialization when you don't care about the format. From the documentation:
"For parsing and generating simple external binary formats (e.g. C structures), Binary may be used, but in general is not suitable for complex protocols. Instead use the Put and Get primitives directly."
Yes, you are right. I read that bit of the documentation, but didn't understand how to use the Get and Put primitives. So I started by using the Binary class and after getting quick results carried on with it. I'll rewrite what I have with Get and Put once I figure out how they actually work.
Nevertheless, one way to solve your problem is with a phantom type. Change MyList to, newtype MyList t e = MkList [e] deriving Show
getLengthType :: MyList t e -> t getLengthType = undefined
...
The asTypeOfs are just to propagate the type information around. GHC's extension for scoped type variables would make this code simpler and more direct. At any rate, now the code will use the Binary instance for whatever type t is to serialize the length.
Ah, very clever. Is this a common idiom in Haskell?
If you mean that you there references to the constant table in e.g. the fields table then the problem here is that you need to the class methods to use that monad transformer (in this case, ReaderT is all you should need and not even that), but you can't change their type. These are the kind of issues that make the Binary class unsuitable for this type of work. If that is the case, the only way to use this is to explicitly write out the deserialization code rather than relying on get, i.e. you'll have to write a function 'getTable constantTable' that will deserialize the table.
As an example, a method is serialised in the class file as the following structure: method_info { u2 access_flags; u2 name_index; u2 descriptor_index; u2 attributes_count; attribute_info attributes[attributes_count]; } Both the name_index and the descriptor_index are indices that point into the constants pool. In Haskell I would represent this as the type data Method = Method Access String String [Attribute] So I want to replace the indices with the actual strings that they point to. I guess in the final deserialised version of the class file the constant pool would not exist at all and it would be recreated when the class file is serialised again. So, AFAICT, I should be able to first deserialise the constants table as part of deserialising the whole class file and then pass the constant pool into the functions that deserialise fields, methods etc. so that they are able to look up the constants from the pool. I have to stare at the Get and Put primitives as well as ReaderT to figure out how all this will work together. Let's call it a learning experience. -- ! Lauri

On Mon, 2008-01-21 at 16:18 +0000, Lauri Pesonen wrote:
Hi Derek,
Thanks for the reply.
On 20/01/2008, Derek Elkins
wrote: You may want to consider using the other side of Data.Binary rather than the Binary class. The -class- Binary is intended for de/serialization when you don't care about the format. From the documentation:
"For parsing and generating simple external binary formats (e.g. C structures), Binary may be used, but in general is not suitable for complex protocols. Instead use the Put and Get primitives directly."
Yes, you are right. I read that bit of the documentation, but didn't understand how to use the Get and Put primitives. So I started by using the Binary class and after getting quick results carried on with it. I'll rewrite what I have with Get and Put once I figure out how they actually work.
Nevertheless, one way to solve your problem is with a phantom type. Change MyList to, newtype MyList t e = MkList [e] deriving Show
getLengthType :: MyList t e -> t getLengthType = undefined
...
The asTypeOfs are just to propagate the type information around. GHC's extension for scoped type variables would make this code simpler and more direct. At any rate, now the code will use the Binary instance for whatever type t is to serialize the length.
Ah, very clever. Is this a common idiom in Haskell?
Phantom types are a very common technique. They allow you to use the static type system to enforce guarantees (and, in this case, generate code). Essentially they let you add types to an untyped representation giving you the benefits of both. Usually you can avoid doing the asTypeOf chicanery I did.
If you mean that you there references to the constant table in e.g. the fields table then the problem here is that you need to the class methods to use that monad transformer (in this case, ReaderT is all you should need and not even that), but you can't change their type. These are the kind of issues that make the Binary class unsuitable for this type of work. If that is the case, the only way to use this is to explicitly write out the deserialization code rather than relying on get, i.e. you'll have to write a function 'getTable constantTable' that will deserialize the table.
As an example, a method is serialised in the class file as the following structure:
method_info { u2 access_flags; u2 name_index; u2 descriptor_index; u2 attributes_count; attribute_info attributes[attributes_count]; }
Both the name_index and the descriptor_index are indices that point into the constants pool. In Haskell I would represent this as the type
data Method = Method Access String String [Attribute]
So I want to replace the indices with the actual strings that they point to. I guess in the final deserialised version of the class file the constant pool would not exist at all and it would be recreated when the class file is serialised again.
So, AFAICT, I should be able to first deserialise the constants table as part of deserialising the whole class file and then pass the constant pool into the functions that deserialise fields, methods etc. so that they are able to look up the constants from the pool.
I have to stare at the Get and Put primitives as well as ReaderT to figure out how all this will work together. Let's call it a learning experience.
As I hinted, ReaderT isn't really necessary either in this case. All it amounts to is passing an extra parameter. In this case it is probably clearer and simpler to explicitly pass the extra parameter, though you can decide that for yourself. Anyway, this documentation page may be more useful: http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/index.html Basically your code will look similar to what it does now except instead of using get you'll use getWord32le or whatever and your own functions that will deserialize more involved structures, e.g. a getTable function.
participants (2)
-
Derek Elkins
-
Lauri Pesonen