Idiomatic way to parse bytecodes

Hello all,

I have some bytecodes to parse and I am looking for the most idiomatic way to do it. It's mostly JVM Bytecode; long story short, I have some embedded JVM Bytecodes:
* one part is specification-defined, from 0x00 to 0xB8, and differs from the regular JVM Bytecode
* the other part is proprietary and changes often

I have two main goals:
* print the signification of each bytecode (something more explicit than the hexadecimal value of the bytecode)
* do some analysis on it: spot patterns, apply Markov models, and so on.

The issue with JVM Bytecode in general is that not all bytecodes are equal. Some of them require parameters, some of them modify the length of the next bytecode, etc.

I'm not looking for a way to parse bytecode quickly, or for a library that does it for me. I'm looking for a simple/composable/idiomatic way to do it. So if you have any thoughts, papers, articles or even code snippets (part of a library or not) done the Haskell way, please share them :)

Thanks in advance for your help.

Based on what you've said, I would define a Serial or Binary instance for single bytecodes, and then use this to write a function :: ByteString -> [Bytecode]. Then you can convert raw JVM bytecode into some convenient Haskell data and do whatever processing you want on it.
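For instance, a minimal sketch along those lines using Data.Binary's Get monad (the instruction type and opcodes below are invented placeholders, not the real JVM set):

import Data.Word (Word8)
import qualified Data.ByteString.Lazy as BL
import Data.Binary.Get (Get, getWord8, isEmpty, runGet)

-- Hypothetical instruction type; the real one would mirror your spec.
data Bytecode = Nop | Aload Word8 | Return | Unknown Word8
  deriving (Eq, Show)

-- Decode one instruction, consuming its operands as needed.
getBytecode :: Get Bytecode
getBytecode = do
  op <- getWord8
  case op of
    0x00 -> pure Nop
    0x19 -> Aload <$> getWord8   -- one-byte operand
    0xB1 -> pure Return
    _    -> pure (Unknown op)

-- Repeat the single-instruction decoder until the input is exhausted.
decode :: BL.ByteString -> [Bytecode]
decode = runGet go
  where
    go = do
      done <- isEmpty
      if done then pure [] else (:) <$> getBytecode <*> go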
--Will

On Sun, Aug 30, 2015 at 6:12 AM, Gautier DI FOLCO wrote:

I have two main goals:
* print the signification of each bytecode (something more explicit than the hexadecimal value of the bytecode)
* do some analysis on it: spot patterns, apply Markov models, and so on.

Work backwards, a piece at a time.
The output of parsing is a piece of structured data. So you need to design
your data type. Start with the bare minimum. Write the pretty printing
code. What do the type signatures of your analysis functions look like?
E.g. "Apply markov model" is way too vague.
The key thing is to iterate and improve on the design of your data type.
Once the data type looks good, the implementation of parsing becomes clear
as day.
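To make that concrete, here is a bare-minimum sketch of such a starting point (the field names and representation are illustrative guesses only, not a proposal for the real type):

-- Bare-minimum structured output of the parser.
data Instr = Instr
  { mnemonic :: String   -- e.g. "astore_0"
  , operands :: [Int]    -- decoded operand values, if any
  } deriving (Eq, Ord, Show)

-- Pretty printing falls out of the data type almost for free.
pretty :: Instr -> String
pretty (Instr m os) = unwords (m : map show os)

-- The analysis functions then pin down what else the type must carry,
-- e.g. something like: transitions :: [Instr] -> Map (Instr, Instr) Int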
-- Kim-Ee

2015-08-30 5:46 GMT+02:00 Kim-Ee Yeoh:
Hello,

Thanks for your answer. I think I have misled you in the expression of my needs. I don't have any representation issues; I'm just looking for a good design for the parsing phase. Here are some naive iterative tries.

0. We only have one bytecode per instruction, so a simple Map is enough:

import Data.Word
import Data.Maybe
import qualified Data.Map as H

type Byte = Word8

data ByteCodeSimpleInstruction
  = Astore_0 -- 0x4B
  | Astore_1 -- 0x4C
  | Astore_2 -- 0x4D
  | Astore_3 -- 0x4E
  deriving (Eq, Show)

sampleSI :: [Byte]
sampleSI = [0x4B, 0x4C, 0x4D, 0x4E]

convertTableSI :: H.Map Byte ByteCodeSimpleInstruction
convertTableSI = H.fromList
  [ (0x4B, Astore_0)
  , (0x4C, Astore_1)
  , (0x4D, Astore_2)
  , (0x4E, Astore_3)
  ]

parserSI :: [Byte] -> [ByteCodeSimpleInstruction]
parserSI = map $ fromJust . flip H.lookup convertTableSI

1. We add instructions with different lengths; I'm "forced" to introduce pattern matching and it makes me sad:

data ByteCodeVariableLengthInstruction
  = Astore' Byte -- 0x19
  | Astore_0'    -- 0x4B
  | Astore_1'    -- 0x4C
  | Astore_2'    -- 0x4D
  | Astore_3'    -- 0x4E
  deriving (Eq, Show)

sampleVLI :: [Byte]
sampleVLI = [0x4B, 0x4C, 0x19, 0x2A, 0x4D, 0x4E]

parserVLI :: [Byte] -> [ByteCodeVariableLengthInstruction]
parserVLI b = case b of
  (0x19:p:n) -> Astore' p : parserVLI n
  (0x4B:n)   -> Astore_0' : parserVLI n
  (0x4C:n)   -> Astore_1' : parserVLI n
  (0x4D:n)   -> Astore_2' : parserVLI n
  (0x4E:n)   -> Astore_3' : parserVLI n
  []         -> []

2. We add instructions that change the next instruction's length, so we are "forced" to add different parsing strategies:

data ByteCodeVariableLengthParameterInstruction
  = Astore'' ByteParameter -- 0x19
  | Astore_0''             -- 0x4B
  | Astore_1''             -- 0x4C
  | Astore_2''             -- 0x4D
  | Astore_3''             -- 0x4E
  | Wide''                 -- 0xDD
  deriving (Eq, Show)

data ByteParameter
  = Simple Byte
  | Double Byte Byte
  deriving (Eq, Show)

sampleVLPI :: [Byte]
sampleVLPI = [0x4B, 0x4C, 0xDD, 0x19, 0xA2, 0x2A, 0x4D, 0x4E]

parserVLPI :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPI = parserVLPISimple

parserVLPISimple :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPISimple b = case b of
  (0x19:p:n) -> Astore'' (Simple p) : parserVLPISimple n
  (0x4B:n)   -> Astore_0'' : parserVLPISimple n
  (0x4C:n)   -> Astore_1'' : parserVLPISimple n
  (0x4D:n)   -> Astore_2'' : parserVLPISimple n
  (0x4E:n)   -> Astore_3'' : parserVLPISimple n
  (0xDD:n)   -> Wide'' : parserVLPIDouble n
  []         -> []

parserVLPIDouble :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPIDouble b = case b of
  (0x19:p:q:n) -> Astore'' (Double p q) : parserVLPISimple n
  _            -> parserVLPISimple b

These feel like ad-hoc tactics and I'm looking for a higher-order way to do it. I don't know if I'm clearer now.

Thanks in advance for your help.

Regards.

PS: For the Markov part, Markov models will help me to spot the most common sequences of bytecodes. Then I'll be able to create proprietary bytecodes that are the most likely to decrease my code's size.
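As a rough sketch of what that counting might boil down to (just an illustration of the shape, not settled code), ranking consecutive pairs is enough to find candidates for fusion:

import qualified Data.Map.Strict as M

-- Count how often each pair of consecutive bytecodes occurs;
-- the most frequent pairs are the candidates for new fused opcodes.
bigrams :: Ord a => [a] -> M.Map (a, a) Int
bigrams xs = M.fromListWith (+) [ (p, 1) | p <- zip xs (drop 1 xs) ]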

On 31/08/2015, at 10:35 am, Gautier DI FOLCO wrote:

Thanks for your answer. I think I have misled you in the expression of my needs. I don't have any representation issues; I'm just looking for a good design for the parsing phase.
Depending on what you mean by "parsing", a very simple answer may suffice: use a TABLE-DRIVEN approach.

Let me take a Smalltalk byte code as an example. Instructions fall into a number of groups. For example, we have push_local <number>, where <number> might be encoded in 2 bytes, in 1 byte, or it might be implied in the opcode. So we'd have something like

data Operator = Push_Local | Store_Local | ...

data Operand = None | Implied Int | One_Byte | Two_Byte | ...

decode :: Array Int (Operator, Operand)
decode = array (0,255)
  [ (0, (Push_Local, Implied 0))
  , (1, (Push_Local, Implied 1))
  , ...
  , (8, (Push_Local, One_Byte))
  , (9, (Push_Local, Two_Byte))
  , ...

Or even make it a function:

decode :: Int -> (Operator, Operand)
decode 0 = ...
...
decode 255 = ...

then for example

extract :: Operand -> Bytes -> Int -> (Simple_Operand, Int)
extract None        b i = (No_Operand, i)
extract (Implied n) b i = (Int_Operand n, i)
extract One_Byte    b i = (Int_Operand (byte b i), i+1)
extract Two_Byte    b i = (Int_Operand (byte b i * 256 + byte b (i+1)), i+2)
...

Ideally, you'd switch to a different variant of the instruction set by simply changing the table.

When you have "portmanteau instructions", it gets a bit trickier. For example, Smalltalk systems typically have

Store_Local <index> -- store the TOS in that variable
Pop                 -- remove the TOS

combined in

Pop_Into_Local <index> -- store TOS into variable and remove it
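A cut-down but compilable version of that table-driven idea might look something like this (the opcode assignments and helper names are invented for illustration, not taken from any real instruction set):

import Data.Array (Array, listArray, (!))
import Data.Word (Word8)

data Operator = Push_Local | Store_Local | Pop
  deriving (Eq, Show)

data Operand = None | Implied Int | One_Byte | Two_Byte
  deriving (Eq, Show)

data Simple_Operand = No_Operand | Int_Operand Int
  deriving (Eq, Show)

-- Toy decode table: only four opcodes; a real one would cover 0..255.
decodeTable :: Array Word8 (Operator, Operand)
decodeTable = listArray (0, 3)
  [ (Push_Local, Implied 0)  -- opcode 0
  , (Push_Local, Implied 1)  -- opcode 1
  , (Push_Local, One_Byte)   -- opcode 2
  , (Push_Local, Two_Byte)   -- opcode 3
  ]

byte :: [Word8] -> Int -> Int
byte b i = fromIntegral (b !! i)

-- Read the operand described by the table entry, returning it together
-- with the offset of the next instruction.
extract :: Operand -> [Word8] -> Int -> (Simple_Operand, Int)
extract None        _ i = (No_Operand, i)
extract (Implied n) _ i = (Int_Operand n, i)
extract One_Byte    b i = (Int_Operand (byte b i), i + 1)
extract Two_Byte    b i = (Int_Operand (byte b i * 256 + byte b (i + 1)), i + 2)

-- Decode the whole stream by repeatedly consulting the table.
decodeAll :: [Word8] -> [(Operator, Simple_Operand)]
decodeAll bs = go 0
  where
    go i | i >= length bs = []
         | otherwise =
             let (op, shape)   = decodeTable ! (bs !! i)
                 (operand, i') = extract shape bs (i + 1)
             in (op, operand) : go i'

Switching to a different instruction-set variant then amounts to swapping decodeTable.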
PS: For the Markov part, Markov models will help me to spot the most common sequences of bytecodes. Then I'll be able to create proprietary bytecodes that are the most likely to decrease my code's size.

There's a Xerox blue-and-white report. Ah, here we go:
http://www.textfiles.com/bitsavers/pdf/xerox/parc/techReports/CSL-82-2_An_An...

Mesa was an "industrial strength Pascal" running on the Alto and the D-machines. The author collected dynamic frequencies of byte pairs. "It is hard to learn much of general interest by examining individual instruction frequencies of a small set of programs, since those statistics may highlight idiosyncratic behavior of the programs. For example, only three of the top eight most frequently executed instructions in both VlsiCheck and the compiler are the same."

There's also the companion paper from ASPLOS '82, "Static Analysis of the Mesa Instruction Set", by R. Sweet and J. Sandman.

The obvious question is WHY you want to compress your code size in this way. That report (if only I could remember title or author, sigh) pointed out that STATIC instruction (and operand) frequencies and DYNAMIC instruction (and operand) frequencies can be quite different.

A fair bit of fiddling has gone on in the Smalltalk world, even though practically everyone started with something like the Blue Book model (which is still probably *the* reference for how byte code systems work). Those people are trying to balance *compact* encoding against *efficient* decoding. If you just want to reduce size, you might think in terms of just compressing everything, with decompression of code that's actually executed as a "Level 0 JIT".
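If that last route is of interest, a trivial sketch of "just compress everything" using the zlib package (assuming lazy ByteStrings hold the code blobs):

import qualified Codec.Compression.GZip as GZip   -- from the zlib package
import qualified Data.ByteString.Lazy as BL

-- Store method bodies compressed; decompress a body the first time it runs.
packCode :: BL.ByteString -> BL.ByteString
packCode = GZip.compress

unpackCode :: BL.ByteString -> BL.ByteString
unpackCode = GZip.decompress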
participants (4)
- Gautier DI FOLCO
- Kim-Ee Yeoh
- Richard A. O'Keefe
- William Yager