
On 31/08/2015, at 10:35 am, Gautier DI FOLCO
Thanks for your answer. I think I have misled you in the expression of my needs. I don't have any representation issues, I'm just looking for a good design for the parsing phase.
Depending on what you mean by "parsing", a very simple answer may suffice. Use a TABLE-DRIVEN approach. Let me take a Smalltalk byte code as an example. Instructions fall into a number of groups. For example, we have push_local <number> where <number> might be encoded in 2 bytes, in 1 byte, or it might be implied in the opcode. So we'd have something like data Operator = Push_Local | Store_Local | ... data Operand = None | Implied Int | One_Byte | Two_Byte | ... decode :: Array Int (Operator, Operand) decode = array (0,255) [ (0, (Push_Local, Implied 0)), (1, (Push_Local, Implied 1)), ... (8, (Push_Local, One_Byte)), (9, (Push_Local, Two_Byte)), ... Or even make it a function: decode :: Int -> (Operator,Operand) decode 0 = ... ... decode 255 = ... then for example extract :: Operand -> Bytes -> Int -> (Simple_Operand, Int) extract None b i = (No_Operand, i) extract (Implied n) b i = (Int_Operand n, i) extract (One_Byte) b i = (Int_Operand (byte b i), i+1) extract (Two_Byte) b i = (Int_Operand (byte b i * 256 + byte b (i+1)),i+2) ... Ideally, you'd switch to a different variant of the instruction set by simply changing the table. When you have "portmanteau instructions", it gets a bit trickier. For example, Smalltalk systems typically have Store_Local <index> -- store the TOS in that variable Pop -- remove the TOS combined in Pop_Into_Local <index> -- store TOS into variable and remove it
PS: For the Markov part, Markov models will help me to spot the most common sequences of bytecodes. Then I'll be able to create proprietary byte that will be the most likely to decrease my code's size.
There's a Xerox blue-and-white report. Ah, here we go. http://www.textfiles.com/bitsavers/pdf/xerox/parc/techReports/CSL-82-2_An_An... Mesa was an "industrial strength Pascal" running on the Alto and the D-machines. The author collected dynamic frequencies of byte pairs. "It is hard to learn much of general interest by examining individual instruction frequencies of a small set of programs, since those statistics may highlight idiosyncratic behavior of the programs. For example, only three of the top eight most frequently executed instructions in both VlsiCheck and the compiler are the same." There's also the companion paper from ASPLOS'82, "Static Analysis of the Mesa Instruction Set". R.Sweet, J.Sandman. The obvious question is WHY you want to compress your code size in this way. That report (if only I could remember title or author, sigh, pointed out that STATIC instruction (and operand) frequencies and DYNAMIC instruction (and operand) frequencies can be quite different. A fair bit of fiddling has gone on in the Smalltalk world, so that although practically everyone started with something like the Blue Book model (which is still probably *the* reference for how byte code systems work). Those people are trying to balance *compact* encoding against *efficient* decoding. If you just want to reduce size you might think in terms of just compressing everything, with decompression of code that's actually executed as "Level 0 JIT".