2015-08-30 5:46 GMT+02:00 Kim-Ee Yeoh <ky3@atamo.com>:

On Sun, Aug 30, 2015 at 6:12 AM, Gautier DI FOLCO <gautier.difolco@gmail.com> wrote:
 
I have two main goals:
 * print the signification of each bytecodes (a more explicit one than the hexadecimal value of the bytecode)
 * Do some analysis on it, spot patterns, apply markov model on it and so on.

Work backwards, a piece at a time.

The output of parsing is a piece of structured data. So you need to design your data type. Start with the bare minimum. Write the pretty printing code. What do the type signatures of your analysis functions look like?

E.g. "Apply markov model" is way too vague.

The key thing is to iterate and improve on the design of your data type.

Once the data type looks good, the implementation of parsing becomes clear as day.

-- Kim-Ee

Hello,

Thanks for your answer.
I think I have misled you in the expression of my needs.
I don't have any representation issues, I'm just looking for a good design for the parsing phase.
Here's some naive iterative tries:
0. We only have one bytecode per instruction a simple HashMap is enough:
import Data.Word
import Data.Maybe
import qualified Data.Map as H
type Byte = Word8

data ByteCodeSimpleInstruction =
      Astore_0 -- 0x4B
    | Astore_1 -- 0x4C
    | Astore_2 -- 0x4D
    | Astore_3 -- 0x4E
    deriving (Eq, Show)

sampleSI :: [Byte]
sampleSI = [0x4B, 0x4C, 0x4D, 0x4E]

convertTableSI :: H.Map Byte ByteCodeSimpleInstruction
convertTableSI = H.fromList $ [
      (0x4B, Astore_0),
      (0x4C, Astore_1),
      (0x4D, Astore_2),
      (0x4E, Astore_3)
    ]

parserSI :: [Byte] -> [ByteCodeSimpleInstruction]
parserSI = map $ fromJust . flip H.lookup convertTableSI

1. We add instruction with different lengths, I'm "forced" to introduce pattern matching and it makes me sad:
data ByteCodeVariableLengthInstruction =
      Astore' Byte -- 0x19
    | Astore_0' -- 0x4B
    | Astore_1' -- 0x4C
    | Astore_2' -- 0x4D
    | Astore_3' -- 0x4E
    deriving (Eq, Show)

sampleVLI :: [Byte]
sampleVLI = [0x4B, 0x4C, 0x19, 0x2A, 0x4D, 0x4E]

parserVLI :: [Byte] -> [ByteCodeVariableLengthInstruction]
parserVLI b = case b of
    (0x19:p:n) -> (Astore' p):parserVLI n
    (0x4B:n)   -> Astore_0':parserVLI n
    (0x4C:n)   -> Astore_1':parserVLI n
    (0x4D:n)   -> Astore_2':parserVLI n
    (0x4E:n)   -> Astore_3':parserVLI n
    []         -> []

2. We add instructions that change the next instruction's length, we are "forced" to add different parsing strategies:
data ByteCodeVariableLengthParameterInstruction =
      Astore'' ByteParameter -- 0x19
    | Astore_0'' -- 0x4B
    | Astore_1'' -- 0x4C
    | Astore_2'' -- 0x4D
    | Astore_3'' -- 0x4E
    | Wide'' -- 0xDD
    deriving (Eq, Show)

data ByteParameter =
      Simple Byte
    | Double Byte Byte
    deriving (Eq, Show)

sampleVLPI :: [Byte]
sampleVLPI = [0x4B, 0x4C, 0xDD, 0x19, 0xA2, 0x2A, 0x4D, 0x4E]

parserVLPI :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPI = parserVLPISimple

parserVLPISimple :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPISimple b = case b of
    (0x19:p:n) -> (Astore'' (Simple p)):parserVLPISimple n
    (0x4B:n)   -> Astore_0'':parserVLPISimple n
    (0x4C:n)   -> Astore_1'':parserVLPISimple n
    (0x4D:n)   -> Astore_2'':parserVLPISimple n
    (0x4E:n)   -> Astore_3'':parserVLPISimple n
    (0xDD:n)   -> Wide'':parserVLPIDouble n
    []         -> []

parserVLPIDouble :: [Byte] -> [ByteCodeVariableLengthParameterInstruction]
parserVLPIDouble b = case b of
    (0x19:p:q:n) -> (Astore'' (Double p q)):parserVLPISimple n
    _            -> parserVLPISimple b


I feel that are too "ad-hoc" tactics and I'm looking for an higher-order way to do it.
I don't know if I'm clearer now.

Thanks in advance for you help.
Regards.

PS: For the Markov part, Markov models will help me to spot the most common sequences of bytecodes. Then I'll be able to create proprietary byte that will be the most likely to decrease my code's size.