Parsec to parse tree structures?

Hello Café

Some time ago I wrote a parser for a project of one of our customers. The format was proprietary and binary. The data was structured as a tree, with tables pointing to sub-tables farther in the file. (Well, actually there were one or two cases where branches joined together, so I guess it was a directed graph.) It also had no defined endianness: some tables were big-endian, others were little-endian, and some were in their very own in-house endianness. All in all, the typical binary data structure that has been in use and continuously extended for the last 15 years. You know the kind.

Oddly enough, our customer never bothered to write a parser of their own. I wonder why.

My parser was written in C# and wasn't particularly elegant, but it worked reliably. I was wondering how you would parse tree-like structures with Parsec (or other functional parsers)? Up to now, all examples I've seen were of sequential data. To parse trees, you'd essentially need to be able to follow pointers, parse whatever is there, and then jump back. I guess I'd have to mess around with the internal state of the parser to do that, which is something I'd rather avoid.

regards, dave

On 14 March 2010 16:03, david fries wrote:
Oddly enough, our customer never bothered to write a parser of their own. I wonder why.
Hi David

If the binary structure was previously used only with C programs, it's quite common just to use casting to unpack the data into a struct - though your example seems to suggest this wasn't being done, as the format had both big- and little-endian tables.

In Haskell or other modern functional languages like SML, parse trees are generally represented as algebraic types - so there are no pointers. If you're familiar with ANTLR from the OO world, it's rather like working with the tree definition directly, as opposed to generating classes from the data description language.

Best wishes
Stephen

Hi Stephen

Perhaps my description of the format was a bit unclear. When I said pointer, I simply meant a number which is the position in the stream. Imagine the tables looking something like this:

  RootTable
    HeaderMagicNumber (1 Byte):  0x50
    VersionNumber     (2 Bytes): 1234
    SubTablePointer   (4 Bytes): 200  ----------.
    ...some more fields of the root table       |
                                                |
  SubTable (positioned at byte 200 of the file) <'
    SomeFlags (1 Byte): 0x00
    Comment (Variable String): "hello, world!\0"

  AnotherTable
    ...

Our parser produced object instances representing each table, and I used normal references between tables instead of pointers. You had to start parsing at the beginning of the file (where the root table is), otherwise you have no clue where in the structure you are, because all tables after the root table are dynamically positioned when the whole thing is serialized.

So, to parse the root, you read the first byte, check that it's 0x50, read the next two bytes (which are the version number), then you read four bytes (the pointer to the sub table). The sub table starts at byte 200 of the file, so now I would jump to that position in the file and start parsing SubTable. After that I'd jump back and parse the remaining fields of the root table.

On Sun, 2010-03-14 at 16:23 +0000, Stephen Tetley wrote:
On 14 March 2010 16:03, david fries
wrote: [SNIP] Oddly enough, our customer never bothered to write a parser of their own. I wonder why.
Hi David
If the binary structure was previously used only with C programs, it's quite common just to use casting to unpack the data into a struct - though your example seems to suggest this wasn't being done, as the format had both big- and little-endian tables.
In Haskell or other modern functional languages like SML, parse trees are generally represented as algebraic types - so there are no pointers. If you're familiar with ANTLR from the OO world, it's rather like working with the tree definition directly, as opposed to generating classes from the data description language.
Best wishes
Stephen

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
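The jump-and-return parse David describes can be sketched in Haskell without any parser library at all, by reading the whole file into a strict ByteString and passing absolute offsets around - random access is then free. The record types and field names below (RootTable, SubTable, parseRoot, etc.) are made up to mirror the example layout; a real parser would need bounds checks and the format's actual field list.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8, Word16, Word32)
import Data.Bits (shiftL, (.|.))

-- Hypothetical record types mirroring the tables in the example layout.
data SubTable = SubTable { flags :: Word8, comment :: B.ByteString }
  deriving Show

data RootTable = RootTable { version :: Word16, sub :: SubTable }
  deriving Show

-- Big-endian readers over a strict ByteString at an absolute offset.
u8 :: B.ByteString -> Int -> Word8
u8 = B.index

u16be :: B.ByteString -> Int -> Word16
u16be bs i = fromIntegral (u8 bs i) `shiftL` 8 .|. fromIntegral (u8 bs (i+1))

u32be :: B.ByteString -> Int -> Word32
u32be bs i = foldl (\acc k -> acc `shiftL` 8 .|. fromIntegral (u8 bs (i+k))) 0 [0..3]

-- Parse the root table: check the magic, read the version, follow the
-- pointer to the sub table.  "Jumping back" is trivial because we never
-- moved: every read is addressed by an absolute offset.
parseRoot :: B.ByteString -> RootTable
parseRoot bs
  | u8 bs 0 /= 0x50 = error "bad magic"
  | otherwise =
      let ver = u16be bs 1
          ptr = fromIntegral (u32be bs 3)   -- position of SubTable in the file
      in RootTable ver (parseSub bs ptr)

-- Parse a sub table starting at the given absolute offset.
parseSub :: B.ByteString -> Int -> SubTable
parseSub bs off =
  SubTable (u8 bs off)
           (B.takeWhile (/= 0) (B.drop (off + 1) bs))   -- NUL-terminated string
```

Each table parser takes the whole file plus an offset, so following a pointer is just a function call and the "cursor" never needs to be restored.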

Hi David

Ah ha - this form of binary file layout is quite common (e.g. PECOFF object files and OpenType / TrueType fonts).

Parsec and other parsing libraries are perhaps not ideal for the task, as they consume input as they parse. I have my own alternative to Parsec - Kangaroo [1] - for parsing binary files. It moves a cursor around inside the file (strictly speaking, an array in memory from reading the file), so you can parse within a sub-region of the file and jump back out again.

Although it's on Hackage, I wouldn't really recommend its use - it's now fairly well documented, but the API is not stable and I only work on it sporadically. Because I didn't want any dependencies, the package is quite a bit larger than it needs to be - if someone were interested in the technique, they might be better off using it as a starting point. The most important bits are the 'intraparse' function and the monadic machinery inside the Kangaroo.ParseMonad module.

Even when a binary format has a published standard, unfortunately the standard might not be detailed enough to actually produce a parser. This is the case for TrueType and PECOFF, which I wrote Kangaroo for, and as I don't have much enthusiasm for deriving a parser from another open-source implementation, this has rather stalled any continued development of Kangaroo.

[1] http://hackage.haskell.org/package/kangaroo

Best wishes
Stephen

Yep, that's what I had in mind. Thanks for the link.

On Sun, 2010-03-14 at 19:09 +0000, Stephen Tetley wrote:
Hi David
Ah ha - this form of binary file layout is quite common (e.g. PECOFF object files and OpenType / TrueType fonts).
Parsec and other parsing libraries are perhaps not ideal for the task, as they consume input as they parse. I have my own alternative to Parsec - Kangaroo [1] - for parsing binary files. It moves a cursor around inside the file (strictly speaking, an array in memory from reading the file), so you can parse within a sub-region of the file and jump back out again.
Although it's on Hackage, I wouldn't really recommend its use - it's now fairly well documented, but the API is not stable and I only work on it sporadically. Because I didn't want any dependencies, the package is quite a bit larger than it needs to be - if someone were interested in the technique, they might be better off using it as a starting point. The most important bits are the 'intraparse' function and the monadic machinery inside the Kangaroo.ParseMonad module.
Even when a binary format has a published standard, unfortunately the standard might not be detailed enough to actually produce a parser. This is the case for TrueType and PECOFF, which I wrote Kangaroo for, and as I don't have much enthusiasm for deriving a parser from another open-source implementation, this has rather stalled any continued development of Kangaroo.
[1] http://hackage.haskell.org/package/kangaroo
Best wishes
Stephen

Here are some questions you have not answered that are quite important for your design:

1) How important is retaining shared references? In particular, is the structure mutable after being read in to memory? If it is, and you mutate an object, do you expect other references to that object to be mutated as well?

2) If the answer to (1) is "not important", then how important is parsing performance? If you parse the same object more than once (turning the directed graph into a tree), is that a problem?

This is an interesting problem and of course there are a lot of ways to tackle it, but I think you need to elaborate a bit more on what is needed. If you want to maintain the "object graph" structure, you'll treat the problem quite a bit differently than if you are happy to convert it to a pure data structure.
-- ryan
On Sun, Mar 14, 2010 at 9:03 AM, david fries wrote:
Hello Café
Some time ago I wrote a parser for a project of one of our customers. The format was proprietary and binary. The data was structured as a tree, with tables pointing to sub-tables farther in the file. (Well, actually there were one or two cases where branches joined together, so I guess it was a directed graph.) It also had no defined endianness: some tables were big-endian, others were little-endian, and some were in their very own in-house endianness. All in all, the typical binary data structure that has been in use and continuously extended for the last 15 years. You know the kind.
Oddly enough, our customer never bothered to write a parser of their own. I wonder why.
My parser was written in C# and wasn't particularly elegant, but it worked reliably. I was wondering how you would parse tree-like structures with Parsec (or other functional parsers)? Up to now, all examples I've seen were of sequential data. To parse trees, you'd essentially need to be able to follow pointers, parse whatever is there, and then jump back. I guess I'd have to mess around with the internal state of the parser to do that, which is something I'd rather avoid.
regards, dave

Hi Ryan,

1) Retaining shared references was not important to us at the time. We were only interested in the correctness of the values that were written.

2) Performance wasn't a big issue either. The parser was part of a debugging tool we wrote for internal use.

My concern was how you would perform random access in a functional parser. Your points are interesting too. I guess if we really had wanted to work with parsed objects, retaining the shared references would have been a must.

On Mon, 2010-03-15 at 14:57 -0700, Ryan Ingram wrote:
Here are some questions you have not answered that are quite important for your design:
1) How important is retaining shared references? In particular, is the structure mutable after being read-in to memory? If it is, and you mutate an object, do you expect other references to that object to be mutated as well?
2) If the answer to (1) is "not important", then how important is parsing performance? If you parse the same object more than once (turning the directed graph into a tree), is that a problem?
This is an interesting problem and of course there are a lot of ways to tackle it, but I think you need to elaborate a bit more on what is needed. If you want to maintain the "object graph" structure, you'll treat the problem quite a bit differently than if you are happy to convert it to a pure data structure.
-- ryan
On Sun, Mar 14, 2010 at 9:03 AM, david fries
wrote: Hello Café
Some time ago I wrote a parser for a project of one of our customers. The format was proprietary and binary. The data was structured as a tree, with tables pointing to sub-tables farther in the file. (Well, actually there were one or two cases where branches joined together, so I guess it was a directed graph.) It also had no defined endianness: some tables were big-endian, others were little-endian, and some were in their very own in-house endianness. All in all, the typical binary data structure that has been in use and continuously extended for the last 15 years. You know the kind.
Oddly enough, our customer never bothered to write a parser of their own. I wonder why.
My parser was written in C# and wasn't particularly elegant, but it worked reliably. I was wondering how you would parse tree-like structures with Parsec (or other functional parsers)? Up to now, all examples I've seen were of sequential data. To parse trees, you'd essentially need to be able to follow pointers, parse whatever is there, and then jump back. I guess I'd have to mess around with the internal state of the parser to do that, which is something I'd rather avoid.
regards, dave

david fries wrote:
My concern was how you would perform random access in a functional parser. Your points are interesting too. I guess if we really had wanted to work with parsed objects, retaining the shared references would have been a must.
Out of curiosity, was there *really* a need for random access? It sounds like you could've written things to do it in one forward pass, keeping a table on the side of the targets for dangling pointers, and then back-patching once you've reached the object being pointed at. You may even be able to use laziness to do the back-patching, instead of needing STRefs.

The only problem I could see with that approach is if there was funny bit-packing going on, so that objects overlapped in the same memory region.

-- Live well, ~wren
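Wren's "laziness instead of STRefs" suggestion is the classic knot-tying trick. A minimal sketch of one way to read it (the Raw/Node types and the link function are invented for illustration, not anyone's actual code): do the single forward pass collecting raw records keyed by file offset, then patch every pointer by looking it up lazily in the very map being built.

```haskell
import qualified Data.Map as M   -- Data.Map is value-lazy, which the knot relies on

-- Raw records as they come off the file: children are still offsets.
data Raw = Raw { rawName :: String, rawKids :: [Int] }

-- Resolved nodes: children are real references, so sharing is preserved.
data Node = Node { name :: String, kids :: [Node] }

-- One pass over (offset, raw record) pairs.  Pointers are "back-patched"
-- by looking them up in 'resolved' - the very map this definition builds.
-- Laziness ties the knot; no STRefs or second pass needed, and two tables
-- pointing at the same offset end up sharing one Node.
link :: [(Int, Raw)] -> M.Map Int Node
link raws = resolved
  where
    resolved = M.fromList
      [ (off, Node (rawName r) [resolved M.! k | k <- rawKids r])
      | (off, r) <- raws ]
```

The map's keys and spine depend only on the raw records, so building it terminates; only the node values are lazy, which is what lets forward (and even cyclic) pointers resolve.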

hi wren

Where I've used it, random access does seem conceptually more satisfactory than trying to avoid it. Well-designed binary formats are deterministic - wherever you are inside the file, you should know what you are parsing. One example of this determinism is that "local" alternatives are generally encoded with a single tag, whereas in a parser for a programming language, parsing alternatives might require lookahead and possibly other disambiguation. Another example is that formats will often have an "index table" at the start of the file, giving the start position and length of all the sub tables - this saves you from having to start each table with a tag and length.

For complex formats, e.g. PECOFF or TrueType, you might only want to parse certain tables [*]. After parsing the index table you could build a list of parsers to run sequentially on the body of the file (including parsers that just drop unwanted tables), but this seems too much work (and too much to go wrong) when I can use a function almost as simple as parseAt for the tables I'm interested in:

parseAt :: Start -> End -> Parser a -> Parser a

In practice, I made parseAt slightly more complicated, so it could encode where the cursor is moved to at the end of the parse.

Best wishes
Stephen

[*] Certain tables in TrueType / OpenType are proprietary - it might be unwise for an open source parser to even include parsing operations for those tables.
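Stephen only gives parseAt's type, so here is a guess at the technique behind it - a minimal cursor-moving parser in the spirit of Kangaroo's 'intraparse', not its actual implementation. The whole input stays in memory and parseAt runs a sub-parser at an absolute position, then restores the cursor (the End argument is ignored here; a real version would use it for bounds checking and for deciding where the cursor lands afterwards).

```haskell
import qualified Data.ByteString as B
import Data.Word (Word8)

-- A parser is a function from the whole input and a cursor position to a
-- result and a new cursor position.
newtype Parser a = Parser { runP :: B.ByteString -> Int -> Either String (a, Int) }

instance Functor Parser where
  fmap f p = Parser $ \bs i -> fmap (\(a, j) -> (f a, j)) (runP p bs i)

instance Applicative Parser where
  pure a = Parser $ \_ i -> Right (a, i)
  pf <*> pa = Parser $ \bs i -> do
    (f, j) <- runP pf bs i
    (a, k) <- runP pa bs j
    Right (f a, k)

instance Monad Parser where
  p >>= f = Parser $ \bs i -> do
    (a, j) <- runP p bs i
    runP (f a) bs j

-- Read one byte and advance the cursor.
byte :: Parser Word8
byte = Parser $ \bs i ->
  if i < B.length bs then Right (B.index bs i, i + 1)
                     else Left "end of input"

-- Run a sub-parser over an absolute region of the file, then restore the
-- cursor to where it was - the "jump in and jump back out" operation.
parseAt :: Int -> Int -> Parser a -> Parser a
parseAt start _end p = Parser $ \bs i -> do
  (a, _) <- runP p bs start   -- jump to the sub table and parse it
  Right (a, i)                -- jump back to the caller's position

runParser :: Parser a -> B.ByteString -> Either String a
runParser p bs = fst <$> runP p bs 0
```

For example, `(,) <$> byte <*> parseAt 2 3 byte` reads byte 0, then the byte at offset 2, and leaves the cursor at position 1 for whatever parses next.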

Stephen Tetley wrote:
hi wren
Where I've used it, random access does seem conceptually more satisfactory than trying to avoid it.
I was thinking more about performance issues (avoiding disk seeks), which would also alleviate the problem of needing random access when it's not available.
For complex formats e.g. PECOFF or TrueType, you might only want to parse certain tables [*]. After parsing the index table you could build a list of parsers to run sequentially on the body of the file (including parsers that just drop unwanted tables), but this seems too much work (and too much to go wrong)
Even still, you could linearize the access in order to minimize seeking back and forth. The linearization could even be done automatically, by having the table of what needs back-patching also serve as a work queue (when done with a block, seek to the closest next block; when the queue is empty, you're done). The back-patching queue also preserves structure sharing.

Perhaps I've just been parsing too many multigigabyte files lately... :)

-- Live well, ~wren
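Wren's work-queue idea can be sketched with a Data.Set of pending offsets standing in for the back-patching table: always pop the smallest pending offset, so the file is visited in (mostly) forward order, and a "seen" set ensures each shared block is parsed once. The pointersAt function below is a stand-in for "parse the block at this offset and extract the offsets it points to"; everything here is illustrative, not wren's actual code.

```haskell
import qualified Data.Set as S

-- Visit blocks in file order starting from the root.  'pointersAt' plays
-- the role of parsing a block and returning the offsets it points at;
-- here we just return the order the blocks would be parsed in.
linearize :: (Int -> [Int]) -> Int -> [Int]
linearize pointersAt root = go (S.singleton root) S.empty
  where
    go pending seen = case S.minView pending of
      Nothing -> []                              -- queue empty: done
      Just (off, rest)
        | off `S.member` seen -> go rest seen    -- shared block: parse once
        | otherwise ->
            off : go (foldr S.insert rest (pointersAt off))
                     (S.insert off seen)
```

Because the queue always yields the nearest unvisited offset, seeks only ever go forward past blocks already handled, which is the "minimize seeking back and forth" property wren describes.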

On Sun, Mar 14, 2010 at 9:03 AM, david fries wrote:
Hello Café
Some time ago I wrote a parser for a project of one of our customers. The format was proprietary and binary. The data was structured as a tree, with tables pointing to sub-tables farther in the file. (Well, actually there were one or two cases where branches joined together, so I guess it was a directed graph.) It also had no defined endianness: some tables were big-endian, others were little-endian, and some were in their very own in-house endianness. All in all, the typical binary data structure that has been in use and continuously extended for the last 15 years. You know the kind.
Oddly enough, our customer never bothered to write a parser of their own. I wonder why.
My parser was written in C# and wasn't particularly elegant, but it worked reliably. I was wondering how you would parse tree-like structures with Parsec (or other functional parsers)? Up to now, all examples I've seen were of sequential data. To parse trees, you'd essentially need to be able to follow pointers, parse whatever is there, and then jump back. I guess I'd have to mess around with the internal state of the parser to do that, which is something I'd rather avoid.
Would binary or cereal be right for this task?

http://hackage.haskell.org/packages/archive/binary/0.4.1/doc/html/Data-Binar...
http://hackage.haskell.org/packages/archive/cereal/0.2/doc/html/Data-Seriali...

My understanding is that they provide low-level ways to get at the bytes in a chunk of memory. Using them, I believe you'd read the data into a bytestring and then load in the bytes and seek/jump as needed. I don't know if they support backwards jumping, though. And I think, ideally, you'd use them by defining a type class instance for each object in the binary format, and then use the provided type classes to just get/put your format.

If the representation is very much like a C struct, you might consider Storable. Although, I've never heard of anyone using it as a parser.

Jason
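On Jason's question about jumping in binary's Get monad: forward jumps can be built from lookAhead (which runs a sub-parser and then restores the position) plus skip, as sketched below. The caveat matches his suspicion about backwards jumping: this only works when the target lies at or ahead of the current position, since Get consumes its input and cannot seek backwards. The parseAtAhead and demo names are mine, not part of the library.

```haskell
import Data.Binary.Get (Get, runGet, getWord8, bytesRead, skip, lookAhead)
import qualified Data.ByteString.Lazy as L
import Data.Int (Int64)
import Data.Word (Word8)

-- Run a sub-parser at an absolute offset AHEAD of the cursor, then
-- restore the cursor: lookAhead rewinds after p has run, and skip moves
-- us from the current position to the target first.
parseAtAhead :: Int64 -> Get a -> Get a
parseAtAhead absOff p = do
  here <- bytesRead
  lookAhead (skip (fromIntegral (absOff - here)) >> p)

-- Tiny demo in the shape of the thread's example: a header byte, a
-- one-byte "pointer" field, then more fields after the jump.
demo :: Get (Word8, Word8, Word8)
demo = do
  hdr    <- getWord8                                  -- byte 0
  ptr    <- getWord8                                  -- byte 1: pointer
  target <- parseAtAhead (fromIntegral ptr) getWord8  -- parse at the target
  next   <- getWord8                                  -- byte 2: cursor restored
  return (hdr, target, next)
```

For truly random (backwards) access, the offset-passing or cursor-parser approaches earlier in the thread fit better; Get is at its best for strictly forward decoding.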
participants (5)
- david fries
- Jason Dagit
- Ryan Ingram
- Stephen Tetley
- wren ng thornton