Regarding Yhc bytecode versioning

Yhc hackers,

Several weeks ago I received a report that my bytecode library for Yhc was not working correctly. I investigated the matter and discovered several reasons for the problem. I've fixed the bugs and released a new version of the library. The problems were:

* Idiocy on my part. I somehow managed to get the minor version of the Yhc bytecode set wrong. I thought it was 9, but it actually is 10.
* Compatibility-breaking changes to the bytecode file format.

The second problem is the one I wish to discuss. Since I began work on the Yhc bytecode library in May, there have been at least two instances of compatibility-breaking changes. One relates to Hat integration, I believe, and the other has to do with the switch to libFFI. Both of these changes were made without bumping the version number that appears in the file header.

I would like to suggest that such changes be avoided in the future. It seems to me that Yhc is fairly rapidly approaching a feature-complete release, and I think we should start thinking pretty seriously about stability issues. If it becomes necessary to modify the file format, then I feel that we should be careful to document the changes and be sure to bump the version number. That way we can rely on the stated version number to identify the proper parsing procedure for a bytecode file. Without this basic guarantee, it becomes very difficult to achieve interoperability.

It may also be a good time to think about ways to future-proof the file format so that future additions can be made without breaking compatibility. Right now the format is quite fragile. Perhaps we could take inspiration from the Java classfile format. The basic idea is that there are named blocks of data with a minimal header which gives the name of the block and the size of the data payload. The name of the block defines the meaning of the data. E.g., the 'CODE' block contains bytecode instructions, etc. If any block is encountered with an unrecognized name, it is ignored. That way, one can have optional blocks, or one can add blocks without breaking compatibility. One can also have optional information (like debugging symbols) and things of that nature.

What do you think?

Rob Dockins

Speak softly and drive a Sherman tank.
Laugh hard; it's a long way to the bank.
          -- TMBG
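
The named-block idea proposed above can be sketched in a few lines. This is only an illustration under stated assumptions: the on-disk layout (a length-prefixed name followed by a 4-byte payload size), the block names, and the functions `write_block` and `read_blocks` are all hypothetical and are not taken from the Yhc or Java formats. The point it shows is that a reader can step over a block it does not recognize, because the header records the payload size.

```python
import io
import struct

# Hypothetical layout for one block (an assumption, not the Yhc format):
#   2-byte big-endian name length, UTF-8 name,
#   4-byte big-endian payload length, payload bytes.

def write_block(out, name, payload):
    """Append one named block to a binary stream."""
    encoded = name.encode("utf-8")
    out.write(struct.pack(">H", len(encoded)))
    out.write(encoded)
    out.write(struct.pack(">I", len(payload)))
    out.write(payload)

def _read_exact(stream, count):
    """Read exactly count bytes or fail; a short read means a damaged file."""
    data = stream.read(count)
    if len(data) != count:
        raise ValueError("truncated block")
    return data

def read_blocks(stream, handlers):
    """Parse blocks until end of stream.

    Blocks whose name has a handler are decoded; all others are skipped
    using the size recorded in their header.
    """
    results = {}
    skipped = []
    while True:
        first = stream.read(2)
        if not first:
            break  # clean end of stream
        if len(first) != 2:
            raise ValueError("truncated block header")
        (name_len,) = struct.unpack(">H", first)
        name = _read_exact(stream, name_len).decode("utf-8")
        (size,) = struct.unpack(">I", _read_exact(stream, 4))
        payload = _read_exact(stream, size)
        handler = handlers.get(name)
        if handler is None:
            skipped.append(name)  # unknown block: ignore it, keep going
        else:
            results[name] = handler(payload)
    return results, skipped

# A file written by a "newer" tool that adds a DEBUG block,
# read by an "older" tool that only knows about CODE.
buf = io.BytesIO()
write_block(buf, "CODE", bytes([0x01, 0x02, 0x03]))
write_block(buf, "DEBUG", b"line table from a newer compiler")
buf.seek(0)
decoded, ignored = read_blocks(buf, {"CODE": list})
print(decoded)  # {'CODE': [1, 2, 3]}
print(ignored)  # ['DEBUG']
```

Skipping only works for blocks that are genuinely optional; a change to the meaning of an existing block would still need a version bump.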

Hi Robert,
* Compatibility-breaking changes to the bytecode file format.
The second problem is the one I wish to discuss. Since I began work on the Yhc bytecode library in May, there have been at least two instances of compatibility-breaking changes. One relates to Hat integration, I believe, and the other has to do with the switch to libFFI.
Both of these changes were made without bumping the version number that appears in the file header. I would like to suggest that such changes be avoided in the future.
While there is still work ongoing, I think it's unfortunate but a definite reality that things will have to be broken in binary file formats - we want to put the .hi information in .hbc files, and that will require breaking. We want to do a linking pass to merge multiple .hbc's into one - that will require breaking too. However, you're entirely right: any change that breaks anything from now on needs a version bump.
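
A minimal sketch of what "the version number identifies the parsing procedure" could look like in a reader. Everything here is assumed for illustration: the magic bytes `DEMO`, the header layout, the major version 1, and the parser functions are hypothetical and do not describe the real .hbc header. Only the minor versions 9 and 10 echo numbers mentioned in the thread.

```python
import struct

# Hypothetical header (an assumption, not the real Yhc layout):
#   4-byte magic, 2-byte major version, 2-byte minor version, big-endian.
MAGIC = b"DEMO"
HEADER = struct.Struct(">4sHH")

def parse_v1_9(body):
    """Stand-in for the parser of the older layout."""
    return {"layout": "1.9", "body": body}

def parse_v1_10(body):
    """Stand-in for the parser of the newer layout."""
    return {"layout": "1.10", "body": body}

# One parser per published (major, minor) pair. A format change without a
# version bump would send a file to the wrong entry in this table.
PARSERS = {(1, 9): parse_v1_9, (1, 10): parse_v1_10}

def load(data):
    """Pick a parser from the stated version, or refuse the file."""
    if len(data) < HEADER.size:
        raise ValueError("file too short for header")
    magic, major, minor = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("not a bytecode file")
    parser = PARSERS.get((major, minor))
    if parser is None:
        raise ValueError(f"unsupported bytecode version {major}.{minor}")
    return parser(data[HEADER.size:])

sample = HEADER.pack(MAGIC, 1, 10) + b"\x00\x01"
print(load(sample)["layout"])  # 1.10
```

Failing loudly on an unknown version is the design choice that matters: a reader that guesses will silently misparse files written after an unannounced format change.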
It may also be a good time to think about ways to future-proof the file format so that future additions can be made without breaking compatibility. Right now the format is quite fragile. Perhaps we could take inspiration from the Java classfile format. The basic idea is that there are named blocks of data with a minimal header which gives the name of the block and the size of the data payload. The name of the block defines the meaning of the data. E.g., the 'CODE' block contains bytecode instructions, etc. If any block is encountered with an unrecognized name, it is ignored. That way, one can have optional blocks, or one can add blocks without breaking compatibility. One can also have optional information (like debugging symbols) and things of that nature.
That was always the intention. I'm hoping that once we move to having Yhc.ByteCode handle everything, we can treat that as an abstraction over the file format, and then we can work on defining a new .hbc file format designed to last a very long time without changes. Tom and I brainstormed the "perfect" .hbc file format a while ago, but unfortunately we've never had time to implement or document it...

As a side note, all these issues apply equally to .ycr files, which 3 projects are now actively using. For those I have defined a Haskell interface which is the only supported way of getting at the data (that can't be done for the .hbc files, as the C code needs access to them). I am also very aggressively bumping the version number - the tiniest change gets a new version. I am also ignoring backwards compatibility: at every version bump I just ignore all old files.

Thanks

Neil

On 10/29/06, Robert Dockins wrote:
It may also be a good time to think about ways to future-proof the file format so that future additions can be made without breaking compatibility. Right now the format is quite fragile. Perhaps we could take inspiration from the Java classfile format. The basic idea is that there are named blocks of data with a minimal header which gives the name of the block and the size of the data payload. The name of the block defines the meaning of the data. E.g., the 'CODE' block contains bytecode instructions, etc. If any block is encountered with an unrecognized name, it is ignored. That way, one can have optional blocks, or one can add blocks without breaking compatibility. One can also have optional information (like debugging symbols) and things of that nature.
Do they all have four-character names in the Java classfile format? Like in PNG or IFF or RIFF (the basis of the horrid AVI container format, among other things)?

On Oct 30, 2006, at 6:54 AM, Samuel Bronson wrote:
On 10/29/06, Robert Dockins wrote:
It may also be a good time to think about ways to future-proof the file format so that future additions can be made without breaking compatibility. Right now the format is quite fragile. Perhaps we could take inspiration from the Java classfile format. The basic idea is that there are named blocks of data with a minimal header which gives the name of the block and the size of the data payload. The name of the block defines the meaning of the data. E.g., the 'CODE' block contains bytecode instructions, etc. If any block is encountered with an unrecognized name, it is ignored. That way, one can have optional blocks, or one can add blocks without breaking compatibility. One can also have optional information (like debugging symbols) and things of that nature.
Do they all have four-character names in the Java classfile format?
Um... I'd have to look it up (it's been a while since I worked with this), but I'm pretty sure it just references the string table and that the names can be arbitrary strings. At any rate, that's certainly how I would do it. I'd probably suggest that something like URIs be used for block names if this format is eventually adopted. URIs give a nice wide namespace, and portions of it can be parceled out pretty easily.
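
For reference, the classfile format does work roughly as recalled here: each attribute starts with a two-byte index into the constant pool that names it, followed by a four-byte length, so names are arbitrary strings rather than four-character codes. A rough sketch of that arrangement with URI block names follows. The table layout, the `http://example.org/...` names, and all function names are hypothetical, chosen only to illustrate the idea, and none of this is an existing Yhc format.

```python
import struct

def encode_string_table(names):
    """Serialize names as: 2-byte count, then (2-byte length, UTF-8 bytes) each."""
    out = [struct.pack(">H", len(names))]
    for name in names:
        raw = name.encode("utf-8")
        out.append(struct.pack(">H", len(raw)) + raw)
    return b"".join(out)

def decode_string_table(data):
    """Return (names, number of bytes consumed)."""
    (count,) = struct.unpack_from(">H", data, 0)
    offset = 2
    names = []
    for _ in range(count):
        (length,) = struct.unpack_from(">H", data, offset)
        offset += 2
        names.append(data[offset:offset + length].decode("utf-8"))
        offset += length
    return names, offset

def encode_block(name_index, payload):
    """Block header: 2-byte string table index, 4-byte payload size."""
    return struct.pack(">HI", name_index, len(payload)) + payload

def decode_blocks(data, names):
    """Resolve each block's name through the string table."""
    offset = 0
    blocks = []
    while offset < len(data):
        name_index, size = struct.unpack_from(">HI", data, offset)
        offset += 6  # 2-byte index + 4-byte size
        blocks.append((names[name_index], data[offset:offset + size]))
        offset += size
    return blocks

# Hypothetical URI names: one prefix for the core format, one for a third party.
CORE_PREFIX = "http://example.org/yhc/"
names = [CORE_PREFIX + "code", "http://example.org/vendor/profile"]
image = (encode_string_table(names)
         + encode_block(0, b"\x01\x02")
         + encode_block(1, b"xyz"))

table, used = decode_string_table(image)
blocks = decode_blocks(image[used:], table)
core = [name for name, _ in blocks if name.startswith(CORE_PREFIX)]
print(core)  # ['http://example.org/yhc/code']
```

Because each block header holds only a small index, long URI names cost nothing per block, and a prefix check is enough to tell core blocks from third-party extensions.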
Like in PNG or IFF or RIFF (the basis of the horrid AVI container format, among other things)?
Rob Dockins

Speak softly and drive a Sherman tank.
Laugh hard; it's a long way to the bank.
          -- TMBG
participants (3)
- Neil Mitchell
- Robert Dockins
- Samuel Bronson