(I'm gonna forward this response to the jhc list too, if that's okay) On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
I've been wondering about your use of Ids in Jhc. As far as I can see, the rules for Ids goes as following: * named ids are odd * unnamed ids are even * id '0' has special significance.
Can you tell me a bit about how unnamed ids are used, why id '0' has special meaning, and how you generate new, unique unnamed ids.
Yes. that is correct. named and unnamed ids behave the same way, except named ids get a pretty name when printed out as opposed to just a numebr and are generally derived from a user supplied name. Also, all 'top level' names are always 'named' to ensure they remain globally unique. named Ids are odd, these are looked up in a table that is implemented via a hash table written in c in the StringTable/ directory. due to an intentional quirk of its implementation, it only generates odd, positive values. Conversion between a number and its string form is constant time in both directions. zero is a 'special' id in that it can be treated normally except it can never appear in an expression, only in a binder. so \ v0 -> ... is okay, but \ x -> .. v0 ... is not. This means you can always use 0 as a binder when you know a value is not used without worrying about shadowing a real definition. unnamed ids are arbitrary positive even numbers, there is no particular need to allocate them in a special way, they are _not_ globally unique so you just need to ensure you don't shadow any existing variables when assigning new ones. in general, this consists of having a set of 'in scope' names (which is often something you need to keep track of anyway) and you choose some value that is not in the 'in scope' set and add it to the set. this ensures ids don't shadow each other, but not that they are unique. negative ids are used inside of a few specialized routines internally and should never make it out into the world. for instance, in the unification algorithm they are used to represent alpha-renamed values so we don't need to scan for used names before hand, however the terms returned have their original ids restored. hope this helps... It is quite bothersome to me that 'Id' is just an Int and not a newtype... that is something I have been slowing getting the code ready to fix for a while now.. John -- John Meacham - ⑆repetae.net⑆john⑈
On Feb 14, 2008 12:50 AM, John Meacham
(I'm gonna forward this response to the jhc list too, if that's okay)
On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
I've been wondering about your use of Ids in Jhc. As far as I can see, the rules for Ids goes as following: * named ids are odd * unnamed ids are even * id '0' has special significance.
Can you tell me a bit about how unnamed ids are used, why id '0' has special meaning, and how you generate new, unique unnamed ids.
Yes. that is correct.
named and unnamed ids behave the same way, except named ids get a pretty name when printed out as opposed to just a numebr and are generally derived from a user supplied name. Also, all 'top level' names are always 'named' to ensure they remain globally unique.
named Ids are odd, these are looked up in a table that is implemented via a hash table written in c in the StringTable/ directory. due to an intentional quirk of its implementation, it only generates odd, positive values. Conversion between a number and its string form is constant time in both directions.
zero is a 'special' id in that it can be treated normally except it can never appear in an expression, only in a binder. so \ v0 -> ... is okay, but \ x -> .. v0 ... is not. This means you can always use 0 as a binder when you know a value is not used without worrying about shadowing a real definition.
unnamed ids are arbitrary positive even numbers, there is no particular need to allocate them in a special way, they are _not_ globally unique so you just need to ensure you don't shadow any existing variables when assigning new ones. in general, this consists of having a set of 'in scope' names (which is often something you need to keep track of anyway) and you choose some value that is not in the 'in scope' set and add it to the set. this ensures ids don't shadow each other, but not that they are unique.
negative ids are used inside of a few specialized routines internally and should never make it out into the world. for instance, in the unification algorithm they are used to represent alpha-renamed values so we don't need to scan for used names before hand, however the terms returned have their original ids restored.
hope this helps...
It is quite bothersome to me that 'Id' is just an Int and not a newtype... that is something I have been slowing getting the code ready to fix for a while now..
I'd like to fix it in another way, if you have no objects. By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms are saved ~2344 times each (there's a total of 281338 atoms). Compressing the files with gzip reduces the negative effect on file size but the overhead of parsing 281k atoms is still there. On my system, testing with HelloWorld.hs, parsing those atoms took 16% of the runtime and 27% of the memory usage. I propose scrapping the current system of random atom ids in favor for a per-module map of symbols. Using an ADT for the ids would make the system more transparent. I'll have to investigate the performance impact of this. -- Cheers, Lemmih
On Feb 14, 2008 1:23 AM, Lemmih
On Feb 14, 2008 12:50 AM, John Meacham
wrote: (I'm gonna forward this response to the jhc list too, if that's okay)
On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
I've been wondering about your use of Ids in Jhc. As far as I can see, the rules for Ids goes as following: * named ids are odd * unnamed ids are even * id '0' has special significance.
Can you tell me a bit about how unnamed ids are used, why id '0' has special meaning, and how you generate new, unique unnamed ids.
Yes. that is correct.
named and unnamed ids behave the same way, except named ids get a pretty name when printed out as opposed to just a numebr and are generally derived from a user supplied name. Also, all 'top level' names are always 'named' to ensure they remain globally unique.
named Ids are odd, these are looked up in a table that is implemented via a hash table written in c in the StringTable/ directory. due to an intentional quirk of its implementation, it only generates odd, positive values. Conversion between a number and its string form is constant time in both directions.
zero is a 'special' id in that it can be treated normally except it can never appear in an expression, only in a binder. so \ v0 -> ... is okay, but \ x -> .. v0 ... is not. This means you can always use 0 as a binder when you know a value is not used without worrying about shadowing a real definition.
unnamed ids are arbitrary positive even numbers, there is no particular need to allocate them in a special way, they are _not_ globally unique so you just need to ensure you don't shadow any existing variables when assigning new ones. in general, this consists of having a set of 'in scope' names (which is often something you need to keep track of anyway) and you choose some value that is not in the 'in scope' set and add it to the set. this ensures ids don't shadow each other, but not that they are unique.
negative ids are used inside of a few specialized routines internally and should never make it out into the world. for instance, in the unification algorithm they are used to represent alpha-renamed values so we don't need to scan for used names before hand, however the terms returned have their original ids restored.
hope this helps...
It is quite bothersome to me that 'Id' is just an Int and not a newtype... that is something I have been slowing getting the code ready to fix for a while now..
I'd like to fix it in another way, if you have no objects.
By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms are saved ~2344 times each (there's a total of 281338 atoms).
Oh, I should mention that I'm counting TVr's. Those a apparently mostly module names. -- Cheers, Lemmih
On Feb 14, 2008 1:39 AM, Lemmih
On Feb 14, 2008 1:23 AM, Lemmih
wrote: On Feb 14, 2008 12:50 AM, John Meacham
wrote: (I'm gonna forward this response to the jhc list too, if that's okay)
On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
I've been wondering about your use of Ids in Jhc. As far as I can see, the rules for Ids goes as following: * named ids are odd * unnamed ids are even * id '0' has special significance.
Can you tell me a bit about how unnamed ids are used, why id '0' has special meaning, and how you generate new, unique unnamed ids.
Yes. that is correct.
named and unnamed ids behave the same way, except named ids get a pretty name when printed out as opposed to just a numebr and are generally derived from a user supplied name. Also, all 'top level' names are always 'named' to ensure they remain globally unique.
named Ids are odd, these are looked up in a table that is implemented via a hash table written in c in the StringTable/ directory. due to an intentional quirk of its implementation, it only generates odd, positive values. Conversion between a number and its string form is constant time in both directions.
zero is a 'special' id in that it can be treated normally except it can never appear in an expression, only in a binder. so \ v0 -> ... is okay, but \ x -> .. v0 ... is not. This means you can always use 0 as a binder when you know a value is not used without worrying about shadowing a real definition.
unnamed ids are arbitrary positive even numbers, there is no particular need to allocate them in a special way, they are _not_ globally unique so you just need to ensure you don't shadow any existing variables when assigning new ones. in general, this consists of having a set of 'in scope' names (which is often something you need to keep track of anyway) and you choose some value that is not in the 'in scope' set and add it to the set. this ensures ids don't shadow each other, but not that they are unique.
negative ids are used inside of a few specialized routines internally and should never make it out into the world. for instance, in the unification algorithm they are used to represent alpha-renamed values so we don't need to scan for used names before hand, however the terms returned have their original ids restored.
hope this helps...
It is quite bothersome to me that 'Id' is just an Int and not a newtype... that is something I have been slowing getting the code ready to fix for a while now..
I'd like to fix it in another way, if you have no objects.
By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms are saved ~2344 times each (there's a total of 281338 atoms).
Oh, I should mention that I'm counting TVr's. Those a apparently mostly module names.
A count of atoms that are forced shows a different result: 258155 saved atoms, 7390 unique atoms. Not as bad as the TVr count but still not good. -- Cheers, Lemmih
On Feb 14, 2008 1:49 AM, Lemmih
On Feb 14, 2008 1:39 AM, Lemmih
wrote: On Feb 14, 2008 1:23 AM, Lemmih
wrote: On Feb 14, 2008 12:50 AM, John Meacham
wrote: (I'm gonna forward this response to the jhc list too, if that's okay)
On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
I've been wondering about your use of Ids in Jhc. As far as I can see, the rules for Ids goes as following: * named ids are odd * unnamed ids are even * id '0' has special significance.
Can you tell me a bit about how unnamed ids are used, why id '0' has special meaning, and how you generate new, unique unnamed ids.
Yes. that is correct.
named and unnamed ids behave the same way, except named ids get a pretty name when printed out as opposed to just a numebr and are generally derived from a user supplied name. Also, all 'top level' names are always 'named' to ensure they remain globally unique.
named Ids are odd, these are looked up in a table that is implemented via a hash table written in c in the StringTable/ directory. due to an intentional quirk of its implementation, it only generates odd, positive values. Conversion between a number and its string form is constant time in both directions.
zero is a 'special' id in that it can be treated normally except it can never appear in an expression, only in a binder. so \ v0 -> ... is okay, but \ x -> .. v0 ... is not. This means you can always use 0 as a binder when you know a value is not used without worrying about shadowing a real definition.
unnamed ids are arbitrary positive even numbers, there is no particular need to allocate them in a special way, they are _not_ globally unique so you just need to ensure you don't shadow any existing variables when assigning new ones. in general, this consists of having a set of 'in scope' names (which is often something you need to keep track of anyway) and you choose some value that is not in the 'in scope' set and add it to the set. this ensures ids don't shadow each other, but not that they are unique.
negative ids are used inside of a few specialized routines internally and should never make it out into the world. for instance, in the unification algorithm they are used to represent alpha-renamed values so we don't need to scan for used names before hand, however the terms returned have their original ids restored.
hope this helps...
It is quite bothersome to me that 'Id' is just an Int and not a newtype... that is something I have been slowing getting the code ready to fix for a while now..
I'd like to fix it in another way, if you have no objects.
By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms are saved ~2344 times each (there's a total of 281338 atoms).
Oh, I should mention that I'm counting TVr's. Those a apparently mostly module names.
A count of atoms that are forced shows a different result: 258155 saved atoms, 7390 unique atoms. Not as bad as the TVr count but still not good.
The worst case scenario is: 990666 saved atoms, 9980 unique atoms. -- Cheers, Lemmih
On Thu, Feb 14, 2008 at 01:23:08AM +0100, Lemmih wrote:
I'd like to fix it in another way, if you have no objects.
By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms are saved ~2344 times each (there's a total of 281338 atoms). Compressing the files with gzip reduces the negative effect on file size but the overhead of parsing 281k atoms is still there. On my system, testing with HelloWorld.hs, parsing those atoms took 16% of the runtime and 27% of the memory usage.
I propose scrapping the current system of random atom ids in favor for a per-module map of symbols. Using an ADT for the ids would make the system more transparent. I'll have to investigate the performance impact of this.
What do you mean? just for saving binary files? it already converts them to an ADT on the way out and back in since atom numbers can be different between different runs of the program. If you mean using an ADT internally for ids, stopping doing that is what brought memory usage from gigabytes to megabytes. The text of the atoms only needs to be stored in a single place, also the fact they are int's means they can be unboxed in datatypes, meanding the garbage collector does not even have to care about them, another huge gain. Also, IntSet/IntMap are orders of magnitude faster than Set and Map and sets/maps of ids are the bed and butter of optimization passes. my only issue with them is that it is type Id = Int instead of newtype Id = Id Int I have been working on the binary format with my recent changes, it is now much less of an issue than it used to be in terms of speed and memory usage. I modified it to use bytestrings and split the file into various 'chunks' that can be lazily loaded independently. so when it only needs the dependency information that is all it needs to pull out. and so forth. I also put a hard limit on the length of atoms to 256 bytes, which means I can chop 7 bytes off the storage of each one. By far the biggest win was in Info/Info, which used to store the type of each field alongside of it as a string, now it stores a hash of the type as a number, this reduced the file size (after compression) by about 30%. John -- John Meacham - ⑆repetae.net⑆john⑈
On Thu, Feb 14, 2008 at 4:36 AM, John Meacham
On Thu, Feb 14, 2008 at 01:23:08AM +0100, Lemmih wrote:
I'd like to fix it in another way, if you have no objects.
By my count there are 120 unique atoms in base-1.0.hl. Those 120 atoms are saved ~2344 times each (there's a total of 281338 atoms). Compressing the files with gzip reduces the negative effect on file size but the overhead of parsing 281k atoms is still there. On my system, testing with HelloWorld.hs, parsing those atoms took 16% of the runtime and 27% of the memory usage.
I propose scrapping the current system of random atom ids in favor for a per-module map of symbols. Using an ADT for the ids would make the system more transparent. I'll have to investigate the performance impact of this.
What do you mean? just for saving binary files? it already converts them to an ADT on the way out and back in since atom numbers can be different between different runs of the program.
I mean something like: data Id = Phantom | Unnamed Int | Named Atom Giving special meaning to numbers seems like a hack. Optimizations should not come at the sacrifice of readability. Indexing atoms by a random number is also something that can and should be avoided.
If you mean using an ADT internally for ids, stopping doing that is what brought memory usage from gigabytes to megabytes. The text of the atoms only needs to be stored in a single place, also the fact they are int's means they can be unboxed in datatypes, meanding the garbage collector does not even have to care about them, another huge gain.
Only keeping a single instance of each unique string in memory is obviously the way to go. I'm not arguing against that. However, given that there are only a total of ~9000 unique strings, unboxing will most likely mean nothing.
Also, IntSet/IntMap are orders of magnitude faster than Set and Map and sets/maps of ids are the bed and butter of optimization passes.
IntSet and IntMap are truly very fast. However, how that maps into CPU time and memory usage is not known. Readability may have been sacrificed for no more than a few percents improvement.
my only issue with them is that it is
type Id = Int instead of newtype Id = Id Int
I have been working on the binary format with my recent changes, it is now much less of an issue than it used to be in terms of speed and memory usage. I modified it to use bytestrings and split the file into various 'chunks' that can be lazily loaded independently. so when it only needs the dependency information that is all it needs to pull out. and so forth. I also put a hard limit on the length of atoms to 256 bytes, which means I can chop 7 bytes off the storage of each one.
Shaving 7 bytes off seems weird when each atom is stored 100 times on disk. Also, using a single byte to store length seems an awful like only storing the last two digits in the year. I'm more interested in algorithm optimizations. Optimizations that make the code easier to use and read, and thereby avoiding silly things like storing duplicate atoms on disk.
By far the biggest win was in Info/Info, which used to store the type of each field alongside of it as a string, now it stores a hash of the type as a number, this reduced the file size (after compression) by about 30%.
Info seems a bit scary to me. It is not garbage collected, so it is up to the user to assure that references are released as early as possible. Btw, I'm trying to offer constructive criticism. I'm interested in actually writing code, not just pointing out weak spots. -- Cheers, Lemmih
On 2/14/08, Lemmih
I mean something like: data Id = Phantom | Unnamed Int | Named Atom Giving special meaning to numbers seems like a hack. Optimizations should not come at the sacrifice of readability.
It's not a hack if you use a newtype... at least, not observably. Yhc didn't, and it did do some really nasty stuff. (Possibly nhc predated newtypes?)
On Thu, Feb 14, 2008 at 8:01 PM, Samuel Bronson
On 2/14/08, Lemmih
wrote: I mean something like: data Id = Phantom | Unnamed Int | Named Atom Giving special meaning to numbers seems like a hack. Optimizations should not come at the sacrifice of readability.
It's not a hack if you use a newtype... at least, not observably. Yhc didn't, and it did do some really nasty stuff. (Possibly nhc predated newtypes?)
If you hide the implementation then you might as well provide the user with an ADT instead of an Int. The optimizations in Jhc seems to be geared towards making things more "basic". Data types are unrolled so they can fit in an integer, Haskell code is replaced by C code. These optimizations give the illusion of improvements at the cost of readability. Rewriting a piece of Haskell code in C is rather silly when it is the algorithm that's broken. -- Cheers, Lemmih
On Thu, Feb 14, 2008 at 10:10:54PM +0100, Lemmih wrote:
On Thu, Feb 14, 2008 at 8:01 PM, Samuel Bronson
wrote: On 2/14/08, Lemmih
wrote: I mean something like: data Id = Phantom | Unnamed Int | Named Atom Giving special meaning to numbers seems like a hack. Optimizations should not come at the sacrifice of readability.
It's not a hack if you use a newtype... at least, not observably. Yhc didn't, and it did do some really nasty stuff. (Possibly nhc predated newtypes?)
If you hide the implementation then you might as well provide the user with an ADT instead of an Int.
The newtype Int _is_ an ADT. as in, all that will be exported from Name.Id is Id(). Haskell's ability for abstraction is one of its great qualities. :) Exposing it abstractly would be something I'd want to do no matter how it was implemented, The internal representation doesn't matter as long as the API is clean and APIs are what I care about.
The optimizations in Jhc seems to be geared towards making things more "basic". Data types are unrolled so they can fit in an integer, Haskell code is replaced by C code. These optimizations give the illusion of improvements at the cost of readability. Rewriting a piece of Haskell code in C is rather silly when it is the algorithm that's broken.
It is about providing a strong foundation with good abstract APIs. Hopefully it will get to the point where these little things do make a difference. I don't find it hurting readability really. or if it does, that is more of an issue of documentation. like having to look inside how ids are represented means that I didn't document how they behave externally enough rather than the representation needs to change. John -- John Meacham - ⑆repetae.net⑆john⑈
On Feb 14, 2008 12:50 AM, John Meacham
(I'm gonna forward this response to the jhc list too, if that's okay) On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
I've been wondering about your use of Ids in Jhc. As far as I can see, the rules for Ids goes as following: * named ids are odd * unnamed ids are even * id '0' has special significance.
Can you tell me a bit about how unnamed ids are used, why id '0' has special meaning, and how you generate new, unique unnamed ids.
[snip]
named Ids are odd, these are looked up in a table that is implemented via a hash table written in c in the StringTable/ directory. due to an intentional quirk of its implementation, it only generates odd, positive values. Conversion between a number and its string form is constant time in both directions.
It looks like it has been replaced by a Haskell implementation. -- Cheers, Lemmih
On Thu, Feb 14, 2008 at 01:29:43AM +0100, Lemmih wrote:
On Feb 14, 2008 12:50 AM, John Meacham
wrote: (I'm gonna forward this response to the jhc list too, if that's okay) On Thu, Feb 14, 2008 at 12:26:49AM +0100, Lemmih wrote:
I've been wondering about your use of Ids in Jhc. As far as I can see, the rules for Ids goes as following: * named ids are odd * unnamed ids are even * id '0' has special significance.
Can you tell me a bit about how unnamed ids are used, why id '0' has special meaning, and how you generate new, unique unnamed ids.
[snip]
named Ids are odd, these are looked up in a table that is implemented via a hash table written in c in the StringTable/ directory. due to an intentional quirk of its implementation, it only generates odd, positive values. Conversion between a number and its string form is constant time in both directions.
It looks like it has been replaced by a Haskell implementation.
No, I just pushed out my changes that actually redid the atom stuff. it is now down to about a fifth of what it used to be. John -- John Meacham - ⑆repetae.net⑆john⑈
participants (3)
-
John Meacham -
Lemmih -
Samuel Bronson