Atoms, Infos and unique ids.

Lemmih

21 Feb 2008 21 Feb '08

7:32 p.m.

Hiya, TVr's can be large and are duplicated many times over. We avoid duplicating Info and type information by clearing all variables and restoring them from their binding point. Atoms are very similar to Infos in this regard: they're large enough that we don't want to keep multiple copies around. However, atoms are currently dealt with in an unsafe (see 'getMap') and inefficient manner (10mb (40%) in base-1.0.hl is wasted on duplicate strings). It is my understanding that atoms could be stored exactly like Infos. I tried to implement this but, unfortunately, atoms are frequently being relied on for unique ids. Some cases are easy to fix, others less so. John, do you have time to document the intended behavior in the difficult cases? -- Cheers, Lemmih

Show replies by date

John Meacham

21 Feb 21 Feb

8:09 p.m.

On Thu, Feb 21, 2008 at 08:32:47PM +0100, Lemmih wrote:

...

I tried to implement this but, unfortunately, atoms are frequently being relied on for unique ids. Some cases are easy to fix, others less so. John, do you have time to document the intended behavior in the difficult cases?

I am not sure what you mean by difficult cases or what you are trying to fix, atoms are exactly an implementation of the standard atom type in computing as used in prolog, lisp, and the X11 protocol among other things. Uninterpreted strings with a very fast identity operation. Things got a lot better in terms of space once I made them a custom binary implementation, that saved 7 bytes an atom which is often as long as the string itself. In any case, I would want any solution to be completely independent of the fact that atoms are being used as identifiers in a programming intermediate langauge. My Atom type is a generally useful library I use other places. I would think this would involve a custom Binary monad that distributed and collected an 'atom environment' of sorts that would then be stored in a different chunk in the file, then whenever one wanted to store/retrieve an atom, they would just add an index into the table. John -- John Meacham - ⑆repetae.net⑆john⑈

Lemmih

9:03 p.m.

On Thu, Feb 21, 2008 at 9:09 PM, John Meacham wrote:

...

On Thu, Feb 21, 2008 at 08:32:47PM +0100, Lemmih wrote:

...
I tried to implement this but, unfortunately, atoms are frequently being relied on for unique ids. Some cases are easy to fix, others less so. John, do you have time to document the intended behavior in the difficult cases?

I am not sure what you mean by difficult cases or what you are trying to fix, atoms are exactly an implementation of the standard atom type in computing as used in prolog, lisp, and the X11 protocol among other things. Uninterpreted strings with a very fast identity operation.

Atoms in general are fine but they're not the right tool for this particular job. We rarely use atoms for keys (I have the profiling information to back this up) and, when we do, using (hash,string) is ridiculously fast. The costs of using atoms include: broken properties (get . put = id), inefficiencies and obfuscated code. Given that I have significantly improved the performance of Jhc, I hope it carries some weight when I say that there are NO performance reasons for using atoms, neither in CPU time or memory usage.

...

Things got a lot better in terms of space once I made them a custom binary implementation, that saved 7 bytes an atom which is often as long as the string itself.

That's like giving ice cream to a kid with tuberculosis: It's a a nice thing to do but curing the disease would be better. Those 7 bytes would be completely irrelevant if we only saved unique strings once. There are about 9000 unique strings. A 7 byte overhead would be 63k. We currently waste 10,000k by saving ~100 copies of each unique string.

...

In any case, I would want any solution to be completely independent of the fact that atoms are being used as identifiers in a programming intermediate langauge.

I couldn't agree more. I say, let's take it one step further and assign unique ids to named variables in the same way we do with unnamed variables. Each variable would have a 'Anonymous | Named Name' tag that would be used for pretty-printing.

...

My Atom type is a generally useful library I use other places. I would think this would involve a custom Binary monad that distributed and collected an 'atom environment' of sorts that would then be stored in a different chunk in the file, then whenever one wanted to store/retrieve an atom, they would just add an index into the table.

The only other usage I've found was in Stats.hs. And that's definitely not justified. I was referring to cases like DataConstructors.hs and LambdaLift.hs. In DataConstructor.hs it is trivial to assign a new id since it only has to be locally unique. In LambdaLift.hs, on the other hand, I can't tell whether shadowing is OK, whether reusing the old id is OK, or where to get a set free variables if a unique id is required. Some documentation about what the code tries to do would be very helpful. -- Cheers, Lemmih

John Meacham

10:23 p.m.

On Thu, Feb 21, 2008 at 10:03:08PM +0100, Lemmih wrote:

...

Atoms in general are fine but they're not the right tool for this particular job. We rarely use atoms for keys (I have the profiling information to back this up) and, when we do, using (hash,string) is ridiculously fast. The costs of using atoms include: broken properties (get . put = id), inefficiencies and obfuscated code. Given that I have significantly improved the performance of Jhc, I hope it carries some weight when I say that there are NO performance reasons for using atoms, neither in CPU time or memory usage.

Well, the main win is the unboxed Int# inside of TVr and being able to use IdMaps and being able to locally generate unique names based on current ones without refering to and keeping a global table up to date. (see docs/conventions.txt for information on some of the ways unique names are generate from existing ones) Plus, Atoms feel like the aethetic correct choice to me. Uninterpreted unique strings with fast identity, exactly what I want out of an identifier. Even if they were just newtype Atom = Atom String. (A perfectly valid, if inefficient, implementation)

...

...
Things got a lot better in terms of space once I made them a custom binary implementation, that saved 7 bytes an atom which is often as long as the string itself.

That's like giving ice cream to a kid with tuberculosis: It's a a nice thing to do but curing the disease would be better.

Those 7 bytes would be completely irrelevant if we only saved unique strings once. There are about 9000 unique strings. A 7 byte overhead would be 63k. We currently waste 10,000k by saving ~100 copies of each unique string.

but this seems completely orthogonal to using atoms in general, any system allowing string ids will need to store them somehow.

...

...
In any case, I would want any solution to be completely independent of the fact that atoms are being used as identifiers in a programming intermediate langauge.

I couldn't agree more. I say, let's take it one step further and assign unique ids to named variables in the same way we do with unnamed variables. Each variable would have a 'Anonymous | Named Name' tag that would be used for pretty-printing.

We rarely have a list of all names in scope though. or piping one around would be unwieldy. the 'name's' of named variables have semantic meaning as well (as described in docs/conventions.txt). Plus, I am wary of non-abstract types in general after some bad growing pains.

...

...
My Atom type is a generally useful library I use other places. I would think this would involve a custom Binary monad that distributed and collected an 'atom environment' of sorts that would then be stored in a different chunk in the file, then whenever one wanted to store/retrieve an atom, they would just add an index into the table.

The only other usage I've found was in Stats.hs. And that's definitely not justified.

Again, this is more for aethetics and clear code intent (and incidental performance gains, no matter how small, are always good). Atoms are what I mean so they are what I use.

...

I was referring to cases like DataConstructors.hs and LambdaLift.hs. In DataConstructor.hs it is trivial to assign a new id since it only has to be locally unique.

Yes. If you are refering to the using [2,4 ..] there, this is one of the last places that the Id abstraction escapes. once I clear that up I can finally fully abstract Id. Yay.

...

In LambdaLift.hs, on the other hand, I can't tell whether shadowing is OK, whether reusing the old id is OK, or where to get a set free variables if a unique id is required. Some documentation about what the code tries to do would be very helpful.

I believe the current lambda lifting algorithm requires fully globally unique names for everything. shadowing would not be okay as definitons might be lifted to the same level. it is tricky. I would have to reread it to find out for sure. As with most optimization passes in jhc, it is not my first implementation and I don't expect it to be the last. (for a while I had 4 different strictness analyzers to choose from :) ) At the moment my development tree can't compile anything (without -fno-rules) because I am in the middle of completely rewriting the rules mechanism. (they will no longer be carried in the Info nodes at all) and supercombinators will be logically distinct from let bound values. treating the whole program as a simple mutally recursive let binding is quite elegant, but i have been holding onto that ideal for way to long and have had to go through too many crazy contortions to keep up that unrealistic world view. I already had to have special cases in the type system (top level bindings can have superkind ##, local ones cannot) among other places. So I'd like to hold off on any non localized changes for the time being. In any case, I don't want to do anything with Ids at the moment other than making them more abstract, I am trying to get 0.5 out the door and I am really uncomfortable making multiple fundamental changes at once, especially to something as so very tricky as id naming. If you want crazy obfuscated code in pursuit of performance, look at ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta is actually what I target). avoiding things like explicit unboxed types and try to keep all strangeness encapsulated behind abstract types with well defined interfaces. ghc lobs around Int#'s like candy :) BTW. I also have been working on a full honest to goodness manual for jhc. my patches implementing it are just starting to appear in the repository. The rough idea is you can create comments anywhere (or in separate files) that start with {-@Section and contain markdown formatted text and they will be stitched together to form the manual. It is filling out nicely. John -- John Meacham - ⑆repetae.net⑆john⑈

Lemmih

11:13 p.m.

On Thu, Feb 21, 2008 at 11:23 PM, John Meacham wrote:

...

If you want crazy obfuscated code in pursuit of performance, look at ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta is actually what I target). avoiding things like explicit unboxed types and try to keep all strangeness encapsulated behind abstract types with well defined interfaces. ghc lobs around Int#'s like candy :)

I wanna make a quick response to this segment because I actually feel slightly insulted. I'm trying to get rid of a global mutable state and some C code that segfaults if used incorrectly, and you're accusing me of obfuscating the code? I'm proposing to associate names directly with variables instead of using a magic pointer. It would be the natural thing to do, completely valid Haskell98 code, and several times faster than the current approach. I feel you've made a serious accusation without the evidence to back it up. ): -- Cheers, Lemmih

John Meacham

11:53 p.m.

On Fri, Feb 22, 2008 at 12:13:10AM +0100, Lemmih wrote:

...

On Thu, Feb 21, 2008 at 11:23 PM, John Meacham wrote:

...
If you want crazy obfuscated code in pursuit of performance, look at ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta is actually what I target). avoiding things like explicit unboxed types and try to keep all strangeness encapsulated behind abstract types with well defined interfaces. ghc lobs around Int#'s like candy :)

I wanna make a quick response to this segment because I actually feel slightly insulted. I'm trying to get rid of a global mutable state and some C code that segfaults if used incorrectly, and you're accusing me of obfuscating the code?

Oh, no, I didn't mean to imply that at all. I _really_ appreciate the work you are putting in to making jhc faster. what I meant to say is that I know parts of jhc are _already_ obfuscated, but that as far as haskell compilers go, it's not as bad as it could be. :) As in, I endevour to make sure the APIs are clean and well founded. that is what really matters when it comes to maintainability of code. The Id implementation can be swapped out willy nilly once it is fully abstracted, so worrying about how it actually is implemented seems premature. In the end, once the APIs are stable, it is easy enough to plug in a variety of implementations and just try em all out. That should be true of any component of jhc if I did my design right.

...

I'm proposing to associate names directly with variables instead of using a magic pointer. It would be the natural thing to do, completely valid Haskell98 code, and several times faster than the current approach.

Hmm... are you sure it would be faster? perhaps I don't fully understand what you want to do, but Atoms were darn fast when I was benchmarking, I could have broken them though. I mean, perhaps the speed benefit isn't that useful for jhc... I use the same atom implementation in C projects but I enjoy that in haskell-land I can hide the implementation behind a newtype to make them fully safe. I heart haskell. John -- John Meacham - ⑆repetae.net⑆john⑈

John Meacham

22 Feb 22 Feb

12:21 a.m.

On Thu, Feb 21, 2008 at 03:53:26PM -0800, John Meacham wrote:

...

Oh, no, I didn't mean to imply that at all. I _really_ appreciate the work you are putting in to making jhc faster. what I meant to say is that I know parts of jhc are _already_ obfuscated, but that as far as haskell compilers go, it's not as bad as it could be. :)

Oh, I should also say that I mean no disrespect to the GHC developers, the only reason I can use simple strictness annotations instead of explicit unboxed types is their sweet strictness analyzer, the only reason I can use more readable 'do' syntax rather than `thenM` style monadic code is their sweet implementation of classes. GHC is great code and the papers that came out of its implementation were the inspiration for most parts of jhc. :) John -- John Meacham - ⑆repetae.net⑆john⑈

Lemmih

12:34 a.m.

On Fri, Feb 22, 2008 at 12:53 AM, John Meacham wrote:

...

On Fri, Feb 22, 2008 at 12:13:10AM +0100, Lemmih wrote:

...
On Thu, Feb 21, 2008 at 11:23 PM, John Meacham wrote:

...
If you want crazy obfuscated code in pursuit of performance, look at ghc. at least I try to stay mostly haskell 98 (well, haskell-prime beta is actually what I target). avoiding things like explicit unboxed types and try to keep all strangeness encapsulated behind abstract types with well defined interfaces. ghc lobs around Int#'s like candy :)

I wanna make a quick response to this segment because I actually feel slightly insulted. I'm trying to get rid of a global mutable state and some C code that segfaults if used incorrectly, and you're accusing me of obfuscating the code?

Oh, no, I didn't mean to imply that at all. I _really_ appreciate the work you are putting in to making jhc faster. what I meant to say is that I know parts of jhc are _already_ obfuscated, but that as far as haskell compilers go, it's not as bad as it could be. :)

Ah, okay.

...

As in, I endevour to make sure the APIs are clean and well founded. that is what really matters when it comes to maintainability of code. The Id implementation can be swapped out willy nilly once it is fully abstracted, so worrying about how it actually is implemented seems premature. In the end, once the APIs are stable, it is easy enough to plug in a variety of implementations and just try em all out. That should be true of any component of jhc if I did my design right.

...
I'm proposing to associate names directly with variables instead of using a magic pointer. It would be the natural thing to do, completely valid Haskell98 code, and several times faster than the current approach.

Hmm... are you sure it would be faster? perhaps I don't fully understand what you want to do, but Atoms were darn fast when I was benchmarking, I could have broken them though. I mean, perhaps the speed benefit isn't that useful for jhc... I use the same atom implementation in C projects but I enjoy that in haskell-land I can hide the implementation behind a newtype to make them fully safe. I heart haskell.

The problem is not atoms per se; it's generating ids from a global store. When saving TVr's with associated names, the atom has to be saved instead of just the id. Duplicating few strings may not sound too serious but it does take its toll. It is relatively minor (but still significant) when compiling base-1.0.hl (7% cpu, 8% memory usage). However, it is dominating when compiling smaller pieces of code such as HelloWorld. -- Cheers, Lemmih

John Meacham

12:56 a.m.

On Fri, Feb 22, 2008 at 01:34:32AM +0100, Lemmih wrote:

...

...
Hmm... are you sure it would be faster? perhaps I don't fully understand what you want to do, but Atoms were darn fast when I was benchmarking, I could have broken them though. I mean, perhaps the speed benefit isn't that useful for jhc... I use the same atom implementation in C projects but I enjoy that in haskell-land I can hide the implementation behind a newtype to make them fully safe. I heart haskell.

The problem is not atoms per se; it's generating ids from a global store. When saving TVr's with associated names, the atom has to be saved instead of just the id. Duplicating few strings may not sound too serious but it does take its toll. It is relatively minor (but still significant) when compiling base-1.0.hl (7% cpu, 8% memory usage). However, it is dominating when compiling smaller pieces of code such as HelloWorld.

Oh, yes. I agree this needs to be fixed. but it seems orthogonal to the issue of using atoms in the first place, as in. Any solution that works for something that uses 'String' will work for 'Atom' and vice versa. Something to try might be to rename everything other than top level names to numeric ids right before writing out the ho file. Top level names need to be globally unique, across time and space, as in, things will break if two independent modules chose the same top level name for something and then imported each other, but presumably by the time you write out the ho you have lambda lifted most stuff out and the debugging value of preserving them is less. Though, you have not converted things to supercombinators which will introduce more global names. (I actually should look into whether doing this before writing ho files is useful). In any case. it might be a simple thing to test out. Otherwise I think the idea of a Binary type that maintains an environment would be good, as we might find other uses for it besides jhc identifiers. John -- John Meacham - ⑆repetae.net⑆john⑈

Lemmih

1:23 a.m.

On Fri, Feb 22, 2008 at 1:56 AM, John Meacham wrote:

...

On Fri, Feb 22, 2008 at 01:34:32AM +0100, Lemmih wrote:

...
...
Hmm... are you sure it would be faster? perhaps I don't fully understand what you want to do, but Atoms were darn fast when I was benchmarking, I could have broken them though. I mean, perhaps the speed benefit isn't that useful for jhc... I use the same atom implementation in C projects but I enjoy that in haskell-land I can hide the implementation behind a newtype to make them fully safe. I heart haskell.

The problem is not atoms per se; it's generating ids from a global store. When saving TVr's with associated names, the atom has to be saved instead of just the id. Duplicating few strings may not sound too serious but it does take its toll. It is relatively minor (but still significant) when compiling base-1.0.hl (7% cpu, 8% memory usage). However, it is dominating when compiling smaller pieces of code such as HelloWorld.

Oh, yes. I agree this needs to be fixed. but it seems orthogonal to the issue of using atoms in the first place, as in. Any solution that works for something that uses 'String' will work for 'Atom' and vice versa.

Yes, sorry for not being clear. It's the pointers to atoms, and not atoms themselv, that are causing trouble.

...

Something to try might be to rename everything other than top level names to numeric ids right before writing out the ho file. Top level names need to be globally unique, across time and space, as in, things will break if two independent modules chose the same top level name for something and then imported each other, but presumably by the time you write out the ho you have lambda lifted most stuff out and the debugging value of preserving them is less.

Though, you have not converted things to supercombinators which will introduce more global names. (I actually should look into whether doing this before writing ho files is useful).

In any case. it might be a simple thing to test out.

Otherwise I think the idea of a Binary type that maintains an environment would be good, as we might find other uses for it besides jhc identifiers.

I a fan of assigning locally unique ids as necessary since we already do this on a per-function level. -- Cheers, Lemmih

6526

Age (days ago)

6527

Last active (days ago)

List overview

Download

9 comments

2 participants

participants (2)

John Meacham
Lemmih

Atoms, Infos and unique ids.

tags

participants (2)