CPU with Haskell support

Hi all,

Every now and then I think it would be cool to have a microprocessor that supports Haskell in some way: a processor where lazy evaluation is not overhead but an optimization opportunity, a processor that can exploit the explicit data dependencies in Haskell programs in order to utilize many computation units in parallel. I know of the Reduceron project, but it evolves only slowly, and if it is ever ready for use it is uncertain whether it can compete with stock CPUs, since FPGAs need much more chip area for the same logic.

I have learned that in today's x86 processors you can alter the instruction set, which is mainly used for bug fixes. Wouldn't it be interesting to add some instructions for Haskell support? However, I suspect that such a patch might be rendered invalid by new processor generations with changed internal details. Fortunately, there are processors that are designed for custom instruction set extensions: https://en.wikipedia.org/wiki/Xtensa

Would it be sensible to create a processor based on such a design? I have no idea what it might cost, and you would still need some peripheral circuitry to run it. What could processor instructions for Haskell support look like? Has anyone already thought in this direction?

This question is much more involved than you seem to be suggesting.
It's not just about adding "some instructions for Haskell support".
You have to think about how you want to express /every/ Haskell term
as a series of bits (preferably predictably many bits), and find a
(finite) combination of logical gates to do arbitrary computations
with them.
If you want to go anywhere in this direction, perhaps a good start
would be implementing a processor with instructions for (untyped)
lambda calculus. One approach for this could be to take a
(mathematical) model of lambda calculus and see how its elements can
be represented as natural numbers.
This implementation, I suspect, would be terribly inefficient. Think
about what the lambda application gate would look like in terms of
NAND gates. Yes, it can probably be done in theory. No, it won't be
pretty. And forget about practical.
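To make the lambda-calculus starting point concrete, here is a toy sketch in Haskell (all names are my own; this is an illustration, not a proposal): terms with de Bruijn indices, one-step normal-order beta reduction, and a simple injective encoding of terms as natural numbers via Cantor pairing.

```haskell
-- Untyped lambda calculus with de Bruijn indices.
data Term = Var Int | Lam Term | App Term Term
  deriving (Eq, Show)

-- Shift free variables >= c by d (standard de Bruijn machinery).
shift :: Int -> Int -> Term -> Term
shift d c (Var k) | k >= c    = Var (k + d)
                  | otherwise = Var k
shift d c (Lam t)   = Lam (shift d (c + 1) t)
shift d c (App s t) = App (shift d c s) (shift d c t)

-- Substitute s for variable j in a term.
subst :: Int -> Term -> Term -> Term
subst j s (Var k) | k == j    = s
                  | otherwise = Var k
subst j s (Lam t)   = Lam (subst (j + 1) (shift 1 0 s) t)
subst j s (App a b) = App (subst j s a) (subst j s b)

-- One normal-order beta-reduction step; Nothing if in normal form.
step :: Term -> Maybe Term
step (App (Lam t) s) = Just (shift (-1) 0 (subst 0 (shift 1 0 s) t))
step (App a b) = case step a of
  Just a' -> Just (App a' b)
  Nothing -> App a <$> step b
step (Lam t) = Lam <$> step t
step (Var _) = Nothing

-- Cantor pairing, a bijection N x N -> N, used to encode terms
-- injectively as natural numbers.
pair :: Integer -> Integer -> Integer
pair x y = (x + y) * (x + y + 1) `div` 2 + y

encode :: Term -> Integer
encode (Var k)   = 3 * fromIntegral k
encode (Lam t)   = 3 * encode t + 1
encode (App a b) = 3 * pair (encode a) (encode b) + 2
```

Even this tiny model makes the problem visible: `subst` walks the whole term graph, which is exactly the kind of operation that is painful to express as a fixed-width hardware primitive.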
Finally, a major advantage of a conventional "raw" machine language as an
instruction set is that it allows many optimizations, e.g. pipelining,
which I would say is the single most important reason that processors
run at GHz instead of MHz (Pentium 4 processors, famed for their high
clock speed, had 31 pipeline stages). I cannot imagine such
optimizations being possible in anything close to a "lambda calculus
processor".
What is the added value you hope to achieve?
On 19 January 2016 at 22:12, Henning Thielemann wrote: [...]

Limiting the scope for my own sanity here: there may yet be some
application in various hardware-level emulations of continuation-passing
calculi, perhaps building on static single assignment.
It might be possible to derive an interesting instruction set from the
sorts of intermediate representations we see in compiler infrastructures
like LLVM, but it is hard to guess how these hardware designs would benefit
Haskell, rather than the other way around.
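As a tiny illustration of the continuation-passing style mentioned above (the names are hypothetical, for illustration only): in CPS every intermediate result is handed to an explicit continuation, so control flow, much like in an SSA or LLVM-style IR, becomes a chain of explicit transfers that hardware could in principle follow.

```haskell
-- In CPS, a function never "returns"; it jumps to its continuation.
type Cont r a = (a -> r) -> r

addK :: Int -> Int -> Cont r Int
addK x y k = k (x + y)      -- "return" is a call to the continuation

squareK :: Int -> Cont r Int
squareK x k = k (x * x)

-- (1 + 2)^2 with the evaluation order spelled out as explicit jumps:
pipeline :: Cont r Int
pipeline k = addK 1 2 (\s -> squareK s k)
```

Running `pipeline id` drives the chain to completion; each lambda is the kind of "basic block with one successor" that SSA-style representations make explicit.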
Cheers,
Darren
On Jan 19, 2016 14:44, Auke Booij wrote: [...]

On Tue, 19 Jan 2016, Auke Booij wrote:
This question is much more involved than you seem to be suggesting. It's not just about adding "some instructions for Haskell support". You have to think about how you want to express /every/ haskell term as a series of bits (preferably predictably many bits), and find a (finite) combination of logical gates to do arbitrary computations with them.
I am not thinking about a radically different machine language, just a common imperative machine language with some added instructions for tasks often found in machine code generated from Haskell. E.g. mainstream processors support C function calls with special jump instructions and stack handling. Maybe there could be instructions that assist in handling thunks or Haskell function calls.
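For illustration, here is a toy model (my own simplification, not GHC's actual heap layout) of the enter/evaluate/update cycle that such a thunk-assist instruction would target: a heap cell is either an already-computed value or a suspended computation, and forcing it evaluates once and overwrites the cell in place.

```haskell
import Data.IORef

-- A heap cell: either an evaluated value or a suspended computation.
data Node = Value Int | Thunk (IO Int)

-- Force a cell: enter the thunk, evaluate it, and update the cell
-- in place so later forces just read the value. This load/inspect/
-- enter/update sequence is the candidate for hardware assistance.
force :: IORef Node -> IO Int
force cell = do
  node <- readIORef cell
  case node of
    Value v  -> return v               -- already evaluated
    Thunk io -> do
      v <- io                          -- enter the suspension
      writeIORef cell (Value v)        -- update in place
      return v
```

The test below confirms the defining property: forcing the same cell twice runs the suspended computation only once.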

On 20/01/2016, at 12:03 pm, Henning Thielemann wrote:
I am not thinking about a radically different machine language, just a common imperative machine language with some added instructions for tasks often found in machine code generated from Haskell. E.g. mainstream processors support C function calls with special jump instructions and stack handling. Maybe there could be instructions that assist in handling thunks or Haskell function calls.
I was at a presentation once where the speaker showed how (thanks to the fact that Prolog doesn't evaluate arguments in calls) calling a procedure and executing the procedure could be overlapped, getting a factor of 2 speedup for an important part of the code. At another presentation a speaker showed how using a special outboard coprocessor could dramatically speed up memory management. I suspect that neither technique would be much help on today's machines and for Haskell.

However, there is a hint here that doing something quite different might pay off. For example, if branch predictors don't do well with thunk handling, maybe there is a way of processing thunks that a quite different kind of branch predictor *might* cope with. Or maybe something that's expecting to process thousands of microthreads might not care about branch prediction. (Although that idea has been tried as a way of handling memory latency, I don't think it's been tried for Haskell.)

Perhaps you might look for something different: instead of 'faster on similar hardware' you might look at 'cheaper'. Could a specialised Haskell processor use less energy than a standard CPU?

Don't quit your day job, but don't be too sure there's nothing left to think of either.

On 20 Jan 2016, at 9:12 am, Henning Thielemann wrote:
I have learned that in today's x86 processors you can alter the instruction set, which is mainly used for bug fixes. Wouldn't it be interesting to add some instructions for Haskell support? However, I suspect that such a patch might be rendered invalid by new processor generations with changed internal details. Fortunately, there are processors that are designed for custom instruction set extensions: https://en.wikipedia.org/wiki/Xtensa
Your post assumes that the time to fetch/decode the instruction stream is a bottleneck, and that reducing the number of instructions will in some way make the program faster. A typical lazy GHC-compiled program spends much of its time building thunks and otherwise copying data between the stack and the heap. If it is blocked waiting on a data cache miss, then reducing the number of instructions won't help anything, at least if the fancy new instructions just tell the processor to do something that would lead to a cache miss anyway.

See: "Cache Performance of Lazy Functional Programs on Current Hardware" (2009), Arbob Ahmad and Henry DeYoung, http://www.cs.cmu.edu/~hdeyoung/15740/report.pdf

Indirect branches are also a problem (load an address from data memory, then jump to it), as branch predictors usually cannot deal with them. Slowdowns due to mispredicted branches could perhaps be mitigated by improving the branch predictor in a Haskell-specific way, but you might not need new instructions to do so.

Or another way of putting it: "If you tell a long story with fewer words, then it's still a long story."

Ben.
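The thunk-building heap traffic described above can be seen in a classic small example (a sketch; `foldl'` is the strict fold from Data.List): the lazy fold allocates one suspended `(+)` application per list element before any addition happens, while the strict fold keeps the accumulator evaluated and allocates no thunk chain. Both compute the same number; the difference is purely in memory traffic.

```haskell
import Data.List (foldl')

-- Lazy foldl builds a chain of (+) thunks on the heap, one per
-- element, before anything is added.
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- foldl' forces the accumulator at each step: no thunk chain,
-- far less heap and cache traffic.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0
```

Profiling the two with `+RTS -s` makes the allocation difference visible without any new hardware at all.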

On Wed, Jan 20, 2016 at 1:12 AM, Henning Thielemann <lemming@henning-thielemann.de> wrote: [...]
I remember reading a relevant paper: "The Reduceron Reconfigured and Re-evaluated", by Matthew Naylor and Colin Runciman.

On 19.01.2016 at 23:12, Henning Thielemann wrote:
Fortunately, there are processors that are designed for custom instruction set extensions: https://en.wikipedia.org/wiki/Xtensa
Unfortunately, the WP article does not say anything that couldn't be said about, say, an ARM core, other than that the Xtensa core is some VLIW design.
Would it be sensible to create a processor based on such a design?
Very, very unlikely, for multiple reasons.

Special-purpose CPUs have been built, most notably for Lisp, less notably for Java, and probably for other purposes that I haven't heard of. Invariably, their architectural advantages were obsoleted by economy of scale: mainstream CPUs are produced in such huge numbers that Intel etc. could afford more engineers to optimize every nook and cranny, more engineers to optimize the structure downscaling, and larger fabs that could produce more chips on equipment that is expensive up front but cheap per piece. In the end, the special-purpose chips were slower and more expensive. It's an extremely strong competition you are facing if you try this.

Also, it is very easy to misidentify the actual bottlenecks and make instructions for the wrong ones. If caching is the main bottleneck (which it usually is), no amount of CPU improvement will help you and you'll simply need a larger cache. Or, probably, a compiler that knows enough about the program and its data flow to arrange the data in a cache-line-friendly fashion.

I do not think this is going to be a long-term problem though. Pure languages have huge advantages for fine-grained parallel processing, and CPU technology is pushing towards multiple cores, so that's a natural match. As pure languages come into more widespread use, the engineers at Intel, AMD etc. will look at what the pure languages need, and add optimizations for these.

Just my 2 cents.
Jo

You are unnecessarily pessimistic; let me show you some things you
probably have not thought of or heard about.
2016-01-20 10:51 GMT+03:00, Joachim Durchholz wrote: [...]
Also, it is very easy to misidentify the actual bottlenecks and make instructions for the wrong ones. If caching is the main bottleneck (which it usually is), no amount of CPU improvement will help you and you'll simply need a larger cache. Or, probably, a compiler that knows enough about the program and its data flow to arrange the data in a cache-line-friendly fashion.
A demonstration from the industry, albeit not quite the hardware industry: http://www.disneyanimation.com/technology/innovations/hyperion - "Hyperion handles several million light rays at a time by sorting and bundling them together according to their directions. When the rays are grouped in this way, many of the rays in a bundle hit the same object in the same region of space. This similarity of ray hits then allows us – and the computer – to optimize the calculations for the objects hit."

Then, let me bring up an old idea of mine: https://mail.haskell.org/pipermail/haskell-cafe/2009-August/065327.html

Basically, we can group identical closures into vectors, ready for SIMD instructions to operate over them. The "vectors" should work just like Data.Vector.Unboxed: instead of a vector of tuples of arguments there should be a tuple of vectors with the individual arguments (and results to update for lazy evaluation). Combine this with sorting of addresses in the case of references and you can get a lot of speedup by doing... not much.
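The tuple-of-vectors layout can be illustrated with plain lists (a sketch with made-up field names, standing in for a closure's argument and accumulated result):

```haskell
-- Array-of-structures: one list of (argument, accumulator) pairs,
-- as if each pair were one closure's payload.
aos :: [(Double, Double)]
aos = [(1, 10), (2, 20), (3, 30)]

-- Structure-of-arrays: one list per field, the layout described
-- above for Data.Vector.Unboxed-style closure vectors. The same
-- operation can then sweep each field contiguously, which is what
-- makes SIMD execution possible.
toSoA :: [(Double, Double)] -> ([Double], [Double])
toSoA = unzip

-- Apply one function to every grouped closure's argument and fold
-- the results into the accumulators, field by field.
applyAll :: (Double -> Double)
         -> ([Double], [Double]) -> ([Double], [Double])
applyAll f (args, accs) = (args, zipWith (+) accs (map f args))
```

Data.Vector.Unboxed (from the `vector` package) performs exactly this unzip transparently for tuple element types; the lists here just make the layout visible.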

On 20.01.2016 at 13:16, Serguey Zefirov wrote:
You are unnecessarily pessimistic; let me show you some things you probably have not thought of or heard about.
Okaaaay...
A demonstration from the industry, albeit not quite hardware industry:
http://www.disneyanimation.com/technology/innovations/hyperion - "Hyperion handles several million light rays at a time by sorting and bundling them together according to their directions. When the rays are grouped in this way, many of the rays in a bundle hit the same object in the same region of space. This similarity of ray hits then allows us – and the computer – to optimize the calculations for the objects hit."
Sure. If you have a few gazillion of identical algorithms, you can parallelize on that. That's the reason why 3D cards even took off, the graphics pipeline grew processing capabilities and evolved into a (rather restricted) GPU core model. So it's not necessarily impossible to build something useful, merely very unlikely.
Then, let me bring up an old idea of mine: https://mail.haskell.org/pipermail/haskell-cafe/2009-August/065327.html
Basically, we can group identical closures into vectors, ready for SIMD instructions to operate over them. The "vectors" should work just like Data.Vector.Unboxed - instead of vector of tuple of arguments there should be a tuple of vectors with individual arguments (and results to update for lazy evaluation).
Combine this with sorting of addresses in case of references and you can get a lot of speedup by doing... not much.
Heh. Such stuff could work, *provided* that you can really make a case for having enough similar work.

Still, I'd work on making a model of that on GPGPU hardware first. Two advantages: 1) No hardware investment. 2) You can see what the low-hanging fruit are and get a rough first idea of how much parallelization really gives you.

The other approach: see what you can get out of a Xeon with really many cores (14, or even more). Compare the single-GPGPU vs. multi-GPGPU speedup with the single-CPU-core vs. multi-CPU-core speedup. That might provide insight into how much the interconnects and cache coherence protocols interfere with the multicore speedup.

Why am I so focused on multicore? Because that's where hardware is going to go: hardware isn't going to clock much higher, but people will still want to improve performance. Actually, I think that single-core improvements aren't going to be very important. First on my list would be exploiting multicore, second cache locality. There's more to be gained from those than from specialized hardware, IMVHO.

Regards,
Jo

Many attempts were made in this area in the late 80s/early 90s (and before, if you don't focus on lazy evaluation; search for "Lisp machine"). Search for "hardware graph reduction". The major problem has always been that producing a dedicated CPU is expensive, and there is no way to keep up with the progress in general-purpose processors, which can justify their investments with a hugely larger user base. At some point there were even attempts at placing computation in the memory itself (typically using very fine-grained combinators, SKI reduction and the like).

People have given up even on hardware support for small subproblems, such as garbage collection.

As for modifying the instruction set of an Intel processor, I don't know how feasible it is. But even if it is, consider that the entire architecture (pipelining, caching, prediction, speculative everything, etc.) is hugely optimized for the typical workload. You change that and all bets are off w.r.t. performance, and you may or may not be ahead of the same CPU executing normal code produced by a Haskell compiler.

On Tue, Jan 19, 2016 at 5:12 PM, Henning Thielemann <lemming@henning-thielemann.de> wrote: [...]

On 20.01.2016 at 18:09, Maurizio Vitale wrote:
As for modifying the instruction set of an Intel processor, I don't know how feasible it is.
Some CPUs allowed this. There were even designs that had PL/1 instructions implemented as assembly by way of microcode. In theory it's possible for newer Intel CPUs.

One point against doing so is that the microcode updates are 2-8 kBytes in size; you'd need to switch microcode with every context switch in the operating system. The other point is that microcode updates are encrypted and signed. Only Intel has the private keys needed to produce data that an Intel CPU will accept as a microcode update (this is supposed to prevent tampering with the microcode update data by malicious third parties).

Sources:
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32... pp. 338 ff. has information about the general format of an update.
http://www.delidded.com/how-to-update-cpu-microcode-in-award-or-phoenix-bios... lists typical microcode update sizes near the end (not sure whether these are in bytes or in DWORDs).

Regards,
Jo

More info on microcode updates can be found in https://www.dcddcc.com/pubs/paper_microcode.pdf

Salient points:
- Microcode is entirely undocumented.
- Microcode is indeed encrypted, though there seem to be loopholes.
- Applying a microcode update can take up to 2 million CPU cycles.
- Microcode updates could be an unexpected attack vector. Maybe.

Regards,
Jo
participants (9)
- Auke Booij
- Ben Lippmeier
- Darren Grant
- Gleb Popov
- Henning Thielemann
- Joachim Durchholz
- Maurizio Vitale
- Richard A. O'Keefe
- Serguey Zefirov