@sven and @henning :
i'm actually doing some preliminary work to add save and restore for FPU state to the GHC RTS, at the green/haskell thread layer. after first ripping out x87 code gen, which just needs some more docs written out before its merged in. note that i'm speaking specifically of the MXCSR register save and restore, not the more hefty operations you might be thinking.
FPU mode state save and restore is done already on EVERY OS when switching threads/processes, and in the agner fog latency tables the cost of manipulating mxcsr registers is pretty small!
LDMXCSR (restore) and STMXCSR (save) have cpu latencies at like 5-20 cycles (more often 8-15), so having the current C ffi calls set the default C FPU environment (as we currently have ordinarily) is super doable to ensure no breakage of existing C bindings, plus have a new ccall variant that inherits the host haskell thread FPU state. we're talking sub 10 nanosecond overhead on x86 and x86_64 platforms (and either way, on those platforms soon ghc will only be using the sse2 or higher ).
point being: aside from like AMD piledriver micro architecture and some stuff from VIA, the performance of the CPU instruction for the signalling nans state setup and related rounding mode etc, should work perfectly well,
@Daniel Cartwright I do not support documenting false laws in any enshrined way, it will result in broken code. (Also i'm actually working to do some fixes, if you reread my remarks and merijn's, and i think we can have our cake and eat it, with the finest floats). Lets fix stuff and then document true laws!