implementing RealFloat decodeFloat

Hi everyone,

I've been working on [0] Haskell bindings for [1] libqd for [2]
double-double and quad-double arithmetic, and have been struggling to
implement [3] RealFloat, in particular [4] decodeFloat, mostly because
of its postcondition but also because of some issues relating to
variable precision:

---8<---
If decodeFloat x yields (m,n), then x is equal in value to m*b^^n,
where b is the floating-point radix, and furthermore, either m and n
are both zero or else b^(d-1) <= m < b^d, where d is the value of
floatDigits x. In particular, decodeFloat 0 = (0,0).
---8<---

(BTW: that should probably really be "... <= abs m < ..."; perhaps this
code could be added to the report and/or documentation, if it is
correct:

validDecode :: RealFloat f => f -> Bool
validDecode f = case decodeFloat f of
    (0, 0) -> True
    (m, e) -> b ^ (d - 1) <= abs m && abs m < b ^ d
  where
    b = floatRadix f
    d = floatDigits f
)

The double-double and quad-double types do not have a fixed precision,
though they do have a fixed minimum precision: this means that
decodeFloat will (in general) lose some precision. This is because, for
example, in:

data DoubleDouble = DoubleDouble !CDouble !CDouble
dd = DoubleDouble a b

the semantics are that "dd = a + b", with |a| >> |b|; coupled with the
IEEE implicit leading 1 bit, this means that there may be large gaps
between exponents: for example, "1 + 0.5**100 :: DoubleDouble".

So far I've got a "mostly working" implementation, thus:

decodeFloat !(DoubleDouble a b) = case (decodeFloat a, decodeFloat b) of
    ((0, 0), (0, 0))     -> (0, 0)
    ((0, 0), (m, e))     -> (m `shiftL` f, e - f)
    ((m, e), (0, 0))     -> (m `shiftL` f, e - f)
    ((m1, e1), (m2, e2)) ->
      let fixup m e =  -- renormalise until the mantissa lies in [mMin, mMax)
            if m < mMin then fixup (m `shiftL` 1) (e - 1)
            else if m >= mMax then fixup (m `shiftR` 1) (e + 1)
            else (m, e)
          mMin = 1 `shiftL` (ff - 1)
          mMax = 1 `shiftL` ff
          ff = floatDigits (0 :: DoubleDouble)
          g = e1 - e2 - f  -- align the low mantissa with the shifted high one
      in fixup ((m1 `shiftL` f) + (m2 `shiftR` g)) (e1 - f)
  where f = floatDigits (0 :: CDouble)

This does meet the postcondition as specified (violating it leads to
breakage in other RealFloat methods), but it has a recursion with no
termination proof (so far), and it is lossy in general:
let check f = uncurry encodeFloat (decodeFloat f) == f
check (1 + 0.5 ** 120 :: DoubleDouble)
False
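(To make the loss concrete: assuming floatDigits (0 :: DoubleDouble) is
2 * 53 = 106, i.e. twice the CDouble precision, the exact value
1 + 0.5**k needs a mantissa of k + 1 bits, so it survives the clamp to
floatDigits only while k + 1 <= 106; under that assumption
1 + 0.5**100 should round-trip, while 1 + 0.5**120 cannot. The
hypothetical helper below just counts those bits:

-- bit length of the integer 2^k + 1, i.e. the mantissa width needed
-- to hold 1 + 0.5**k exactly
exactMantissaWidth :: Int -> Int
exactMantissaWidth k =
  length (takeWhile (> 0) (iterate (`div` 2) (2 ^ k + 1 :: Integer)))

-- exactMantissaWidth 100 == 101   -- fits in the assumed 106 bits
-- exactMantissaWidth 120 == 121   -- does not fit, so the tail is dropped
)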
It does however seem to meet a weaker condition:
let check2 f = (decodeFloat . (`asTypeOf` f) . uncurry encodeFloat . decodeFloat $ f) == decodeFloat f
check2 (1 + 0.5 ** 120 :: DoubleDouble)
True
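(The weaker condition can also be stated as a QuickCheck property, so it
can be exercised over many values rather than a single example. This is
only a sketch: it assumes Arbitrary and Show instances for DoubleDouble,
which the bindings may not provide:

import Test.QuickCheck

-- decodeFloat should be stable under one encodeFloat/decodeFloat round trip
-- (assumes Arbitrary and Show instances for DoubleDouble)
prop_decodeStable :: DoubleDouble -> Bool
prop_decodeStable f =
  decodeFloat (uncurry encodeFloat (decodeFloat f) `asTypeOf` f)
    == decodeFloat f

Running quickCheck prop_decodeStable would then test the same property
as check2 on many random values.)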
Questions:

1. Is this weaker condition likely to be good enough in practice?

2. Can my implementation of decodeFloat be simplified?

Thanks for any insights,


Claude

[0] http://hackage.haskell.org/package/qd
[1] http://crd.lbl.gov/~dhbailey/mpdist/
[2] http://crd.lbl.gov/~dhbailey/dhbpapers/arith15.pdf
[3] http://hackage.haskell.org/packages/archive/base/latest/doc/html/Prelude.htm...
[4] http://hackage.haskell.org/packages/archive/base/latest/doc/html/Prelude.htm...

--
http://claudiusmaximus.goto10.org

Claude Heiland-Allen wrote:
I've been working on [0] Haskell bindings for [1] libqd for [2] double-double and quad-double arithmetic,
At the ICIAM conference in Zurich in 2007, I heard about the summation algorithm using two double-precision floating-point numbers in three talks. :-) Nice to know that those ideas can be found in a library.
The double-double and quad-double types do not have a fixed precision, though they do have a fixed minimum precision: this means that decodeFloat will (in general) lose some precision. This is because for example in:
data DoubleDouble = DoubleDouble !CDouble !CDouble
dd = DoubleDouble a b
the semantics are that "dd = a + b", with |a|>>|b|; coupled with the IEEE implicit leading 1 bit, this means that there may be large gaps between exponents: for example: "1 + 0.5**100 :: DoubleDouble".
They are possible, but do they really occur? I thought the intention of using two CDoubles is to double the precision, not more. I can hardly imagine that the library's algorithms make use of numbers like (1 + 0.5**100). Does the library specify anything about maximum precision?

I would just represent (1 + 0.5**100) as largeInteger * 2^(-largeExponent). You can represent all such numbers without loss of precision, but with loss of memory. :-)
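A minimal sketch of that idea (with DoubleDouble as defined in the
original mail; the result is exact, but the mantissa is no longer
bounded by floatDigits, so this cannot serve as decodeFloat itself):

import Data.Bits (shiftL)
import Foreign.C.Types (CDouble)

data DoubleDouble = DoubleDouble !CDouble !CDouble  -- as in the original mail

-- combine both components exactly into a single Integer mantissa:
-- a + b = ma*2^ea + mb*2^eb = (ma*2^(ea-e) + mb*2^(eb-e)) * 2^e
-- where e = min ea eb, so no bits are dropped
exactDecode :: DoubleDouble -> (Integer, Int)
exactDecode (DoubleDouble a b) =
  let (ma, ea) = decodeFloat a
      (mb, eb) = decodeFloat b
      e = min ea eb
  in (ma `shiftL` (ea - e) + mb `shiftL` (eb - e), e)
-- (DoubleDouble 0 0 still yields (0, 0), since decodeFloat 0 = (0, 0))

The price is exactly the "loss of memory" above: the mantissa width
grows with the gap between the two exponents.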
participants (2):
- Claude Heiland-Allen
- Henning Thielemann