
There are a number of interesting issues raised by mbeddoe's Math.Statistics. 0. Coding issues. Why use foldr1 (*) instead of product? covm xs = split' (length xs) cs where cs = [ cov a b | a <- xs, b <- xs] split' n = unfoldr (\y -> if null y then Nothing else Just $ splitAt n y) seems a rather odd way to write covm xs = [[cov a b | b <- xs] | a <- xs] which in turn is a bit wasteful because the matrix must be symmetric. I believe the author may have misunderstood "numerically stable". The obvious (sum xs)/(fromIntegral $ length $ xs) is fine for the mean, and the obvious two-pass algorithm for the variance is fine. The tricky algorithms are needed for incremental one-pass calculation. 1. How do you decide how general to make things? mean, hmean: Floating a But only + and / are needed, so Fractional a is enough. gmean: Floating a You need + and /, so fair enough. median: Floating a, Ord a High median and low median (not provided) need only Ord a. Standard median needs + and /, so fair enough. modes: Ord a Right. range: Num a, Ord a But range needs only Ord a. avgdev: Floating a Right. pvar, var, stddev, skew, kurtosis, cov, covm: Floating a How should one define the variance of a collection of complex numbers? I'm not at all sure that it makes sense. If I were dealing with complex numbers, I might treat them as 2-vectors, in which case the "standard deviation" would be a 2x2 matrix, not a complex number. I suspect that these should require Floating a, Ord a. For skew this is quite certain: one looks for "positive" or "negative" skew, and this makes no sense for complex numbers. iqr: type not specified but appears to be :: [a] -> [a] This is certainly the wrong type as the inter-quartile range is supposed to be a NUMBER, not a sequence. You have to use comparisons and subtraction, so iqr :: Num a, Ord a => [a] -> a 2. Statistics This is an odd bunch of things; mostly things that are dangerous to use. A collection of the techniques from "Applications, Basics, and Computing" by Hoaglin and Velleman (and was there someone else? memory fails me) would be nice. A few basic robust measures like - median absolute deviation - trimmed mean - Winsorized standard deviation - percentage bend correlation might be nice. 3. How laziness could help. Suppose you ask for the mean, standard deviation, and skewness of a bunch of numbers using this library. Calculating the standard deviation will recalculate the mean. Calculating the skewness will recalculate the standard deviation, which will recalculate the mean. Robust statistics, like the trimmed mean, often want to sort the data. And you don't want to keep on sorting the data over and over again. This is a case where Haskell's laziness really shines: it translates into direct practical gains for simple code. data (Floating a, Ord a) => Simple_Continuous_Variate a = SCV [a] Int a a (Array Int a) list_to_variate xs = SCV xs n m s o where n = length xs m = sum xs / fromIntegral n s = sum [(x - m)^2 | x <- xs] / fromIntegral (n - 1) o = listArray (1,n) (sort xs) vLength (SCV _ n _ _ _) = n vMean (SCV _ _ m _ _) = m vSd (SCV _ _ _ s _) = s vMin (SCV _ _ _ _ a) = a ! 1 vMax (SCV _ n _ _ a) = a ! n vRange scv = vMax scv - vMin scv vMedian (SCV _ n _ _ a) | odd n = a ! ((n+1)`div`2) | even n = ((a ! l) + (a ! u))/2 where l = n `div` 2 u = n - l ..... The wonderful thing about this is that there are no bangs in SCV, so the various pieces of information don't get evaluated until they are needed, and then don't get evaluated *again*. For example, vRange will cause the sorted array to be built at most once. Trimmed mean is very little harder to write than vMedian. Esoteric uses of lazy evaluation abound, but this one makes sense even to an old number-crunching programmer. I once built something like this in C, but boy, it was a pain.

ok wrote:
There are a number of interesting issues raised by mbeddoe's Math.Statistics.
data (Floating a, Ord a) => Simple_Continuous_Variate a = SCV [a] Int a a (Array Int a)
list_to_variate xs = SCV xs n m s o where n = length xs m = sum xs / fromIntegral n s = sum [(x - m)^2 | x <- xs] / fromIntegral (n - 1) o = listArray (1,n) (sort xs)
vLength (SCV _ n _ _ _) = n vMean (SCV _ _ m _ _) = m vSd (SCV _ _ _ s _) = s vMin (SCV _ _ _ _ a) = a ! 1 vMax (SCV _ n _ _ a) = a ! n vRange scv = vMax scv - vMin scv vMedian (SCV _ n _ _ a) | odd n = a ! ((n+1)`div`2) | even n = ((a ! l) + (a ! u))/2 where l = n `div` 2 u = n - l .....
Math.Statistics eats many good names. I would also suggest offering a type class interface. Then you could operate on various containers besides a list: -- A class for extracting a summary number from a type. -- None of these make much sense without Ord. -- I could imagine using (+) and `div` so why require (/)? class (Ord b) => SingleStat a b | a -> b where samples :: a -> Integer -- lump in with SingleStat mean :: a -> b var :: a -> b stddev :: a -> b moment :: Int -> a -> b range :: a -> b min :: a -> b max :: a -> b median :: a -> b instance SingleStat [Integer] Integer where ... instance SingleStat [Double] Double where ... instance (Ix i) => SingleStat (Array i Double) Double where ... instance SingleStat (Simple_Continuous_Variate a) where ... -- And then I could make interesting things like Histogram, -- where the data is Int's but the statistics are Doubles: newtype Histogram = Histogram (Map Int Int) instance SingleStat Histogram Double where ... Cheers, Chris

On Wed, 26 Sep 2007, ChrisK wrote:
ok wrote:
There are a number of interesting issues raised by mbeddoe's Math.Statistics.
data (Floating a, Ord a) => Simple_Continuous_Variate a = SCV [a] Int a a (Array Int a)
list_to_variate xs = SCV xs n m s o where n = length xs m = sum xs / fromIntegral n s = sum [(x - m)^2 | x <- xs] / fromIntegral (n - 1) o = listArray (1,n) (sort xs)
vLength (SCV _ n _ _ _) = n vMean (SCV _ _ m _ _) = m vSd (SCV _ _ _ s _) = s vMin (SCV _ _ _ _ a) = a ! 1 vMax (SCV _ n _ _ a) = a ! n vRange scv = vMax scv - vMin scv vMedian (SCV _ n _ _ a) | odd n = a ! ((n+1)`div`2) | even n = ((a ! l) + (a ! u))/2 where l = n `div` 2 u = n - l .....
Math.Statistics eats many good names. I would also suggest offering a type class interface. Then you could operate on various containers besides a list:
If it's only about polymorphic list types, a type class for general list types may be enough. This works without multi-parameter type class. http://software.complete.org/listlike/ Maybe it can be generalized to Foldable. http://www.haskell.org/haskellwiki/Use_of_language_extensions

ok wrote:
I believe the author may have misunderstood "numerically stable". The obvious (sum xs)/(fromIntegral $ length $ xs) is fine for the mean,
That's probably my fault, out of ignorance. Do you know a good online resource about numeric stability? (I don't have the Knuth at home. Didn't he say something about the mean formula? Or was it the standard derivation?). Regards, apfelmus

On Wed, 26 Sep 2007, apfelmus wrote:
ok wrote:
I believe the author may have misunderstood "numerically stable". The obvious (sum xs)/(fromIntegral $ length $ xs) is fine for the mean,
That's probably my fault, out of ignorance. Do you know a good online resource about numeric stability? (I don't have the Knuth at home. Didn't he say something about the mean formula? Or was it the standard derivation?).
"standard deviation"? I expect that indeed variance (thus also standard deviation) by incremental computation suffers from cancellation.
participants (4)
-
apfelmus
-
ChrisK
-
Henning Thielemann
-
ok