Performance

22 Nov 2009

      Hi again folks, 

I am still at it with my time-series problem, for those who haven't been 
following; I have a list of (time stamp, value) pairs and I need to do various 
bits and bobs with them. I have got arithmetic down pat now, thanks to the 
kind help of various members of the list - now I am looking at functions that 
look at some historical data in the time-series and do some work on that to 
give me an answer for a particular day.

I have chosen to represent my time series in reverse date order, since non of 
the operations will ever want to look into the future, but often they would 
like to look in to the past.

A function I would like to write is 'avg'. For a particular day, it computes 
the average of the values last 'n' points; if there are not n points to fetch, 
thee is no answer. I then combine those to make a new time series.

e.g.

If my input time series was

[(5,10),(4,20),(3,30),(2,40), (1,50)]

(Where 5, 4, 3, 2, 1 are timestamps and 10, 20, 30, 50, 50 are values)

I would like the answer

[(5,20), (4,30), (3,40)]

(e.g. 20 = (10+20+30)/3 etc.. I can't get an answer for timestamps 2 and 1 
because there isn't enough historical data)

So I have written some code to do this, and it works nicely enough; but it is 
_slow_. To do 1000 averages of different lengths on a series with only 3000 
points takes about 200 seconds on my (not overly shabby) laptop. The 
equivalent C program takes under a second.

I am entirely sure that this is due to some failing on my part. I have been 
mucking around with the profiler all afternoon lazifying and delazifying 
various bits and bobs with no dramatic success so I thought I might put it to 
y'all if you don't mind!

So here's some code. I've kept it quite general because there are a lot of 
functions I would like to implement that do similar things with bits of 
historical data.

General comments on the Haskellyness/goodness of my code are welcomed as well, 
I'm still very much a beginner at this!

--------- SNIP --------------

-- Take n elements from a list if at least n exist
takeMaybe n l | length l < n = Nothing
              | otherwise    = Just $! (take n l)

-- Little utility function, take a function f and apply it to the whole list, 
-- then the tail etc...
lMap _ []     = []
lMap f (x:xs) = (f (x:xs)):(lMap f xs)

-- Little utility function to take a list containing Maybes and delete them
-- Returning a list with the values inside the Just
maybeListToList [] = []
maybeListToList (x:xs) = maybe (maybeListToList xs)
                               (\y -> y:(maybeListToList xs))
                               x

-- Return a list of lists, where each sublist is a list of the next n values
histMaybe x = lMap (takeMaybe x)
hist n x = maybeListToList $ histMaybe n x

-- Take a function which works on a list of things and apply it only to a
-- list of the second elements in a list of tuples 'l'.
applyToValues f l = let (ts,vs) = unzip l
                                in zip ts $ f vs

-- Create a timeseries with the cumulative sum of the last n values
cumL n l = map sum (hist n l)
cum = applyToValues . cumL                                                                                                            

-- Creates a timeseries with the average of the last n values
avgL n l = map ((*) (1/fromIntegral(n))) $ cumL n l
avg = applyToValues . avgL

--------- SNIP --------------

According to the profiler (log attached), the vast majority of the time is 
spent in takeMaybe, presumably allocating and deallocating enormous amounts of 
memory for each of my little temporary sublists. I have tried liberally 
sprinkling $! and 'seq' about, thinking that might help but I am clearly not 
doing it right.

Perhaps list is the wrong basic data structure for what I am doing?

I hope I didn't bore you with that rather long email, I will leave it at that. 
If it would be useful, I could give you the complete program with a data set 
if anyone is keen enough to try for themselves.

Thanks,

Philip

Philip Scott

Daniel Fischer

Alex Dunlap

tags

participants (3)