Hi again folks,
I am still at it with my time-series problem, for those who haven't been
following; I have a list of (time stamp, value) pairs and I need to do various
bits and bobs with them. I have got arithmetic down pat now, thanks to the
kind help of various members of the list - now I am looking at functions that
look at some historical data in the time-series and do some work on that to
give me an answer for a particular day.
I have chosen to represent my time series in reverse date order, since non of
the operations will ever want to look into the future, but often they would
like to look in to the past.
A function I would like to write is 'avg'. For a particular day, it computes
the average of the values last 'n' points; if there are not n points to fetch,
thee is no answer. I then combine those to make a new time series.
e.g.
If my input time series was
[(5,10),(4,20),(3,30),(2,40), (1,50)]
(Where 5, 4, 3, 2, 1 are timestamps and 10, 20, 30, 50, 50 are values)
I would like the answer
[(5,20), (4,30), (3,40)]
(e.g. 20 = (10+20+30)/3 etc.. I can't get an answer for timestamps 2 and 1
because there isn't enough historical data)
So I have written some code to do this, and it works nicely enough; but it is
_slow_. To do 1000 averages of different lengths on a series with only 3000
points takes about 200 seconds on my (not overly shabby) laptop. The
equivalent C program takes under a second.
I am entirely sure that this is due to some failing on my part. I have been
mucking around with the profiler all afternoon lazifying and delazifying
various bits and bobs with no dramatic success so I thought I might put it to
y'all if you don't mind!
So here's some code. I've kept it quite general because there are a lot of
functions I would like to implement that do similar things with bits of
historical data.
General comments on the Haskellyness/goodness of my code are welcomed as well,
I'm still very much a beginner at this!
--------- SNIP --------------
-- Take n elements from a list if at least n exist
takeMaybe n l | length l < n = Nothing
| otherwise = Just $! (take n l)
-- Little utility function, take a function f and apply it to the whole list,
-- then the tail etc...
lMap _ [] = []
lMap f (x:xs) = (f (x:xs)):(lMap f xs)
-- Little utility function to take a list containing Maybes and delete them
-- Returning a list with the values inside the Just
maybeListToList [] = []
maybeListToList (x:xs) = maybe (maybeListToList xs)
(\y -> y:(maybeListToList xs))
x
-- Return a list of lists, where each sublist is a list of the next n values
histMaybe x = lMap (takeMaybe x)
hist n x = maybeListToList $ histMaybe n x
-- Take a function which works on a list of things and apply it only to a
-- list of the second elements in a list of tuples 'l'.
applyToValues f l = let (ts,vs) = unzip l
in zip ts $ f vs
-- Create a timeseries with the cumulative sum of the last n values
cumL n l = map sum (hist n l)
cum = applyToValues . cumL
-- Creates a timeseries with the average of the last n values
avgL n l = map ((*) (1/fromIntegral(n))) $ cumL n l
avg = applyToValues . avgL
--------- SNIP --------------
According to the profiler (log attached), the vast majority of the time is
spent in takeMaybe, presumably allocating and deallocating enormous amounts of
memory for each of my little temporary sublists. I have tried liberally
sprinkling $! and 'seq' about, thinking that might help but I am clearly not
doing it right.
Perhaps list is the wrong basic data structure for what I am doing?
I hope I didn't bore you with that rather long email, I will leave it at that.
If it would be useful, I could give you the complete program with a data set
if anyone is keen enough to try for themselves.
Thanks,
Philip