
Hi again folks, I am still at it with my time-series problem, for those who haven't been following; I have a list of (time stamp, value) pairs and I need to do various bits and bobs with them. I have got arithmetic down pat now, thanks to the kind help of various members of the list - now I am looking at functions that look at some historical data in the time-series and do some work on that to give me an answer for a particular day. I have chosen to represent my time series in reverse date order, since non of the operations will ever want to look into the future, but often they would like to look in to the past. A function I would like to write is 'avg'. For a particular day, it computes the average of the values last 'n' points; if there are not n points to fetch, thee is no answer. I then combine those to make a new time series. e.g. If my input time series was [(5,10),(4,20),(3,30),(2,40), (1,50)] (Where 5, 4, 3, 2, 1 are timestamps and 10, 20, 30, 50, 50 are values) I would like the answer [(5,20), (4,30), (3,40)] (e.g. 20 = (10+20+30)/3 etc.. I can't get an answer for timestamps 2 and 1 because there isn't enough historical data) So I have written some code to do this, and it works nicely enough; but it is _slow_. To do 1000 averages of different lengths on a series with only 3000 points takes about 200 seconds on my (not overly shabby) laptop. The equivalent C program takes under a second. I am entirely sure that this is due to some failing on my part. I have been mucking around with the profiler all afternoon lazifying and delazifying various bits and bobs with no dramatic success so I thought I might put it to y'all if you don't mind! So here's some code. I've kept it quite general because there are a lot of functions I would like to implement that do similar things with bits of historical data. General comments on the Haskellyness/goodness of my code are welcomed as well, I'm still very much a beginner at this! --------- SNIP -------------- -- Take n elements from a list if at least n exist takeMaybe n l | length l < n = Nothing | otherwise = Just $! (take n l) -- Little utility function, take a function f and apply it to the whole list, -- then the tail etc... lMap _ [] = [] lMap f (x:xs) = (f (x:xs)):(lMap f xs) -- Little utility function to take a list containing Maybes and delete them -- Returning a list with the values inside the Just maybeListToList [] = [] maybeListToList (x:xs) = maybe (maybeListToList xs) (\y -> y:(maybeListToList xs)) x -- Return a list of lists, where each sublist is a list of the next n values histMaybe x = lMap (takeMaybe x) hist n x = maybeListToList $ histMaybe n x -- Take a function which works on a list of things and apply it only to a -- list of the second elements in a list of tuples 'l'. applyToValues f l = let (ts,vs) = unzip l in zip ts $ f vs -- Create a timeseries with the cumulative sum of the last n values cumL n l = map sum (hist n l) cum = applyToValues . cumL -- Creates a timeseries with the average of the last n values avgL n l = map ((*) (1/fromIntegral(n))) $ cumL n l avg = applyToValues . avgL --------- SNIP -------------- According to the profiler (log attached), the vast majority of the time is spent in takeMaybe, presumably allocating and deallocating enormous amounts of memory for each of my little temporary sublists. I have tried liberally sprinkling $! and 'seq' about, thinking that might help but I am clearly not doing it right. Perhaps list is the wrong basic data structure for what I am doing? I hope I didn't bore you with that rather long email, I will leave it at that. If it would be useful, I could give you the complete program with a data set if anyone is keen enough to try for themselves. Thanks, Philip

Am Sonntag 22 November 2009 18:59:04 schrieb Philip Scott:
Hi again folks,
I am still at it with my time-series problem, for those who haven't been following; I have a list of (time stamp, value) pairs and I need to do various bits and bobs with them. I have got arithmetic down pat now, thanks to the kind help of various members of the list - now I am looking at functions that look at some historical data in the time-series and do some work on that to give me an answer for a particular day.
I have chosen to represent my time series in reverse date order, since non of the operations will ever want to look into the future, but often they would like to look in to the past.
A function I would like to write is 'avg'. For a particular day, it computes the average of the values last 'n' points; if there are not n points to fetch, thee is no answer. I then combine those to make a new time series.
e.g.
If my input time series was
[(5,10),(4,20),(3,30),(2,40), (1,50)]
(Where 5, 4, 3, 2, 1 are timestamps and 10, 20, 30, 50, 50 are values)
I would like the answer
[(5,20), (4,30), (3,40)]
(e.g. 20 = (10+20+30)/3 etc.. I can't get an answer for timestamps 2 and 1 because there isn't enough historical data)
So I have written some code to do this, and it works nicely enough; but it is _slow_. To do 1000 averages of different lengths on a series with only 3000 points takes about 200 seconds on my (not overly shabby) laptop. The equivalent C program takes under a second.
I am entirely sure that this is due to some failing on my part. I have been mucking around with the profiler all afternoon lazifying and delazifying various bits and bobs with no dramatic success so I thought I might put it to y'all if you don't mind!
So here's some code. I've kept it quite general because there are a lot of functions I would like to implement that do similar things with bits of historical data.
General comments on the Haskellyness/goodness of my code are welcomed as well, I'm still very much a beginner at this!
--------- SNIP --------------
-- Take n elements from a list if at least n exist takeMaybe n l | length l < n = Nothing
| otherwise = Just $! (take n l)
Ouch, that makes your algorithm quadratic already. Checking "length l < n" must trverse the entire list: 3000 nodes + 2999 nodes + 2998 nodes + you get the idea. takeMaybe n l | null $ drop (n-1) l = Nothing | otherwise = Just (take n l) Or a variation, case splitAt (n-1) l of (a,h:t) -> Just (a ++ [h]) _ -> Nothing (test which is faster, play with various sorts of strictness,...)
-- Little utility function, take a function f and apply it to the whole list, -- then the tail etc... lMap _ [] = [] lMap f (x:xs) = (f (x:xs)):(lMap f xs)
lMap f = map f . tails (Data.List.tails and Data.List.inits are often useful, more idiomatic anyway)
-- Little utility function to take a list containing Maybes and delete them -- Returning a list with the values inside the Just maybeListToList [] = [] maybeListToList (x:xs) = maybe (maybeListToList xs) (\y -> y:(maybeListToList xs)) x
Look at Data.Maybe.catMaybes
-- Return a list of lists, where each sublist is a list of the next n values histMaybe x = lMap (takeMaybe x) hist n x = maybeListToList $ histMaybe n x
map (take n) $ takeWhile (not . null . drop (n-1)) $ tails xs
-- Take a function which works on a list of things and apply it only to a -- list of the second elements in a list of tuples 'l'. applyToValues f l = let (ts,vs) = unzip l in zip ts $ f vs
-- Create a timeseries with the cumulative sum of the last n values cumL n l = map sum (hist n l) cum = applyToValues . cumL
-- Creates a timeseries with the average of the last n values avgL n l = map ((*) (1/fromIntegral(n))) $ cumL n l
map (/fromIntegral n), surely?
avg = applyToValues . avgL
--------- SNIP --------------
According to the profiler (log attached), the vast majority of the time is spent in takeMaybe, presumably allocating and deallocating enormous amounts of memory for each of my little temporary sublists. I have tried liberally sprinkling $! and 'seq' about, thinking that might help but I am clearly not doing it right.
Perhaps list is the wrong basic data structure for what I am doing?
I hope I didn't bore you with that rather long email, I will leave it at that. If it would be useful, I could give you the complete program with a data set if anyone is keen enough to try for themselves.
Thanks,
Philip

On Sun, Nov 22, 2009 at 05:59:04PM +0000, Philip Scott wrote:
Hi again folks,
I am still at it with my time-series problem, for those who haven't been following; I have a list of (time stamp, value) pairs and I need to do various bits and bobs with them. I have got arithmetic down pat now, thanks to the kind help of various members of the list - now I am looking at functions that look at some historical data in the time-series and do some work on that to give me an answer for a particular day.
I have chosen to represent my time series in reverse date order, since non of the operations will ever want to look into the future, but often they would like to look in to the past.
A function I would like to write is 'avg'. For a particular day, it computes the average of the values last 'n' points; if there are not n points to fetch, thee is no answer. I then combine those to make a new time series.
e.g.
If my input time series was
[(5,10),(4,20),(3,30),(2,40), (1,50)]
(Where 5, 4, 3, 2, 1 are timestamps and 10, 20, 30, 50, 50 are values)
I would like the answer
[(5,20), (4,30), (3,40)]
(e.g. 20 = (10+20+30)/3 etc.. I can't get an answer for timestamps 2 and 1 because there isn't enough historical data)
So I have written some code to do this, and it works nicely enough; but it is _slow_. To do 1000 averages of different lengths on a series with only 3000 points takes about 200 seconds on my (not overly shabby) laptop. The equivalent C program takes under a second.
I am entirely sure that this is due to some failing on my part. I have been mucking around with the profiler all afternoon lazifying and delazifying various bits and bobs with no dramatic success so I thought I might put it to y'all if you don't mind!
So here's some code. I've kept it quite general because there are a lot of functions I would like to implement that do similar things with bits of historical data.
General comments on the Haskellyness/goodness of my code are welcomed as well, I'm still very much a beginner at this!
--------- SNIP --------------
-- Take n elements from a list if at least n exist takeMaybe n l | length l < n = Nothing | otherwise = Just $! (take n l)
-- Little utility function, take a function f and apply it to the whole list, -- then the tail etc... lMap _ [] = [] lMap f (x:xs) = (f (x:xs)):(lMap f xs)
-- Little utility function to take a list containing Maybes and delete them -- Returning a list with the values inside the Just maybeListToList [] = [] maybeListToList (x:xs) = maybe (maybeListToList xs) (\y -> y:(maybeListToList xs)) x
-- Return a list of lists, where each sublist is a list of the next n values histMaybe x = lMap (takeMaybe x) hist n x = maybeListToList $ histMaybe n x
-- Take a function which works on a list of things and apply it only to a -- list of the second elements in a list of tuples 'l'. applyToValues f l = let (ts,vs) = unzip l in zip ts $ f vs
-- Create a timeseries with the cumulative sum of the last n values cumL n l = map sum (hist n l) cum = applyToValues . cumL
-- Creates a timeseries with the average of the last n values avgL n l = map ((*) (1/fromIntegral(n))) $ cumL n l avg = applyToValues . avgL
--------- SNIP --------------
According to the profiler (log attached), the vast majority of the time is spent in takeMaybe, presumably allocating and deallocating enormous amounts of memory for each of my little temporary sublists. I have tried liberally sprinkling $! and 'seq' about, thinking that might help but I am clearly not doing it right.
Perhaps list is the wrong basic data structure for what I am doing?
I hope I didn't bore you with that rather long email, I will leave it at that. If it would be useful, I could give you the complete program with a data set if anyone is keen enough to try for themselves.
Thanks,
Philip
Sun Nov 22 17:28 2009 Time and Allocation Profiling Report (Final)
test +RTS -p -hc -RTS
total time = 162.98 secs (8149 ticks @ 20 ms) total alloc = 47,324,561,080 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
takeMaybe Main 62.2 45.9 cumL Main 36.2 52.4
individual inherited COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0 main Main 297 0 0.0 0.0 0.0 0.0 readCurve TsdbFile 298 0 0.0 0.0 0.0 0.0 CAF Main 260 4 0.0 0.0 100.0 100.0 avgL Main 281 1 0.0 0.0 0.2 0.2 cumL Main 282 1 0.1 0.1 0.2 0.2 hist Main 283 1 0.0 0.0 0.1 0.1 histMaybe Main 285 1 0.0 0.0 0.1 0.1 takeMaybe Main 287 2543 0.1 0.1 0.1 0.1 lMap Main 286 2544 0.0 0.0 0.0 0.0 maybeListToList Main 284 2544 0.0 0.0 0.0 0.0 avg Main 276 1 0.0 0.0 0.0 0.0 applyToValues Main 277 1 0.0 0.0 0.0 0.0 main Main 266 1 0.0 0.0 99.8 99.8 avg Main 288 0 0.2 0.2 99.1 99.1 avgL Main 290 0 0.0 0.0 98.8 98.4 cumL Main 291 0 36.1 52.3 98.8 98.4 hist Main 292 999 0.0 0.0 62.6 46.1 histMaybe Main 294 999 0.0 0.0 62.4 46.0 takeMaybe Main 296 1542456 62.1 45.8 62.1 45.8 lMap Main 295 1542456 0.3 0.2 0.3 0.2 maybeListToList Main 293 1542456 0.2 0.1 0.2 0.1 applyToValues Main 289 999 0.2 0.5 0.2 0.5 @+ Main 272 1000 0.0 0.0 0.6 0.6 mergeStep Main 275 1000 0.3 0.2 0.4 0.4 v Main 300 0 0.0 0.1 0.0 0.1 t Main 299 0 0.1 0.1 0.1 0.1 add Main 273 1000 0.0 0.0 0.1 0.2 binaryValueFunc Main 274 1544001 0.1 0.2 0.1 0.2 sendCurve GuiLink 268 1 0.0 0.0 0.1 0.0 putCurve GuiLink 271 1545 0.1 0.0 0.1 0.0 readCurve TsdbFile 267 1 0.0 0.0 0.0 0.0 CAF Data.Typeable 258 1 0.0 0.0 0.0 0.0 CAF GHC.IOBase 236 3 0.0 0.0 0.0 0.0 CAF GHC.Read 234 1 0.0 0.0 0.0 0.0 CAF GHC.Float 233 1 0.0 0.0 0.0 0.0 CAF Text.Read.Lex 227 6 0.0 0.0 0.0 0.0 CAF GHC.Int 222 1 0.0 0.0 0.0 0.0 CAF Data.HashTable 213 2 0.0 0.0 0.0 0.0 CAF GHC.Handle 211 5 0.0 0.0 0.0 0.0 main Main 279 0 0.0 0.0 0.0 0.0 readCurve TsdbFile 280 0 0.0 0.0 0.0 0.0 CAF GHC.Conc 210 1 0.0 0.0 0.0 0.0 CAF System.Posix.Internals 192 1 0.0 0.0 0.0 0.0 CAF TsdbFile 181 5 0.0 0.0 0.0 0.0 getCurve TsdbFile 278 1 0.0 0.0 0.0 0.0 CAF Data.Binary.IEEE754 180 6 0.0 0.0 0.0 0.0 CAF Data.Binary.Get 179 2 0.0 0.0 0.0 0.0 CAF Data.Binary.Put 151 1 0.0 0.0 0.0 0.0 CAF GuiLink 145 2 0.0 0.0 0.0 0.0 CAF Network 144 1 0.0 0.0 0.0 0.0 CAF Network.Socket 143 5 0.0 0.0 0.0 0.0 main Main 269 0 0.0 0.0 0.0 0.0 sendCurve GuiLink 270 0 0.0 0.0 0.0 0.0 CAF Network.BSD 139 1 0.0 0.0 0.0 0.0
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners
Without a dataset, I don't know if this is any faster than what you have, but I think it's a fair bit prettier, so you might have more luck starting with this: -- | windows 3 [1..5] = [[1,2,3],[2,3,4],[3,4,5]] windows :: Int -> [a] -> [[a]] windows n xs = foldr (zipWith (:)) (repeat []) (take n (iterate (drop 1) xs)) and then averaging each list. Hope that helps. Alex
participants (3)
-
Alex Dunlap
-
Daniel Fischer
-
Philip Scott