Naive question on lists of duplicates

Hi, all. I have an exceedingly simple problem to address, and am wondering if there are relatively straightforward ways to improve the efficiency of my solution. The task is simply to look at a lengthy list of stock keeping units (SKUs -- what retailers call individual items), stores, dates that a promotion started, dates the promotion ended, and something like sales amount; we want to pull out the records where promotions overlap. I will have dates in yyyymmdd format, so there's probably no harm in treating them as Ints. My suggestion went something like this (I'm not at my desk so I don't have exactly what I typed):
data PromotionRec = PR { sku :: String, store :: String,
                         startDate :: Int, endDate :: Int, amount :: Float }

match :: PromotionRec -> [PromotionRec] -> [PromotionRec]
match x ys = filter (meet x) ys

nomatch :: PromotionRec -> [PromotionRec] -> [PromotionRec]
nomatch x ys = filter (nomeet x) ys

meet :: PromotionRec -> PromotionRec -> Bool
meet x y = startDate x <= endDate y && startDate y <= endDate x

nomeet :: PromotionRec -> PromotionRec -> Bool
nomeet x y = not (meet x y)

overlaps :: [PromotionRec] -> [[PromotionRec]]
overlaps []     = []
overlaps (x:xs) = (x : match x xs) : overlaps (nomatch x xs)

overlap :: [PromotionRec] -> [[PromotionRec]]
overlap xs = filter g1 (overlaps xs)
  where g1 ys = length ys > 1
What I sent might have been slightly different, and I might have some typos in the above (as I'm composing from memory at the keyboard), but that's the rough idea: treat this like a quicksort, pulling out the records that overlap the first record, consing the first record onto the list, and then consing the resulting list onto the same function applied to the records that didn't overlap. As a last step, delete the records that were not on promotion while anything else was on promotion.

I'm pretty confident that this will be more efficient than my colleague's SAS code, as he was comparing each record to every other record (giving n(n-1) comparisons). It seems like this, in the worst case where everything is on promotion at distinct times, will compare the first record to (n-1) records, the second to (n-2) records, etc., giving n(n-1)/2 comparisons. Thus, while this is worst-case O(n^2), it seems like it should have at most half as much work to do as the earlier approach in SAS. On the other hand, in the best case, when everything is on promotion at the same time, there should be something like n-1 comparisons made and then that should be it. So it seems like, depending on how frequently promotions co-occur for this retailer, the above should be somewhere between O(n) and O(n^2) time complexity. Since these people are always having some sale or other, I suspect this won't be much worse than O(n log n).

Is there anything that is an obvious improvement in efficiency -- some clever tricks with the fold functions, some way of loading the data into some sort of convenient structure first, etc. -- that I could easily implement?

Please share your thoughts.

Many thanks,
Jack Stecher
jack.stecher@retek.com
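For concreteness, here is a hypothetical call of the overlap function above on three invented records (sample data made up for illustration, not from the original message), showing which group survives the final filter:

-- Invented sample data: sku1 and sku2 overlap in early June; sku3 runs alone in July.
demo :: [[String]]
demo = map (map sku)
           (overlap [ PR "sku1" "s01" 20030601 20030607 10.0
                    , PR "sku2" "s01" 20030605 20030612 12.5
                    , PR "sku3" "s02" 20030701 20030703  7.0 ])
-- demo == [["sku1","sku2"]]; the singleton group containing only sku3 is filtered out.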

> I'm pretty confident that this will be more efficient than my colleague's SAS code, as he was comparing each record to every other record (giving n(n-1) comparisons). It seems like this, in the worst case where everything is on promotion at distinct times, will compare the first record to (n-1) records, the second to (n-2) records, etc., giving n(n-1)/2 comparisons. Thus, while this is worst-case O(n^2), it seems like it should have at most half as much work to do as the earlier approach in SAS. On the other hand, in the best case, when everything is on promotion at the same time, there should be something like n-1 comparisons made and then that should be it. So it seems like, depending on how frequently promotions co-occur for this retailer, the above should be somewhere between O(n) and O(n^2) time complexity. Since these people are always having some sale or other, I suspect this won't be much worse than O(n log n).
> Is there anything that is an obvious improvement in efficiency -- some clever tricks with the fold functions, some way of loading the data into some sort of convenient structure first, etc -- that I could easily implement?
Firstly I'm assuming that you are working with a granularity of days, and that each promotion will always have a relatively small maximum number of days. If so, how about something like the following:

1: Filter the existing data structure, resulting in a lazy list of tuples (a, b) where a is a day number and b is an identifier for the promotion. Where a promotion spans n days, the list will contain n entries, one for each day of the promotion. Complexity is O(M x N), where M is the (small) maximum number of days. If this is fixed *and* small, we can discard it and regard the complexity as just O(N).

2: Sort the list with a as the primary key and b as the secondary key. Complexity should be near enough O(N log N).

3: Traverse the results of the sort, outputting a lazy list of lists such that the elements of each sub-list are references to the promotions that overlap for one specific day number. Where no overlap is detected for one specific day, that day can simply be ignored. Complexity should be linear.

4: Sort this list, and discard duplicates. Complexity should be O(N log N) for the sort and O(N) for the 'uniq'.

5: You are now left with a list which describes all overlapping promotions.

Total complexity should effectively be O(N + N log N + N + N log N + N), which of course just collapses to O(N log N).

-- Sarah Thompson <sarah@telergy.com> | http://findatlantis.com/
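A minimal Haskell sketch of these steps, added for illustration: it reuses the PromotionRec record from the original post, identifies a promotion by its (sku, store) pair, and (anachronistically for this 2003 thread) uses the modern Data.Time.Calendar module to enumerate days from yyyymmdd Ints; nub stands in for the sort-plus-uniq of step 4.

import Data.List (groupBy, nub, sort)
import Data.Time.Calendar (Day, fromGregorian)

-- Turn a yyyymmdd Int into a Day so a promotion's days can be enumerated.
toDay :: Int -> Day
toDay ymd = fromGregorian (fromIntegral (ymd `div` 10000))
                          (ymd `div` 100 `mod` 100)
                          (ymd `mod` 100)

-- Step 1: one (day, promotion identifier) pair per day of each promotion.
expandDays :: [PromotionRec] -> [(Day, (String, String))]
expandDays rs = [ (d, (sku r, store r))
                | r <- rs
                , d <- [toDay (startDate r) .. toDay (endDate r)] ]

-- Steps 2-4: sort by day, group the pairs that share a day, keep only days
-- with two or more promotions, and discard duplicate groups.
overlapGroups :: [PromotionRec] -> [[(String, String)]]
overlapGroups rs =
  nub [ map snd dayGroup
      | dayGroup <- groupBy sameDay (sort (expandDays rs))
      , length dayGroup > 1 ]
  where sameDay a b = fst a == fst b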

On Thu, Jun 05, 2003 at 08:09:02AM -0500, Stecher, Jack wrote:
> The task is simply to look at a lengthy list of stock keeping units (SKUs -- what retailers call individual items), stores, dates that a promotion started, dates the promotion ended, and something like sales amount; we want to pull out the records where promotions overlap. I will have dates in yyyymmdd format, so there's probably no harm in treating them as Ints.
(Disclaimer: provided I've understood your problem correctly...)

For `heap' below, read `Data.FiniteMap'.

Create an empty heap[0].
For each record,
  For each day the promotion runs,
    Insert the record[1] into the heap with the date as the key.
    If it clashes with an existing element, then we've an overlap:
      Store both records in another heap keyed on the SKU.
      We can ignore clashes.
    Don't forget to insert the rest of the dates.

(I'm sure there's plenty of optimisations that can be done here...)

[1] Or at least the SKU, or anything that can uniquely identify the promotion.

As Sarah said, _provided_ there's a small bound on the number of days that a promotion can run for, then we can assume it to be a constant, resulting in an O(n log n) algorithm. Your dates are already Ord'ered, so that's good. The only niggle I see is that you're going to have to figure out some way of generating all the dates for which a promotion runs, which is going to be messy. Thankfully (I think) the System.Time module should take care of this for you, if you can massage your existing dates into some workable form...

/Liyang

[0] Or some other suitable data structure. I like heaps because you can avoid costly O(n^2) duplicate checking by paying a price of O(log n) for each insertion. :)

-- Liyang HU | http://nerv.cx/ | Caius@Cam | ICQ: 39391385
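A rough sketch of this idea, added for illustration: Data.FiniteMap has long since been superseded, so this uses Data.Map instead, again reuses the PromotionRec record, identifies a promotion by its (sku, store) pair, and simply collects the clashing records into per-day lists rather than a second SKU-keyed heap. The toDay helper repeats the yyyymmdd-to-Day conversion sketched above.

import qualified Data.Map as Map
import Data.Time.Calendar (Day, fromGregorian)

type PromoId = (String, String)   -- (sku, store), assumed to identify a promotion

toDay :: Int -> Day               -- same yyyymmdd-to-Day conversion as above
toDay ymd = fromGregorian (fromIntegral (ymd `div` 10000))
                          (ymd `div` 100 `mod` 100)
                          (ymd `mod` 100)

-- Insert every record under each day it runs; a day whose list ends up with
-- more than one entry is a "clash" in the sense described above.
byDay :: [PromotionRec] -> Map.Map Day [PromoId]
byDay = foldr insertRec Map.empty
  where
    insertRec r m =
      foldr (\d -> Map.insertWith (++) d [(sku r, store r)]) m
            [toDay (startDate r) .. toDay (endDate r)]

-- Collect the groups of promotions that share at least one day.
clashes :: [PromotionRec] -> [[PromoId]]
clashes rs = [ ps | ps <- Map.elems (byDay rs), length ps > 1 ]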

On Thu, Jun 05, 2003 at 08:09:02AM -0500, Stecher, Jack wrote:
> I have an exceedingly simple problem to address, and am wondering if there are relatively straightforward ways to improve the efficiency of my solution.
Was there actually a problem with the efficiency of your first code?
> The task is simply to look at a lengthy list of stock keeping units (SKUs -- what retailers call individual items), stores, dates that a promotion started, dates the promotion ended, and something like sales amount; we want to pull out the records where promotions overlap. I will have dates in yyyymmdd format, so there's probably no harm in treating them as Ints.
(Unless this is really a one-shot deal, I suspect using Ints for dates is a bad decision...)
> My suggestion went something like this (I'm not at my desk so I don't have exactly what I typed):
I have a different algorithm, which should be nearly optimal, but I find it harder to describe than to show the code (which is untested):
import List(sortBy, insertBy)
data PromotionRec = PR {sku :: String, store :: String, startDate :: Int, endDate :: Int, amount::Float}
compareStart, compareEnd :: PromotionRec -> PromotionRec -> Ordering
compareStart x y = compare (startDate x) (startDate y)
compareEnd   x y = compare (endDate x) (endDate y)

overlap :: [PromotionRec] -> [[PromotionRec]]
overlap l = filter (\ys -> length ys > 1) (overlap' [] (sortBy compareStart l))

overlap' _ [] = []
overlap' active (x:xs) =
  let active' = dropWhile (\y -> endDate y < startDate x) active
  in  (x : active') : overlap' (insertBy compareEnd x active') xs
The key is that, by keeping a list of the currently active promotions in order sorted by the ending date, we only need to discard an initial portion of the list. You could get a moderately more efficient implementation by keeping the active list as a heap rather than a list.

Peace,
Dylan
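For illustration, a hypothetical run of this version on three invented records (sample data not from the thread):

-- Invented sample data: two promotions overlap in early June, one runs alone in July.
demoSweep :: [[String]]
demoSweep = map (map sku)
                (overlap [ PR "sku1" "s01" 20030601 20030607 10.0
                         , PR "sku2" "s01" 20030605 20030612 12.5
                         , PR "sku3" "s02" 20030701 20030703  7.0 ])
-- demoSweep == [["sku2","sku1"]]: when sku2 arrives, sku1 is still active, so
-- they form a group; the singleton groups for sku1 and sku3 are filtered out.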
participants (4)
- Dylan Thurston
- Liyang HU
- Sarah Thompson
- Stecher, Jack