sorting almost sorted list - Beginners - Haskell.org

newer
Builds with cabal, but not with...

sorting almost sorted list

older
Re: [Haskell-beginners] sorting...

Dennis Raddle

27 Sep 2011 27 Sep '11

10:46 a.m.

I have a problem from music. I have a list of notes sorted by begin time. I want to sort them by end time. Notes that are sorted by begin time are usually close to sorted by end time, because notes tend to cluster within a small range of durations. What algorithms are available to me (hopefully in a library) that are good with this kind of thing? Dennis

Reply

Sign in to reply online Use email software

Show replies by date

David Place

27 Sep 27 Sep

11:21 a.m.

There are probably to few notes in a piece of music to worry about algorithm complexity. The standard sort function should work just fine. _____________________ David F. Place Owner, Panpipes Ho!, LLC http://panpipesho.com On Sep 26, 2011, at 10:46 PM, Dennis Raddle wrote:

I have a problem from music. I have a list of notes sorted by begin time. I want to sort them by end time.

Reply

Sign in to reply online Use email software

Yitzchak Gale

4:08 p.m.

Dennis Raddle wrote:

I have a problem from music. I have a list of notes sorted by begin time. I want to sort them by end time.

Notes that are sorted by begin time are usually close to sorted by end time, because notes tend to cluster within a small range of durations.

What algorithms are available to me (hopefully in a library) that are good with this kind of thing?

Sorry, I don't know of any library that handles this special case. I'm also not too familiar with the literature, so I don't know of anyone who writes about it. (It's probably worthwhile to have a look in Knuth.) Anyway, it seems like this is not too hard. Can you guarantee for some value of m that for each note N, only the first m notes following N might end earlier than N? If so, then the following simple algorithm is linear and runs in constant space. You could then use: sortAlmostBy m (comparing endTime) sortAlmostBy :: Int -> (a -> a -> Ordering) -> [a] -> [a] sortAlmostBy m cmp = mergeAABy cmp . map (sortBy cmp) . chunksOf m chunksOf :: Int -> [a] -> [[a]] chunksOf m = map (take m) . takeWhile (not . null) . iterate (drop m) -- Merge a list of lists with two assumptions: -- (1) Each list is sorted. -- (2) The sequence of lists is "almost ascending" - -- no element of a list is greater than any element of -- any list following it, except possibly the list immediately -- following it. mergeAABy :: (a -> a -> Ordering) -> [[a]] -> [a] mergeAABy cmp ((x:xs):xss) = y : mergeAABy cmp yss where (y, yss) = pickLeast x xs xss pickLeast p ps pss@((q:qs):qss) = case cmp p q of GT -> (q, (p : ps) : qs : qss) _ -> (p, ps : pss) pickLeast p ps (_:pss) = pickLeast p ps pss pickLeast p ps _ = (p, [ps]) mergeAABy cmp (_:xss) = mergeAABy cmp xss mergeAABy _ _ = [] You might be able to do a little better than this. Here is one way: GHC would probably optimize this better if you make pickLeast non-recursive by arranging for lists that become empty to be eliminated from the calculation earlier. But as David points out, this is probably good enough for your application, even if you are processing a full orchestra score of a Mahler symphony that lasts for hours. (Whereas just using sortBy for that case might be slow.) Regards, Yitz

Reply

Sign in to reply online Use email software

Daniel Fischer

28 Sep 28 Sep

6 a.m.

On Tuesday 27 September 2011, 04:46:29, Dennis Raddle wrote:

I have a problem from music. I have a list of notes sorted by begin time. I want to sort them by end time.

Notes that are sorted by begin time are usually close to sorted by end time, because notes tend to cluster within a small range of durations.

What algorithms are available to me (hopefully in a library) that are good with this kind of thing?

Data.List.sort/sortBy should be fine most of the time. Most algorithms that perform significantly better in some situations are rather specific and not available in libraries. Nobody will implement them until they're needed, and then the implementor may think that it's not worth publishing as a library as it's too specialised. Thus I doubt you'll find a library function adapted to your special needs. Leaving that aside, which algorithms are best depends on various things. The most important are, I think, what kind of almost-sortedness you have and what your space and laziness requirements are. Most sorting algorithms require O(n) space and can't lazily produce the output[1], but in your situation, that is possible (assuming the data is sufficiently nice, if the last note to begin is the first to end, you need O(n) space and can't produce incremental output). Regarding the kind of almost-sortedness, if you have long monotonic runs with few out-of-place elements in between, like [2 .. 1000] ++ [1] ++ [1002 .. 5000] ++ [1001] ++ [5001 .. 10000], Data.List.sort[By] will be quite good, less so if you have some jittering superimposed on a monotonic list, like concat [[n+1,n] | n <- [0, 2 .. 10000]]. I suppose your situation is more like the second. Then an insertion sort usually does rather well, it can often outperform algorithms with lower (worst case) complexity. (If the average displacement needed for sorting is d, insertion sort takes O(n*(d+1)) time; if d is much smaller than log n, insertion sort is very good.) Like Data.List's mergesort, insertion sort is generic, needs O(n) space, and can't produce incremental output. It is easy to implement, inSortBy cmp = foldr ins [] where ins x [] = [x] ins x (y:ys) = case cmp x y of GT -> y : ins x ys _ -> x : y : ys but all is not roses; for long input lists, that builds a large thunk which may blow your stack. Insertion sort works best on mutable arrays where you don't have the problem of building large thunks. However, it also works well on short enough lists. Benchmarking with Yitz's list generator (version 2), it can be more than twice as fast as Yitz's sortAlmostBy when the lists are at most a few thousand elements long and the average displacement is small, but it is much slower if the lists are a few ten- thousand elements long or the average displacement is large. Better at dealing with longer lists and larger displacements is the left- fold version of insertion sort, linSortBy cmp = reverse . foldl' ins [] where ins [] a = [a] ins l@(b:bs) a = case cmp a b of LT -> case ins bs a of k@(_:_) -> b : k [] -> error "impossible" _ -> a : l (note: This is adapted to the case of an almost sorted list, for an almost reverse-sorted list, you'd use the obvious modification of ins and not need the reverse at the end, giving better performance. For a more or less random list, no version of insertion sort does well.) The strictness is essential, using foldl instead of foldl' leads to worse performance than the right fold gives, making the insertion ins less strict isn't nearly as bad, but still hurts - a bit if the displacements are small, more if they're large. This version of insertion sort can keep up with sortAlmostBy much longer, but it suffers from the same problem as the right-fold insertion sort, only less. Moving to specifically tailored sorting algorithms, we have Yitz's nice sortAlmostBy. It requires that you know a bound for the number of notes beginning not before but ending before any given note, but if you do, it doesn't suffer too badly if you overestimate (as long as your estimate is still small relative to the length of the list). Of course, if you underestimate, it'll probably produce wrong results. Then, more specifically tailored to the problem, the algorithm posted by David Fletcher earlier, taking advantage of the fact that you know the starting times of the notes and that a note can't end before it began (and that the list is sorted by starting time). My (more generic) implementation of it: sortABy :: (a -> a -> Ordering) -> (a -> a -> Ordering) -> [a] -> [a] sortABy _ _ [] = [] sortABy cmp1 cmp2 (x:xs) = go [] x xs where ins a [] = [a] ins a l@(b:bs) = case cmp1 a b of LT -> a : l _ -> case ins a bs of k@(_:_) -> b : k [] -> error "oops" go !store y [] = y : store go store y zzs@(z:zs) = case cmp2 y z of GT -> case cmp1 y z of GT -> go (ins y store) z zs _ -> go (ins z store) y zs _ -> y : case store of (u:us) -> go us u zzs [] -> go [] z zs The first comparison function compares the end times, the second compares the end time of the first argument to the starting time of the second. As for the left-fold insertion sort, making the insertion stricter gains a bit of speed (or more than a bit for larger displacements). It has the advantage over sortAlmostBy that you don't need to know a bound, and under favourable circumstances (if there never are many notes beginning during any given note's lifetime), can be much faster (much meaning something like a factor of 1.5 or so). However, it doesn't take larger displacements well, and in extreme cases degrades to a badly adapted insertion sort. You could prevent that by using a good heap/priority queue for the store instead of the list. That would make it slower in the good cases, but guarantee better worst-case behaviour. Both specialised algorithms can produce incremental output and run in constant space if the preconditions are satisfied. Which is better depends on the nature of your data. Cheers, Daniel [1] Generic sorting algorithms can at best produce incremental output after the entire input has been partially processed, since the minimum could occur at any position. Thus they necessarily use at least O(n) space. [ignoring obscenities like sort [] = [] sort xs = let (x,ct) = findMinCount xs in replicate ct x ++ sort' x (recalculate xs) findMinCount (x:xs) = go 1 x xs where go k x [] = (x,k) go k x (y:ys) | y < x = go 1 y ys | y == x = go (k+1) x ys | otherwise = go k x ys sort' x xs = case dropWhile (<= x) xs of [] -> [] (y:ys) -> let (z,ct) = findMinCount1 x 1 y ys in replicate ct z ++ sort' z (recalculate xs) findMinCount1 x k y [] = (y,k) findMinCount1 x k y (z:zs) | z <= x = findMinCount1 x k y zs | z < y = findMinCount1 x 1 z zs | z == y = findMinCount1 x (k+1) y zs | otherwise = findMinCount1 x k y zs which relies on (==) identifying only truly indistinguishable values and a hypothetical 'recalculate' which recalculates the list lazily in O(1) space. But it was fun to come up with.]

Reply

Sign in to reply online Use email software

5163

Age (days ago)

5163

Last active (days ago)

Download

3 comments

4 participants

tags

participants (4)

Daniel Fischer
David Place
Dennis Raddle
Yitzchak Gale