Re: [Haskell-beginners] sorting almost sorted list

I wrote:
Can you guarantee for some value of m that for each note N, only the first m notes following N might end earlier than N? If so, then the following simple algorithm is linear and runs in constant space. You could then use: sortAlmostBy m (comparing endTime) ...You might be able to do a little better than this.
After running some tests, it seems empirically that you can use m `div` 2 instead of m for lists of even length, which gives you a nice speedup for larger values of m. For lists of prime length, the best you can do seems to be m-1. It's more complicated for lists of non-prime odd length. I can't immediately see why any of that is true. Daniel to the rescue perhaps?

test n k = fmap (all isSorted . map (sortAlmostBy n compare)) $ rands k

isSorted xs = and . zipWith (<=) xs $ drop 1 xs

rands d = fmap (take 100 . map (concat . zipWith (map . (+)) [0,d ..] . chunksOf d) . chunksOf 1000 . randomRs (0, d-1)) newStdGen

Thanks,
Yitz
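(For reference: the sortAlmostBy implementation mentioned above isn't quoted in this part of the thread. The following is a minimal sketch of the chunk-sort-and-merge idea it describes, reconstructed for illustration only and assuming chunksOf from the split package; it is not the code that was actually posted.)

import Data.List (sortBy)
import Data.List.Split (chunksOf)   -- from the split package

-- Reconstruction sketch, not the posted implementation.  Precondition (as
-- stated in the thread): for each element, only the m elements immediately
-- following it may compare smaller.  Assumes m >= 1.
sortAlmostBy :: Int -> (a -> a -> Ordering) -> [a] -> [a]
sortAlmostBy m cmp = go . map (sortBy cmp) . chunksOf m
  where
    go (xs : ys : rest)
      | (y0 : _) <- ys =
          -- everything in xs that is <= the minimum of the next chunk
          -- can no longer be overtaken, so it is emitted immediately
          let (done, pending) = span (\x -> cmp x y0 /= GT) xs
          in  done ++ go (merge pending ys : rest)
      | otherwise = go (xs : rest)   -- empty chunks never arise from chunksOf
    go [xs] = xs
    go []   = []

    -- ordinary merge of two sorted lists
    merge xs@(x:xt) ys@(y:yt)
      | cmp x y == GT = y : merge xs yt
      | otherwise     = x : merge xt ys
    merge xs [] = xs
    merge [] ys = ys

With that sketch, the suggested call sortAlmostBy m (comparing endTime) would sort the notes as long as the precondition holds.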

I wrote:
After running some tests, it seems empirically that you can use m `div` 2 instead of m for lists of even length, which gives you a nice speedup for larger values of m. For lists of prime length, the best you can do seems to be m-1. It's more complicated for lists of non-prime odd length. I can't immediately see why any of that is true.
Well, obviously, that weird behavior has to do with the regularity of my test data. If I randomize more, the results become more believable. It seems that m-1 or m-2 work empirically, but that could just be because the probability of a collision is extremely low.

The result for the more regularly shaped test data, where the results have an interesting dependence on prime factorization, is still fascinating. I hope Daniel will comment. :)

Here is the more randomized test data (sorry, I really should be using QuickCheck for this instead of doing it by hand):

test n k = fmap (all isSorted . map (sortAlmostBy n compare)) $ rands2 k

isSorted xs = and . zipWith (<=) xs $ drop 1 xs

rands1 d = do
    cs <- replicateM (2000 `div` d) $ randomRS (0, d-1)
    fmap concat $ zipWithM mkChunk (scanl (+) 0 cs) cs
  where
    mkChunk x y = replicateM y $ randomRS (x, x+y-1)

rands2 d = fmap (evalState . replicateM 100 $ rands1 d) newStdGen

Sorry to keep replying to myself here. I wrote:
Here is the more randomized test data:
I forgot one function:

randomRS = state . randomR

"state" and "evalState" are from Control.Monad.Trans.State.
test n k = fmap (all isSorted . map (sortAlmostBy n compare)) $ rands2 k
isSorted xs = and . zipWith (<=) xs $ drop 1 xs
rands1 d = do
    cs <- replicateM (2000 `div` d) $ randomRS (0, d-1)
    fmap concat $ zipWithM mkChunk (scanl (+) 0 cs) cs
  where
    mkChunk x y = replicateM y $ randomRS (x, x+y-1)
rands2 d = fmap (evalState . replicateM 100 $ rands1 d) newStdGen
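(The snippets above leave their imports implicit; the following is my reading of what they need, not part of the original messages.)

import Control.Monad (replicateM, zipWithM)
import Control.Monad.Trans.State (evalState, state)   -- transformers package
import Data.List.Split (chunksOf)                     -- split package, used by rands
import System.Random (newStdGen, randomR, randomRs)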

On Tuesday 27 September 2011, 13:46:52, Yitzchak Gale wrote:
I wrote:
After running some tests, it seems empirically that you can use m `div` 2 instead of m for lists of even length, which gives you a nice speedup for larger values of m. For lists of prime length, the best you can do seems to be m-1. It's more complicated for lists of non-prime odd length. I can't immediately see why any of that is true.
Well, obviously, that weird behavior has to do with the regularity of my test data. If I randomize more, the results become more believable. It seems that m-1 or m-2 work empirically, but that could just be because the probability of a collision is extremely low.
Yes. With less regular data, we need to consider all chunks of length d in the input, not only those starting at index k*d. If any of them meets three m-chunks (where m is the parameter to sortAlmostBy), the precondition of sortAlmostBy may be violated. The smallest m to guarantee the preconditions is then m = d-1. But with less regular data, if the list is almost monotonically increasing, the probability that the last element of a d-chunk is smaller than the first is significantly lower than 1/2, so you have a much bigger chance that m = d-2 will work than in the regular case for odd d. [To disambiguate: parenthesise as ... than (in the ... odd d).]
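(A quick way to check that last claim, with a helper of my own rather than code from the thread: a window of d consecutive positions meets at most two m-chunks at every start offset exactly when m >= d-1.)

-- does every window of d consecutive positions intersect at most two of the
-- m-chunks [0..m-1], [m..2m-1], ... ?
meetsAtMostTwo :: Int -> Int -> Bool
meetsAtMostTwo m d = all ok [0 .. m - 1]   -- only the offset mod m matters
  where
    -- number of m-chunks the window [s .. s+d-1] intersects
    ok s = (s + d - 1) `div` m - s `div` m + 1 <= 2

-- e.g. meetsAtMostTwo 11 12 == True, but meetsAtMostTwo 10 12 == False;
-- the smallest m with meetsAtMostTwo m d is d-1 (for d >= 2).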

On Tuesday 27 September 2011, 11:32:35, Yitzchak Gale wrote:
I wrote:
Can you guarantee for some value of m that for each note N, only the first m notes following N might end earlier than N? If so, then the following simple algorithm is linear and runs in constant space. You could then use: sortAlmostBy m (comparing endTime) ...You might be able to do a little better than this.
After running some tests, it seems empirically that you can use m `div` 2 instead of m for lists of even length, which gives you a nice speedup for larger values of m. For lists of prime length, the best you can do seems to be m-1. It's more complicated for lists of non-prime odd length.
I can't immediately see why any of that is true.
Daniel to the rescue perhaps?
I have to confess, I'm not even sure what you mean. Okay, m is the maximal number of following notes that may end earlier than the current. But to which list(s) does the length refer?
test n k = fmap (all isSorted . map (sortAlmostBy n compare)) $ rands k
With what parameters did you call test?

Prelude Almost System.Random> test 12 16
True
(0.12 secs, 178509776 bytes)
Prelude Almost System.Random> test 12 17
False
(0.00 secs, 592904 bytes)

The only big variation in runtimes I find is due to a violation of the preconditions of sortAlmostBy, when "all" short-circuits. (But most of the runtime is used for generating the lists. If I completely evaluate such a list of lists prior to sorting, then m has a measurable impact on time even when "all" doesn't short-circuit; however, in that situation it's basically monotonic in m, as expected.)
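(For illustration, one way to evaluate the generated lists completely before sorting, along the lines described above; the helper below is a sketch of mine assuming the deepseq package, not code from the thread. It uses the thread's rands, isSorted and sortAlmostBy.)

import Control.DeepSeq (deepseq)

testForced :: Int -> Int -> IO Bool
testForced n k = do
    xss <- rands k                  -- generate the test lists
    xss `deepseq`                   -- force every element before doing any sorting
        return (all isSorted (map (sortAlmostBy n compare) xss))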
isSorted xs = and . zipWith (<=) xs $ drop 1 xs
rands d = fmap (take 100 . map (concat . zipWith (map . (+)) [0,d ..] . chunksOf d) . chunksOf 1000 . randomRs (0, d-1)) newStdGen
You generate 100 lists of length 1000, such that the entries at index k*d <= i < (k+1)*d are between k*d (inclusive) and (k+1)*d (exclusive).

Now, if you split those lists into chunks of length m, the preconditions of sortAlmostBy are only guaranteed to hold if each of the d-long chunks from the generation meets at most two m-long chunks. That is the case if (and only if, if the entire list is long enough)

    m + gcd m d >= d

If m + gcd m d < d, you will sooner or later encounter a d-chunk meeting three (or more) m-chunks. The probability that the first element of that d-chunk is larger than the last is nearly 1/2 [(d-1)/2d], so it doesn't take many such situations until you get a non-sorted list with sortAlmostBy m.

So, if running time is monotonic in m, under the assumption that the preconditions of sortAlmostBy hold, the optimal choice of m is

    min { m \in N : m + gcd m d >= d }

For even d, that's d/2; for d prime, it's d-1; generally, it's (1 - 1/p)*d, where p is the smallest prime factor of d.

I haven't analysed your merging function, but at first glance, merging c chunks of length m each is O(c*m); so, since sorting each chunk is O(m*log m), it'd overall be

    O(c*m*log m) = O(n*log m) = O(n*log (n/c))

It has larger overhead than sortBy, so if n is small or m is large, sortBy is likely to be better, but if n is large and m relatively small, sortAlmostBy can be significantly (but not dramatically, excluding extreme cases where you can do much better) faster.
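(A tiny helper of mine, not from the thread, to make the optimal-m formula above concrete:)

optimalM :: Int -> Int
optimalM d = head [m | m <- [1 ..], m + gcd m d >= d]

-- optimalM 12 == 6    (even d: d/2)
-- optimalM 13 == 12   (prime d: d-1)
-- optimalM 15 == 10   (smallest prime factor 3: (1 - 1/3) * 15)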

Hi Daniel,

Thanks for clearing up the mystery about the dependence on prime factorization! Very interesting.
at first glance, merging c chunks of length m each is O(c*m); so, since sorting each chunk is O(m*log m), it'd overall be
O(c*m*log m) = O(n*log m) = O(n*log (n/c))
No. In the given use case m is a small constant and c is proportional to n, so the overall complexity is O(n).
It has larger overhead than sortBy,
That's true. But it's asymptotically better. Given Dennis' use case, it sounds like m will probably always be < 10. Input size will typically be a few hundred, but could be a few thousand. sortBy would indeed be just fine in those cases.
so if n is small or m is large, sortBy is likely to be better, but if n is large and m relatively small, sortAlmostBy can be significantly (but not dramatically, excluding extreme cases where you can do much better) faster.
At most m will be on the order of 100 if we need to process all the parts of a large symphony orchestra combined together (a very unlikely use case admittedly), and in that case the input size could be as much as 10^6, or possibly even more for something like the Mahler. (Come to think of it, it's pretty amazing that Mahler wrote so many notes and actually got people to play them.)

There I believe sortAlmostBy could be quite an improvement. I'd be interested to hear ideas to make it even better though.

Thanks,
Yitz

On Tuesday 27 September 2011, 18:34:01, Yitzchak Gale wrote:
Hi Daniel,
Thanks for clearing up the mystery about the dependence on prime factorization! Very interesting.
at first glance, merging c chunks of length m each is O(c*m); so, since sorting each chunk is O(m*log m), it'd overall be
O(c*m*log m) = O(n*log m) = O(n*log (n/c))
No. In the given use case m is a small constant and c is proportional to n, so the overall complexity is O(n).
Not a contradiction. If m is a constant (> 1), O(n*log m) = O(n) (and sorting chunks of length m is O(m*log m) = O(1) then).
It has larger overhead than sortBy,
That's true. But it's asymptotically better.
Given Dennis' use case, it sounds like m will probably always be < 10. Input size will typically be a few hundred, but could be a few thousand. sortBy would indeed be just fine in those cases.
so if n is small or m is large, sortBy is likely to be better, but if n is large and m relatively small, sortAlmostBy can be significantly (but not dramatically, excluding extreme cases where you can do much better) faster.
At most m will be on the order of 100 if we need to process all the parts of a large symphony orchestra combined together (a very unlikely use case admittedly), and in that case the input size could be as much as 10^6, or possibly even more for something like the Mahler.
Yeah, but log (10^6) is still a fairly small number, so counting reduction steps, it wouldn't be dramatically better. However, I forgot to take memory usage into account, so in terms of wall-clock (or CPU) time, it could be dramatically better even for not too large n and not too small m.
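(To put rough numbers of my own on that: for n = 10^6 and m = 100,

    n * log2 n  ~  10^6 * 20   ~  2 * 10^7  comparisons for sortBy
    n * log2 m  ~  10^6 * 6.6  ~  7 * 10^6  comparisons for the chunked sort

so roughly a factor of three in reduction steps, before memory effects are taken into account.)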
(Come to think of it, it's pretty amazing that Mahler wrote so many notes and actually got people to play them.)
People do stranger things. People listen to Wagner, for example.
There I believe sortAlmostBy could be quite an improvement.
Certainly. (I just included Data.List.sort in the benchmark; for these short lists, it takes less than twice as long as sortAlmostBy.)
I'd be interested to hear ideas to make it even better though.
mergeAABy :: (a -> a -> Ordering) -> [[a]] -> [a]
mergeAABy cmp ((x:xs):(y:ys):zss) = foo x xs y ys zss
  where
    foo u us v vs wss =
      case cmp u v of
        GT -> v : case vs of
                    (b:bs) -> foo u us b bs wss
                    []     -> u : us ++ mergeAABy cmp wss
        _  -> u : case us of
                    (c:cs) -> foo c cs v vs wss
                    []     -> case wss of
                                ((w:ws):kss) -> foo v vs w ws kss
                                _            -> v : vs
mergeAABy _ [xs] = xs
mergeAABy _ _    = []

seems to be a bit faster (I have made the lists longer, replaced 2000 with 5000):

dafis@schwartz:~/Haskell/BeginnersTesting> ./benchAlmost -nis 20 -kis 20
(20,20)      -- print parameters to make sure they're evaluated
284090563    -- print sum of all random Ints to make sure they're evaluated
warming up
estimating clock resolution...
mean is 2.183860 us (320001 iterations)
found 41404 outliers among 319999 samples (12.9%)
  2443 (0.8%) low severe
  38961 (12.2%) high severe
estimating cost of a clock call...
mean is 50.65112 ns (14 iterations)
found 2 outliers among 14 samples (14.3%)
  2 (14.3%) high mild

benchmarking Yitz
collecting 100 samples, 1 iterations each, in estimated 6.537795 s
mean: 67.05025 ms, lb 66.59363 ms, ub 67.62786 ms, ci 0.950
std dev: 2.616184 ms, lb 2.179318 ms, ub 3.100722 ms, ci 0.950

benchmarking Mine
collecting 100 samples, 1 iterations each, in estimated 5.566311 s
mean: 56.91069 ms, lb 56.54846 ms, ub 57.37602 ms, ci 0.950
std dev: 2.100948 ms, lb 1.722233 ms, ub 2.585056 ms, ci 0.950

With n = k = 50:

benchmarking Yitz
collecting 100 samples, 1 iterations each, in estimated 7.366776 s
mean: 75.63603 ms, lb 75.17202 ms, ub 76.20199 ms, ci 0.950
std dev: 2.617970 ms, lb 2.227238 ms, ub 3.040563 ms, ci 0.950

benchmarking Mine
collecting 100 samples, 1 iterations each, in estimated 6.517696 s
mean: 68.00708 ms, lb 67.26089 ms, ub 69.04105 ms, ci 0.950
std dev: 4.463628 ms, lb 3.484169 ms, ub 5.950722 ms, ci 0.950

n = k = 100:

benchmarking Yitz
collecting 100 samples, 1 iterations each, in estimated 9.756303 s
mean: 87.39906 ms, lb 86.90397 ms, ub 87.99232 ms, ci 0.950
std dev: 2.763955 ms, lb 2.382008 ms, ub 3.168909 ms, ci 0.950

benchmarking Mine
collecting 100 samples, 1 iterations each, in estimated 7.505393 s
mean: 77.05801 ms, lb 76.51954 ms, ub 77.74727 ms, ci 0.950
std dev: 3.117756 ms, lb 2.594126 ms, ub 3.769922 ms, ci 0.950
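(For context, here is how mergeAABy might slot into the chunked sort; the wiring and the name sortAlmostBy2 are my own assumptions, since the surrounding benchmark code isn't included in the thread.)

import Data.List (sortBy)
import Data.List.Split (chunksOf)   -- split package

-- sort each m-chunk, then let mergeAABy stitch the sorted chunks together
sortAlmostBy2 :: Int -> (a -> a -> Ordering) -> [a] -> [a]
sortAlmostBy2 m cmp = mergeAABy cmp . map (sortBy cmp) . chunksOf m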