
Lets look at the actual reductions going on. To make the example easier, I would like to use last instead of your complicated until. It shouldn't make a difference.
[ winds up doing twice as much work ] This was also my intuition. I had a function that built up a large output list by generating chunks and ++ them onto the output (e.g. basically concatMap). My intuition was that this would get gradually slower because each (++) implied a copy of the previous part of the list. So if I have another function consuming the eventual output it will get slower and slower because demanding an element will reduce the (++)s for each chunk, and copy the element 1000 times if I have appended 1000 chunks by now. So I switched to using DList, which has O(1) append. Then I ran timing on a list that wound up adding up a million or so chunks... and both versions were exactly the same speed (and -O2 gave both an equally big speed boost). In fact, when I try with a toy example, the DList using one is much slower than the concatMap using one, which I don't quite understand. Shouldn't concatMap be quite inefficient? The program below gives me this output: 15000000 45.7416 15000000 110.2112 -O2 brings both implementations to 45. Interestingly, if I call 't1' from ghci, it sucks up 1gb of memory and slows the system to a crawl until I kill it.. this is *after* it's computed a result and given me my prompt back... even if its somehow keeping the whole list around in some ghci variable (where?), isn't 1gb+ a lot even for 15m boxed Integers? And why does it continue to grow after the function has completed? The compiled version doesn't have this problem. import System.CPUTime import qualified Data.DList as DList dconcat_map f xs = DList.toList (DList.concat (map (DList.fromList . f) xs)) mkchunk n = [n, n*2, n*3] main = do t1 t2 t1 = do let a = concatMap mkchunk [0..5000000] t <- getCPUTime print (last a) t2 <- getCPUTime print (fromIntegral (t2 - t) / fromIntegral cpuTimePrecision) t2 = do let a = dconcat_map mkchunk [0..5000000] t <- getCPUTime print (last a) t2 <- getCPUTime print (fromIntegral (t2 - t) / fromIntegral cpuTimePrecision)