
#10830: maximumBy has a space leak
-------------------------------------+-------------------------------------
        Reporter:  NeilMitchell      |                Owner:
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:  8.2.1
       Component:  Core Libraries    |              Version:  7.10.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |    Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):  Phab:D1205
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by ekmett):

In response to Reid's prodding, I see two decisions to make here:

1.) There are ultimately 3 possible default implementations of
`minimumBy`/`maximumBy`, based on `foldl1`, `foldr1`, or `foldMap`
respectively.

2.) We also need to decide whether we should add `maximumBy` and
`minimumBy` to the class itself.

Having `maximum` and `minimum` in the class can improve the asymptotics of
these operations: e.g. `Set` offers O(log n) `minimum` and `maximum` today,
unlike before. Moreover, in the case of `minimum` and `maximum`, `a` is a
fully knowable type for some containers, so they can even handle some
infinite cases. Having them in the class also let us paper over the
difference between the left fold and the right fold in the old `Foldable`
code for `maximum` and `minimum` on lists, and fix the analogous space
leak there. We didn't go with a right fold, but with a monoidal fold
instead, as it was never asymptotically worse than the right fold we had
in `Foldable` before.

But here the situation is a bit different. Without more knowledge about
the `b` type in the `(a -> b)` fitness function passed in, there is no
scenario in which a `foldr1`-based `maximumBy` can get early termination,
so there is no real justification for the current implementation. It is
leakier than a `foldl1` solution and never better!
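To make the contrast concrete, here is a minimal sketch (my own illustrative code, not base's actual definition) of a `foldl1'`-based `maximumBy` for lists; the accumulator is forced at every step, so it runs in constant space where the `foldr1` version builds a tower of thunks:

```haskell
import Data.List (foldl1')

-- Illustrative sketch: maximumBy built on a strict left fold.  Ties keep
-- the earlier element, matching the "first value with the maximum
-- fitness" semantics discussed below.
maximumByL :: (a -> a -> Ordering) -> [a] -> a
maximumByL cmp = foldl1' pick
  where
    pick best y = case cmp best y of
      LT -> y      -- strictly greater: take the new element
      _  -> best   -- otherwise keep the earlier winner

main :: IO ()
main = print (maximumByL compare [1 .. 1000000 :: Int])  -- prints 1000000
```

Running this over a million-element list completes in constant residency, whereas a `foldr1`-based version blows the stack or leaks space proportionally to the list length.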
Even trying to argue for some theoretical kind of snoc-list container
doesn't pan out: if our chosen semantics is to return the first value that
attains the maximum under the fitness function, you'd have to reassociate
all the way anyway. So at the very least there is no justification for a
`foldr1`-based solution.

Ultimately, `maximumBy` and `minimumBy` are both operations that only make
sense for finite containers. Here we're stuck touching all of the distinct
members of the container at least once, as we don't know what fitness
function the user will supply, or any properties of the output type that
would let us, say, terminate early upon reaching `minBound`/`maxBound`, or
exploit properties of the type, like the laziness of a "lazy Nat" max.

Adding `maximumBy`/`minimumBy` to the class would in theory allow a few
obscure optimizations. The only real thing instances can do is take
advantage of knowing which distinct `a`s they hold and send each through
the scoring function once, without repetition. This would allow some
containers to exploit a monoidal fold and pay proportional to the number
of distinct values in the container rather than the total number of
values. e.g. something like
http://hackage.haskell.org/package/compressed-3.10/docs/Data-Compressed-LZ78.html
could either just skim the tokens rather than do a full decompression, or
use a monoidal fold to get most of that benefit in practice. If we have k
distinct values, a monoidal-fold version of `maximumBy` for containers
that track distinct values could get O(k) rather than O(n). Somewhat
weaker, the `compressed` code above could pay proportional to the size of
the compressed data set, not the full data set.

Basing something on `foldMap` without the ability to redefine it in the
class would effectively result in the bad `foldr`-style stack usage for
lists.
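The "pay per distinct value" idea can be sketched with a toy run-length-encoded list (the `RLE` type and `maximumByRLE` here are mine, only loosely in the spirit of the `compressed` package linked above): each distinct run value is compared once, so a structure holding n elements in k runs costs O(k) comparisons.

```haskell
import Data.List (foldl')

-- Toy run-length-encoded list: each pair is (run length, value).
newtype RLE a = RLE [(Int, a)]

-- Compares each run's value once; a structure with n elements but only
-- k runs costs O(k) calls to cmp rather than O(n).
maximumByRLE :: (a -> a -> Ordering) -> RLE a -> a
maximumByRLE _   (RLE [])            = error "maximumByRLE: empty structure"
maximumByRLE cmp (RLE ((_, x) : xs)) = foldl' pick x (map snd xs)
  where
    pick best y = case cmp best y of
      LT -> y
      _  -> best

main :: IO ()
main = print (maximumByRLE compare (RLE [(100, 3), (5, 9), (7, 1 :: Int)]))
```

Here 112 logical elements are scanned with just two comparisons; a class-level `maximumBy` could expose exactly this kind of shortcut, which the free-standing `foldr1`/`foldl1` defaults cannot.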
And frankly, it'd be nice to have a better story than the "let's favor
monoidal implementations but hack in left folds for lists" pattern that
keeps cropping up over and over!

Off the cuff, we could possibly address the root cause of this and a
couple of other issues by adding a `foldMap'` that uses a `foldl'`
evaluation order on lists with a strict monoidal accumulator, and basing
the (possibly default) definitions of both `maximum` and `maximumBy` on
that combinator. This appeals to me as it could serve to reduce the
pressure to add more members of this sort to the class, while avoiding
more of these stack space regressions. I say "off the cuff" as I haven't
had time to work through all the consequences of such a combinator, and it
isn't the only design in this space; other options include making a single
hideous Frankenstein combinator that takes multiple fold types' worth of
arguments and chooses an implementation accordingly, etc.

A straw-man solution would be to add `foldMap'` to the class, switch
`maximum` and `maximumBy` to invoke it, and have it perform a monoidal
fold in `foldl'` style, either by default or just in the list case. As an
embellishment on that design we could even add `sum'` and `product'`
combinators that went through this path rather than randomly leaking
space. This would need some design work to work through the consequences
and implementation, but seems doable.

**tl;dr** At the very least I agree that the current `foldr1`
implementation of `maximumBy` and `minimumBy` is a mistake.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10830#comment:26
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
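As an appendix, the `foldMap'` straw man from the comment can be sketched as a monoidal fold evaluated in `foldl'` order (the name `foldMapL'` and the `sum'` built on it are mine, to keep the sketch standalone; a combinator of this shape did later ship in `Data.Foldable` as `foldMap'`):

```haskell
import Data.Foldable (foldl')
import Data.Monoid (Sum (..))

-- A monoidal fold in foldl' style: the accumulator is forced to WHNF at
-- every step instead of building a right-nested chain of mappends.
foldMapL' :: (Foldable t, Monoid m) => (a -> m) -> t a -> m
foldMapL' f = foldl' (\acc x -> acc <> f x) mempty

-- A sum' in the spirit of the suggested embellishment: since Sum is a
-- newtype, forcing the accumulator forces the running total, so this
-- runs in constant space on long lists.
sum' :: (Foldable t, Num a) => t a -> a
sum' = getSum . foldMapL' Sum

main :: IO ()
main = print (sum' [1 .. 1000000 :: Int])  -- prints 500000500000
```

Basing `maximum`/`maximumBy` (and `sum'`/`product'`) on such a combinator gives list instances the strict left-fold behavior while still letting other containers override the monoidal fold.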