
#10830: maximumBy has a space leak
-------------------------------------+-------------------------------------
        Reporter:  NeilMitchell      |                Owner:
            Type:  bug               |               Status:  new
        Priority:  high              |            Milestone:  8.2.1
       Component:  Core Libraries    |              Version:  7.10.2
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |    Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:                    |  Differential Rev(s):  Phab:D1205
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by ekmett):

In response to Reid's prodding, I see two decisions to make here:

1.) There are ultimately 3 possible default implementations of
`minimumBy`/`maximumBy`, based on `foldl1`, `foldr1`, or `foldMap`
respectively.

2.) We also need to decide whether we should add `maximumBy` and
`minimumBy` to the class itself.

Having `maximum` and `minimum` in the class can improve the asymptotics of
these operations: e.g. `Set` offers O(log n) `minimum` and `maximum` today,
unlike before. Moreover, in the case of `minimum` and `maximum`, `a` is a
fully knowable type for some containers, so they can even handle some
infinite cases. Having them in the class also let us paper over the
difference between the left fold and the right fold in the old `Foldable`
code for `maximum` and `minimum` on lists, and fix the analogous space
leak there. We didn't go with a right fold, but with a monoidal fold
instead, as it was never asymptotically worse than the right fold we had
in `Foldable` before.

But here the situation is a bit different. Without more knowledge about
the `b` type in the `(a -> b)` fitness function passed in, there is no
scenario in which a `foldr1`-based `maximumBy` can get early termination,
so there is no real justification for the current implementation. It is
leakier than a `foldl1` solution and never better!
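To make the contrast concrete, here is a minimal sketch (my own illustrative code, not base's actual definition) of a `foldl1'`-based `maximumBy` for lists; the accumulator is forced at every step, so it runs in constant space where the `foldr1` version builds a tower of thunks:

```haskell
import Data.List (foldl1')

-- Illustrative sketch: maximumBy built on a strict left fold.  Ties keep
-- the earlier element, matching the "first value with the maximum
-- fitness" semantics discussed below.
maximumByL :: (a -> a -> Ordering) -> [a] -> a
maximumByL cmp = foldl1' pick
  where
    pick best y = case cmp best y of
      LT -> y      -- strictly greater: take the new element
      _  -> best   -- otherwise keep the earlier winner

main :: IO ()
main = print (maximumByL compare [1 .. 1000000 :: Int])  -- prints 1000000
```

Running this over a million-element list completes in constant residency, whereas a `foldr1`-based version blows the stack or leaks space proportionally to the list length.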
Even trying to argue for some theoretical kind of snoc-list container
doesn't pan out: if our chosen semantics is to return the first value that
attains the maximum under the fitness function, you'd have to reassociate
all the way anyway. So at the very least there is no justification for a
`foldr1`-based solution.

Ultimately, `maximumBy` and `minimumBy` are both operations that only make
sense for finite containers. Here we're stuck touching all of the distinct
members of the container at least once, as we don't know what fitness
function the user will supply, or any properties of the output type that
would let us, say, terminate early upon reaching `minBound`/`maxBound`, or
exploit properties of the type, like the laziness of a "lazy Nat" max.

Adding `maximumBy`/`minimumBy` to the class would in theory allow a few
obscure optimizations. The only real thing instances can do is take
advantage of knowing which distinct `a`s they hold and send each through
the scoring function once, without repetition. This would allow some
containers to exploit a monoidal fold and pay proportional to the number
of distinct values in the container rather than the total number of
values. e.g. something like
http://hackage.haskell.org/package/compressed-3.10/docs/Data-Compressed-LZ78.html
could either just skim the tokens rather than do a full decompression, or
use a monoidal fold to get most of that benefit in practice. If we have k
distinct values, a monoidal-fold version of `maximumBy` for containers
that track distinct values could get O(k) rather than O(n). Somewhat
weaker, the `compressed` code above could pay proportional to the size of
the compressed data set, not the full data set.

Basing something on `foldMap` without the ability to redefine it in the
class would effectively result in the bad `foldr`-style stack usage for
lists.
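The "pay per distinct value" idea can be sketched with a toy run-length-encoded list (the `RLE` type and `maximumByRLE` here are mine, only loosely in the spirit of the `compressed` package linked above): each distinct run value is compared once, so a structure holding n elements in k runs costs O(k) comparisons.

```haskell
import Data.List (foldl')

-- Toy run-length-encoded list: each pair is (run length, value).
newtype RLE a = RLE [(Int, a)]

-- Compares each run's value once; a structure with n elements but only
-- k runs costs O(k) calls to cmp rather than O(n).
maximumByRLE :: (a -> a -> Ordering) -> RLE a -> a
maximumByRLE _   (RLE [])            = error "maximumByRLE: empty structure"
maximumByRLE cmp (RLE ((_, x) : xs)) = foldl' pick x (map snd xs)
  where
    pick best y = case cmp best y of
      LT -> y
      _  -> best

main :: IO ()
main = print (maximumByRLE compare (RLE [(100, 3), (5, 9), (7, 1 :: Int)]))
```

Here 112 logical elements are scanned with just two comparisons; a class-level `maximumBy` could expose exactly this kind of shortcut, which the free-standing `foldr1`/`foldl1` defaults cannot.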
And frankly, it'd be nice to have a better story than the "let's favor
monoidal implementations but hack in left folds for lists" pattern that
keeps cropping up over and over!

Off the cuff, we could possibly address the root cause of this and a
couple of other issues by adding a `foldMap'` that uses a `foldl'`
evaluation order on lists with a strict monoidal accumulator, and basing
the (possibly default) definitions of both `maximum` and `maximumBy` on
that combinator. This appeals to me as it could serve to reduce the
pressure to add more members of this sort to the class, while avoiding
more of these stack space regressions. I say "off the cuff" as I haven't
had time to work through all the consequences of such a combinator, and it
isn't the only design in this space; other options include making a single
hideous Frankenstein combinator that takes multiple fold types' worth of
arguments and chooses an implementation accordingly, etc.

A straw-man solution would be to add `foldMap'` to the class, switch
`maximum` and `maximumBy` to invoke it, and have it perform a monoidal
fold in `foldl'` style, either by default or just in the list case. As an
embellishment on that design we could even add `sum'` and `product'`
combinators that went through this path rather than randomly leaking
space. This would need some design work to work through the consequences
and implementation, but seems doable.

**tl;dr** At the very least I agree that the current `foldr1`
implementation of `maximumBy` and `minimumBy` is a mistake.

--
Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/10830#comment:26
GHC http://www.haskell.org/ghc/
The Glasgow Haskell Compiler
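As an appendix, the `foldMap'` straw man from the comment can be sketched as a monoidal fold evaluated in `foldl'` order (the name `foldMapL'` and the `sum'` built on it are mine, to keep the sketch standalone; a combinator of this shape did later ship in `Data.Foldable` as `foldMap'`):

```haskell
import Data.Foldable (foldl')
import Data.Monoid (Sum (..))

-- A monoidal fold in foldl' style: the accumulator is forced to WHNF at
-- every step instead of building a right-nested chain of mappends.
foldMapL' :: (Foldable t, Monoid m) => (a -> m) -> t a -> m
foldMapL' f = foldl' (\acc x -> acc <> f x) mempty

-- A sum' in the spirit of the suggested embellishment: since Sum is a
-- newtype, forcing the accumulator forces the running total, so this
-- runs in constant space on long lists.
sum' :: (Foldable t, Num a) => t a -> a
sum' = getSum . foldMapL' Sum

main :: IO ()
main = print (sum' [1 .. 1000000 :: Int])  -- prints 500000500000
```

Basing `maximum`/`maximumBy` (and `sum'`/`product'`) on such a combinator gives list instances the strict left-fold behavior while still letting other containers override the monoidal fold.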