Unnecessarily strict implementations

Hi,

there is a new ticket that Data.List.intersperse is not as non-strict as possible (http://hackage.haskell.org/trac/ghc/ticket/4282). I have observed some other functions which are unnecessarily strict, and it might be advantageous to change their definitions as well.

I think it is known that the current implementations of inits and tails are too strict. At least I think I have once read a post to haskell-cafe about this topic.

Furthermore, intersect is too strict. We have intersect _|_ [] = _|_, and the current implementation is linear in the size of xs if we evaluate intersect xs []. I think simply adding a rule intersect _ [] = [] to the current implementation solves this issue.

The implication (<=) :: Bool -> Bool -> Bool is too strict as well. We have False <= _|_ = _|_ as well as _|_ <= True = _|_, while one of these cases could yield True. The problem is that (<=) is defined by means of compare. This effect shows up for all data types with a least element if the default implementation of (<=) by means of compare is used.

Furthermore, there are a couple of functions which are too strict but whose minimally strict implementations do not provide benefits. For example, reverse is too strict, as has already been observed by Olaf Chitil (http://www.cs.kent.ac.uk/people/staff/oc/Talks/ifl2006NonStrictProgramming.pdf).

Cheers, Jan

On Thursday 02 September 2010 00:05:03, Jan Christiansen wrote:
Hi,
there is a new ticket that Data.List.intersperse is not as non-strict as possible (http://hackage.haskell.org/trac/ghc/ticket/4282).
It's not that it's not as non-strict as possible per se. (Sorry, had to :) It's that intersperse's current definition (in GHC at least) can cause a space leak. In this case, making the function less strict can cure it, in other cases, more strictness might be the solution.
I have observed some other functions which are unnecessarily strict and it might be advantageous to change their definitions as well.
I think it is known that the current implementations of inits and tails are too strict. At least I think I have once read a post to haskell-cafe about this topic.
It's been mentioned. I don't see any drawbacks to making them less strict, so I'd support that.
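For concreteness, here is a minimal sketch of lazier versions (hypothetical names inits' and tails', not the Data.List source); each produces its outer list constructor before inspecting the argument, so that e.g. inits' _|_ = [] : _|_ instead of _|_:

-- inits' yields its first element, [], before forcing xs at all.
inits' :: [a] -> [[a]]
inits' xs = [] : case xs of
                   []     -> []
                   (y:ys) -> map (y:) (inits' ys)

-- tails' likewise yields the outer (:) before matching on xs.
tails' :: [a] -> [[a]]
tails' xs = xs : case xs of
                   []     -> []
                   (_:ys) -> tails' ys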
Furthermore intersect is too strict. We have intersect _|_ [] = _|_
On the other hand, we currently have intersect [] _|_ = [] and one of intersect _|_ [] and intersect [] _|_ must give _|_. Which one is a matter of choice.
and the current implementation is linear in the size of xs if we evaluate intersect xs [].
Yes, that's bad.
I think simply adding a rule intersect _ [] = [] to the current implementation solves this issue.
And before that, the rule intersect [] _ = [] if the current behaviour of intersect [] should be retained.
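Concretely, a sketch of the patched function (intersect' is a hypothetical name, and the last equation assumes the usual list-comprehension body):

-- The two extra equations give intersect' [] _|_ = [] and
-- intersect' (_|_:_|_) [] = [], and make intersect' xs []
-- constant time instead of linear in xs.
intersect' :: Eq a => [a] -> [a] -> [a]
intersect' [] _  = []
intersect' _  [] = []
intersect' xs ys = [x | x <- xs, x `elem` ys]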
The implication (<=) :: Bool -> Bool -> Bool is too strict as well. We have False <= _|_ = _|_ as well as _|_ <= True = _|_ while one of these cases could yield True.
I'm not convinced either should (nor that they shouldn't).
The problem is that (<=) is defined by means of compare. This effect shows up for all data types with a least element if the default implementation of (<=) by means of compare is used.
Furthermore there are a couple of functions which are too strict but whose minimally strict implementations do not provide benefits. For example, reverse is too strict as has already been observed by Olaf Chitil (http://www.cs.kent.ac.uk/people/staff/oc/Talks/ifl2006NonStrictProgramming.pdf).
The last slide lists among the problems "proposes undesirably inefficient functions (reverse)". I wouldn't equate 'not minimally strict' with 'too strict'. Minimal strictness also can have negative effects, one must look at each case individually.
Cheers, Jan

Hi, On 02.09.2010, at 01:35, Daniel Fischer wrote:
It's not that it's not as non-strict as possible per se. (Sorry, had to :) It's that intersperse's current definition (in GHC at least) can cause a space leak. In this case, making the function less strict can cure it, in other cases, more strictness might be the solution.
I would be very happy if you would share this example with me. For quite a while now, I have been looking for an example where the current implementation of intersperse or inits causes a space leak.
On the other hand, we currently have
intersect [] _|_ = []
and one of intersect _|_ [] and intersect [] _|_ must give _|_. Which one is a matter of choice.
I am sorry for not being precise. You are right. But right now we have intersect xs [] = _|_ for every list xs terminated by _|_. But it suffices to evaluate xs to head normal form to decide that the result should be []. That is, we could have

intersect [] _|_ = [] and intersect (_|_:_|_) [] = []

or

intersect [] (_|_:_|_) = [] and intersect _|_ [] = []

and the current implementation satisfies neither.
And before that, the rule intersect [] _ = [] if the current behaviour of intersect [] should be retained.
That's a deal.
The implication (<=) :: Bool -> Bool -> Bool is too strict as well. We have False <= _|_ = _|_ as well as _|_ <= True = _|_ while one of these cases could yield True.
I'm not convinced either should (nor that they shouldn't).
I think this is a matter of elegance rather than a matter of efficiency. In the same way as I prefer

False && _|_ = False

over

False && _|_ = _|_

I prefer

False <= _|_ = True

over

False <= _|_ = _|_
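For concreteness, the minimally strict version would match on the first argument only (a sketch, not the Prelude definition; leBool is a hypothetical name):

-- leBool False _|_ = True without forcing the second argument;
-- leBool _|_ True stays _|_, since only one of the two cases
-- can be non-strict.
leBool :: Bool -> Bool -> Bool
leBool False _ = True
leBool True  b = b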
The last slide lists among the problems "proposes undesirably inefficient functions (reverse)". I wouldn't equate 'not minimally strict' with 'too strict'. Minimal strictness also can have negative effects, one must look at each case individually.
I second this but in my opinion the minimally strict implementation should be the default if there is no reason against it. Cheers, Jan

On Thursday 02 September 2010 09:25:59, Jan Christiansen wrote:
Hi,
On 02.09.2010, at 01:35, Daniel Fischer wrote:
It's not that it's not as non-strict as possible per se. (Sorry, had to :) It's that intersperse's current definition (in GHC at least) can cause a space leak. In this case, making the function less strict can cure it, in other cases, more strictness might be the solution.
I would be very happy if you would share this example with me. For quite a while now, I have been looking for an example where the current implementation of intersperse or inits causes a space leak.
I don't see how the current implementation of inits or tails could cause a space leak that the lazier versions wouldn't, so you'd have to wait longer for such an example. For intersperse,

$ cabal update && cabal install stringsearch

You need the new version 0.3.1; Data.ByteString.Lazy.Search[.DFA].splitXXX had their own space leak in 0.3.0 [caused by too much laziness]. Then

===========================================
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import qualified Data.ByteString.Char8 as C
import Data.ByteString.Lazy.Search (split)

main :: IO ()
main = do
    (file : pat : sub : _) <- getArgs
    let !spat = C.pack pat
        !ssub = L.fromChunks [C.pack sub]
        work  = ical ssub . split spat
    L.readFile file >>= L.putStrLn . L.take 100 . work

ical :: L.ByteString -> [L.ByteString] -> L.ByteString
ical new = L.concat . intersperse new

intersperse :: a -> [a] -> [a]
intersperse sep []     = []
intersperse sep (x:xs) = x : go xs
  where
    go []     = []
    go (y:ys) = sep : y : go ys
============================================

has no space leak; if you replace the local intersperse with Data.List.intersperse (equivalent: ical = L.intercalate), you have a space leak. To expose the leak, take a sufficiently large file (say 10MB or larger) and replace a pattern that does not occur in the file or occurs late in the file.

$ ./noleak file pat sub

runs fast in small memory, while

$ ./leak file pat sub

takes a little to run and keeps the entire file until the first occurrence of pat in memory.

Note that the above implementation of intersperse has different semantics from Data.List.intersperse:

Data.List.intersperse ',' ('a':_|_)     = _|_
intersperse           ',' ('a':_|_)     = 'a' : _|_
Data.List.intersperse ',' ('a':'b':_|_) = 'a' : ',' : _|_
intersperse           ',' ('a':'b':_|_) = 'a' : ',' : 'b' : _|_

etc.
On the other hand, we currently have
intersect [] _|_ = []
and one of intersect _|_ [] and intersect [] _|_ must give _|_. Which one is a matter of choice.
I am sorry for not being precise. You are right. But right now we have intersect xs [] = _|_ for every list xs terminated by _|_. But it suffices to evaluate xs to head normal form to decide that the result should be []. That is, we could have
intersect [] _|_ = [] and intersect (_|_:_|_) [] = []
or
intersect [] (_|_:_|_) = [] and intersect _|_ [] = []
and the current implementation satisfies neither.
Right. So the question is, has the current implementation advantages over either of these? (I don't see any.) If not, which of these two behaviours is preferable?
And before that, the rule intersect [] _ = [] if the current behaviour of intersect [] should be retained.
That's a deal.
The implication (<=) :: Bool -> Bool -> Bool is too strict as well. We have False <= _|_ = _|_ as well as _|_ <= True = _|_ while one of these cases could yield True.
I'm not convinced either should (nor that they shouldn't).
I think this is a matter of elegance rather than a matter of efficiency. In the same way as I prefer
False && _|_ = False
over
False && _|_ = _|_
I prefer
False <= _|_ = True
over
False <= _|_ = _|_
I have mixed feelings about those. Part of me dislikes breaking the symmetry between (<=), (==) and compare.
The last slide lists among the problems "proposes undesirably inefficient functions (reverse)". I wouldn't equate 'not minimally strict' with 'too strict'. Minimal strictness also can have negative effects, one must look at each case individually.
I second this but in my opinion the minimally strict implementation should be the default if there is no reason against it.
Agreed - except I have to object to your use of the definite article, some functions have several minimally strict implementations. (Ambiguity of minimal strictness *can* be a reason for a stricter choice, though probably rarely.)
Cheers, Jan
Cheers, Daniel

Hi, On 02.09.2010, at 13:41, Daniel Fischer wrote:
takes a little to run and keeps the entire file until the first occurrence of pat in memory.
first of all, thanks very much for the detailed instructions.

I have rewritten the example slightly, using Strings instead of ByteStrings. Replacing all occurrences of 'ä' by "ä" in the collected works of Shakespeare ; ) has a maximum memory usage of around 65MB with the current implementation of intersperse, while it has a maximum memory usage of only around 5KB with the less strict implementation.

replaceBy :: Eq alpha => alpha -> [alpha] -> [alpha] -> [alpha]
replaceBy x sep = concat . intersperse sep . splitBy (==x)

splitBy :: (alpha -> Bool) -> [alpha] -> [[alpha]]
splitBy _ [] = []
splitBy p xs =
  case break p xs of
    (l, ys) -> l : case ys of
                     []     -> []
                     (_:zs) -> splitBy p zs

This function only runs in constant space if I use the strict pattern matching on the result of break. If I use the following implementation, I observe linear memory usage instead.

splitBy' :: (alpha -> Bool) -> [alpha] -> [[alpha]]
splitBy' _ [] = []
splitBy' p xs = l : case ys of
                      []     -> []
                      (_:zs) -> splitBy' p zs
  where
    (l, ys) = break p xs

I think this is due to the Wadler tuple space leak. The same would apply to the current implementation of lines. I wonder whether an implementation of lines analogous to splitBy has any disadvantages.
That is, we could have
intersect [] _|_ = [] and intersect (_|_:_|_) [] = []
or
intersect [] (_|_:_|_) = [] and intersect _|_ [] = []
and the current implementation satisfies neither.
Right. So the question is, has the current implementation advantages over either of these? (I don't see any.) If not, which of these two behaviours is preferable?
I'd prefer the first one as it is in line with the left to right pattern matching of Haskell.
I have mixed feelings about those. Part of me dislikes breaking the symmetry between (<=), (==) and compare.
I think you should not blame (<=) for the existence of a function that yields a superset of the information that (<=) yields ; ) Cheers, Jan

On Friday 03 September 2010 00:22:14, Jan Christiansen wrote:
Hi,
On 02.09.2010, at 13:41, Daniel Fischer wrote:
takes a little to run and keeps the entire file until the first occurrence of pat in memory.
first of all thanks very much for the detailed instructions.
I have rewritten the example slightly using Strings instead of Bytestrings. Replacing all occurrences of 'ä' by "ä" in the collected works of Shakespeare ; ) has a maximum memory usage of around 65MB with the current implementation of intersperse while it has a maximum memory usage of only around 5KB with the less strict implementation.
No surprise, there aren't many 'ä's in Shakespeare's works, are there?
I think this is due to the Wadler tuple space leak.
Yup.
The same would apply to the current implementation of lines. I wonder whether an implementation of lines analogous to splitBy has any disadvantages.
Hardly, but yes. 'break' constructs a pair pretty immediately, so

case break p (x:xs) of
  (pre, post) -> pre : case post of
                         []     -> []
                         (y:ys) -> stuff

can only do harm if (p x) diverges, but then it does. Currently,

lines (_|_ : rest) = _|_ : _|_

while with the break, we'd have

lines' (_|_ : rest) = _|_

On the other hand, the current implementation of lines does not seem to suffer from Wadler's tuple space leak (according to one test I made), so I'd stick with the current implementation for the time being.
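For reference, the break-based variant under discussion, spelled out (lines' is a hypothetical name, not a proposed Prelude change):

-- The strict match on break's pair gives lines' (_|_:rest) = _|_,
-- but avoids binding (l, rest) lazily, the shape that invites
-- Wadler's tuple space leak.
lines' :: String -> [String]
lines' "" = []
lines' s  = case break (== '\n') s of
              (l, rest) -> l : case rest of
                                 []        -> []
                                 (_:rest') -> lines' rest'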
That is, we could have
intersect [] _|_ = [] and intersect (_|_:_|_) [] = []
or
intersect [] (_|_:_|_) = [] and intersect _|_ [] = []
and the current implementation satisfies neither.
Right. So the question is, has the current implementation advantages over either of these? (I don't see any.) If not, which of these two behaviours is preferable?
I'd prefer the first one as it is in line with the left to right pattern matching of Haskell.
Me too.

Jan Christiansen schrieb:
Hi,
On 02.09.2010, at 01:35, Daniel Fischer wrote:
It's not that it's not as non-strict as possible per se. (Sorry, had to :) It's that intersperse's current definition (in GHC at least) can cause a space leak. In this case, making the function less strict can cure it, in other cases, more strictness might be the solution.
I would be very happy if you would share this example with me. For quite a while now, I have been looking for an example where the current implementation of intersperse or inits causes a space leak.
I'm also annoyed by several space leaks. These are implementation artifacts, but they seem to be hard to avoid with the existing implementations. You cannot reason about space leaks, right? That's a pity.
I think this is a matter of elegance rather than a matter of efficiency. In the same way as I prefer
False && _|_ = False
over
False && _|_ = _|_
I think this one is justified by the law

filter p (filter q xs) = filter (\x -> q x && p x) xs

which only holds with the first definition of (&&). E.g.

q x = x /= 0
p x = div n x
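A well-typed, checkable version of that example, as a sketch (n is an assumed parameter, and p is given a Bool result):

-- q rules out 0 before p divides by it; the fused right-hand side
-- never evaluates p 0, but only because (&&) returns False without
-- forcing its second argument.
filterLaw :: Int -> [Int] -> Bool
filterLaw n xs = filter p (filter q xs) == filter (\x -> q x && p x) xs
  where
    q x = x /= 0
    p x = n `div` x > 0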

On Thu, Sep 2, 2010 at 3:25 AM, Jan Christiansen wrote:
I prefer
False <= _|_ = True
Sorry to go a bit off topic, but I find it funny that I never really noticed you could perform less-than or greater-than comparisons on Bool values. What's the semantic reasoning behind allowing relative comparisons on booleans? In what context would you use it? It seems to me a throwback to C's somewhat arbitrary assumption that False=0 and True=1. Steve

On Thu, 2 Sep 2010, Stephen Sinclair wrote:
On Thu, Sep 2, 2010 at 3:25 AM, Jan Christiansen wrote:
I prefer
False <= _|_ = True
Sorry to go a bit off topic, but I find it funny that I never really noticed you could perform less-than or greater-than comparisons on Bool values. What's the semantic reasoning behind allowing relative comparisons on booleans? In what context would you use it?
You might use or abuse (<=) for implication; however, the arrow points in the wrong direction.
It seems to me a throwback to C's somewhat arbitrary assumption that False=0 and True=1.
My feeling is similar. The Ord instance for Bool might be justified for using Bools as keys of Data.Map; however, you could also consider using Ord for Map an abuse.

On Thursday 02 September 2010 18:25:11, Henning Thielemann wrote:
The Ord instance for Bool might be justified for using Bools as keys of Data.Map, however you can also consider using Ord for Map as abuse.
Why would one consider using Ord for Map an abuse? A kludge, for performance reasons, but an abuse?

On Thu, 2 Sep 2010 19:30:17 +0200, Daniel Fischer wrote:
Why would one consider using Ord for Map an abuse? A kludge, for performance reasons, but an abuse?
Because it forces one to declare Ord instances for types which have no natural ordering. It is useful to *not* have such instances, in order to catch programming errors. A separate type class for types which can be ordered in some (possibly arbitrary) way, for use in Data.Map, would remedy this. Regards, Arie

On 3 September 2010 04:57, Arie Peterson wrote:
On Thu, 2 Sep 2010 19:30:17 +0200, Daniel Fischer wrote:
Why would one consider using Ord for Map an abuse? A kludge, for performance reasons, but an abuse?
Because it forces one to declare Ord instances for types which have no natural ordering. It is useful to *not* have such instances, in order to catch programming errors.
What precisely do you mean by natural ordering?
A separate type class for types which can be ordered in some (possibly arbitrary) way, for use in Data.Map, would remedy this.
Sure... except that the way Data.Map and Data.Set are implemented is by a binary tree, and you typically want some kind of ordering for those. How is a type class that represents arbitrary ordering any different from what we already have? The notation might not be the best if you consider the ordering to be arbitrary, but what else would you use? "isArbitrarilyBefore :: (ArbitraryOrdering a) => a -> a -> Bool"?

On Fri, 3 Sep 2010 12:02:22 +1000, Ivan Lazar Miljenovic wrote:
What precisely do you mean by natural ordering?
An ordering that has relevant meaning for the information represented by the datatype. Ideally, it should also be alone in being the order anyone would expect this datatype to have (because instances are global). (If there is not a single most obvious ordering, then don't define an instance for Ord – chances are someone will use it with the wrong expectations. We may use newtypes instead.)
A separate type class for types which can be ordered in some (possibly arbitrary) way, for use in Data.Map, would remedy this.
Sure... except that the way Data.Map and Data.Set are implemented is by a binary tree, and you typically want some kind of ordering for those.
Yes, so you would use the (possibly arbitrary) ordering provided by the new class.
How is a type class that represents arbitrary ordering any different from what we already have?
The important thing is that there are *two* classes: one for "natural", semantic orderings, and one for arbitrary orderings. Henning's example of the Gaussian integers is excellent. We would be able to have Sets of Gaussians, and still catch mistaken uses of '<' on them.

There is a further advantage to this separation. Some types may have a "natural", obvious ordering that is hard to compute, while their representation also allows a fast comparison that has nothing to do with the semantics of the type (is "arbitrary"). Further, if you change the representation, you can also change to a new arbitrary ordering which is more efficient in this new situation, without ever touching the semantic ordering, so users of the type need not know.
The notation might not be the best if you consider the ordering to be arbitrary, but what else would you use? "isArbitrarilyBefore :: (ArbitraryOrdering a) => a -> a -> Bool" ?
Yes, something like that. If you use it a lot, say in the implementation of Data.Set, you can make a nice local operator alias (say, ≼ or ⊑). Regards, Arie
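As a sketch of such a split (all names hypothetical; StoreOrd stands in for the class of arbitrary container orderings):

-- A class for arbitrary-but-consistent total orders, for use by
-- containers, kept separate from any semantic Ord instance.
class Eq a => StoreOrd a where
  storeCompare :: a -> a -> Ordering

-- Gaussian integers have no natural total order, but admit an
-- obvious representation order for Set/Map purposes.
data Gaussian = Gaussian Integer Integer
  deriving (Eq, Show)

instance StoreOrd Gaussian where
  storeCompare (Gaussian a b) (Gaussian c d) = compare (a, b) (c, d)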

On 02/09/10 17:10, Stephen Sinclair wrote:
On Thu, Sep 2, 2010 at 3:25 AM, Jan Christiansen wrote:
I prefer
False <= _|_ = True
Sorry to go a bit off topic, but I find it funny that I never really noticed you could perform less-than or greater-than comparisons on Bool values. What's the semantic reasoning behind allowing relative comparisons on booleans? In what context would you use it? It seems to me a throwback to C's somewhat arbitrary assumption that False=0 and True=1.
Comparison on Bool itself is probably not particularly useful. But it is often useful if the Bool is part of a larger data structure. For example, I might want to have Set (String, Bool); without the Ord instance on Bool I couldn't do this. Similarly, you couldn't derive Ord on your data types that have Bool in them without the Ord Bool instance. Thanks, Neil.
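For instance, a minimal sketch using Data.Set (the values are made up):

import qualified Data.Set as Set

-- This Set relies on the derived Ord for (String, Bool), which in
-- turn needs the Ord instance of Bool.
seen :: Set.Set (String, Bool)
seen = Set.fromList [("x", True), ("x", False), ("y", True)]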

On 10-09-02 12:10 PM, Stephen Sinclair wrote:
Sorry to go a bit off topic, but I find it funny that I never really noticed you could perform less-than or greater-than comparisons on Bool values. What's the semantic reasoning behind allowing relative comparisons on booleans? In what context would you use it? It seems to me a throwback to C's somewhat arbitrary assumption that False=0 and True=1.
A boolean algebra is also a lattice. The lattice order reflects implication. If the boolean algebra is furthermore 2-valued, the order is total too. The only possible dispute among people is over False <= True vs True <= False, i.e., should "<=" be "implies" or "implied by". But it is always correct to pick one and stick with it.

This is useful wherever implication is useful. Suppose you wrote two functions and you now want to prove or quickcheck-test that they are extensionally equal, except you only care about arguments satisfying an assumption. You are checking

(p x) implies (f x == g x)

If you follow the Haskell choice, it is directly

(p x) <= (f x == g x)

Implication is pervasively useful, but people pervasively underappreciate it; they avoid thinking in terms of implication when the essence is implication. For example, some people say: you can use "not (p x) || f x == g x". To that, I reply firstly: you can use solely NAND too, why don't you try that. And I reply secondly: implication is more direct, one single operator for the pervasive assume-guarantee paradigm, minimum boilerplate. There is no need to bring in negation.
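A small illustration of that style (a made-up property; here the consequent holds whenever the antecedent does, so the implication is always True):

-- Checking "p x implies f x == g x" via (<=): abs x == x holds
-- whenever x >= 0, so the property below is True for all x.
prop :: Int -> Bool
prop x = (x >= 0) <= (abs x == x)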

On Sep 2, 2010, at 9:10 AM, Stephen Sinclair wrote:
Sorry to go a bit off topic, but I find it funny that I never really noticed you could perform less-than or greater-than comparisons on Bool values. What's the semantic reasoning behind allowing relative comparisons on booleans? In what context would you use it?
The Boolean values form a Boolean lattice. That's reason enough.
It seems to me a throwback to C's somewhat arbitrary assumption that False=0 and True=1.
That's not arbitrary at all. 0 and 1 are very special numbers, because of the roles they play in addition and multiplication. They "absorb" and "identify" things. http://en.wikipedia.org/wiki/Boolean_algebra_(structure)
participants (10)

- Albert Y. C. Lai
- Alexander Solla
- Arie Peterson
- Daniel Fischer
- Henning Thielemann
- Henning Thielemann
- Ivan Lazar Miljenovic
- Jan Christiansen
- Neil Brown
- Stephen Sinclair