
Hello Folks,

I was looking at the definitions of nub (and hence nubBy) recently in connection with a trac ticket and realised that this is O(n^2) complexity! Ouch! I was going to say I sure hope nobody uses this in real programs, but I thought I'd better check first, and I see that darcs seems to use it in a few places. Hmm.. How did we ever get stuff like this in the standard libs? I can only imagine this is a relic from the days when Haskell was used for research or pedagogical purposes only (not for real programs).

Seeing as practically all Eq instances are also Ord instances, at the very least we could have O(n*(log n)) definitions for..

  nub :: Ord a => [a] -> [a]
  nub = nubBy compare

  nubBy :: (a -> a -> Ordering) -> [a] -> [a]
  nubBy cmp xs = -- something using an AVL tree perhaps.

..or even better using the generalised trie stuff Jamie Brandon has been working on. Of course I'm not actually advocating this, as it's probably bad policy to have a simple standard lib dependent on any complex non-standard lib. But if it just isn't possible to implement some functions with reasonable efficiency using simple definitions only, then I think they really shouldn't be there at all. Instead we should make users "roll their own" using whatever complex non-standard lib they want.

So could nub and nubBy be deprecated and an appropriate health warning added to the Haddock? In the mean time I think I'll put appropriate alternative definitions in the next AVL release and ask Jamie Brandon to consider doing the same in his generalised tries lib (GMap).

Also, looking at a few other functions there, like union(By) and intersect(By), makes me quite suspicious. Maybe we need a thorough performance audit to get rid of implementations that are so insanely inefficient they should *never* be used.

Regards
--
Adrian Hey

Hi
I was looking at the definitions of nub (and hence nubBy) recently in connection with a trac ticket and realised that this is O(n^2) complexity! Ouch!
Complexity theory plus the Eq context makes that inevitable. You can't have nub over Eq in anything less than O(n^2).
I was going to say I sure hope nobody uses this in real programs, but I thought I'd better check first and I see that darcs seems to use it in a few places. Hmm..
I use it all the time, it's dead handy. There are 12 instances in Hoogle, for example. If profiling later shows it to be a problem, I'd fix it - but I can't ever actually remember that being the case. Premature optimisation is the root of all evil.
Seeing as practically all Eq instances are also Ord instances, at the very least we could have O(n*(log n)) definitions for ..
nub :: Ord a => [a] -> [a] nub = nubBy compare
nubBy :: (a -> a -> Ordering) -> [a] -> [a] nubBy cmp xs = -- something using an AVL tree perhaps.
..or even better using the generalised trie stuff Jamie Brandon has been working on.
Having nubOrd is a great idea, I even proposed it previously, but people disagreed with me. Propose it and add it by all means.
So could nub and nubBy be deprecated and an appropriate health warning added to the Haddock?
No. They should say O(n^2) in the haddock documentation, but O(n^2) /= useless. Thanks Neil

Neil Mitchell wrote:
Complexity theory plus the Eq context makes that inevitable. You can't have nub over Eq in anything less than O(n^2).
I was going to say I sure hope nobody uses this in real programs, but I thought I'd better check first and I see that darcs seems to use it in a few places. Hmm..
nub :: Ord a => [a] -> [a] nub = nubBy compare
Having nubOrd is a great idea, I even proposed it previously, but people disagreed with me. Propose it and add it by all means.
The name is... well, pessimal might be a bit strong, but few programmers would think to look for something called "nub". Personally, when I first looked for it I expected uniq or unique (because that's what the unix utility that does the same thing is called). Distinct (from SQL) is another name that occurred to me, but never nub... it's not even a synonym for unique: http://thesaurus.reference.com/browse/unique

See the redefinition of nub as uniq here (which I assume is because John didn't know about nub): http://hackage.haskell.org/packages/archive/MissingH/1.0.0/doc/html/Data-List-Utils.html

The folklore (such as it is) for uniq is that it is trivially defined like so (for lists):
uniq = map head . group . sort
and so perhaps is not worthy of library inclusion? BTW, is this a suitably performant definition, or would we benefit from a lower-level implementation?

Alistair

Hi
The folklore (such as it is) for uniq is that it is trivially defined like so (for lists):
uniq = map head . group . sort
and so perhaps is not worthy of library inclusion? BTW, is this a suitably performant definition, or would we benefit from a lower-level implementation?
A much better definition would use a Data.Set, then you get laziness and the order of elements is not permuted. Having a sortNub as well is a good idea though. Thanks Neil
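
For concreteness, here is a minimal sketch of the Data.Set-based definition Neil describes (ordNub is a name chosen here for illustration, not an existing library function). It stays lazy and keeps elements in first-seen order:

  import qualified Data.Set as Set

  -- Lazy, order-preserving nub for Ord types: accumulate a Set of
  -- elements already seen, emitting each new element immediately.
  ordNub :: Ord a => [a] -> [a]
  ordNub = go Set.empty
    where
      go _ [] = []
      go seen (x:xs)
        | x `Set.member` seen = go seen xs
        | otherwise           = x : go (Set.insert x seen) xs

As a quick laziness check, take 3 (ordNub (cycle [3,1,2])) yields [3,1,2] even though the input list is infinite, and the original order is kept.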

On 2008.08.26 12:09:05 +0100, "Bayley, Alistair" wrote:
The name is... well, pessimal might be a bit strong, but few programmers would think to look for something called "nub". Personally, when I first looked for it I expected uniq or unique (because that's what the unix utility that does the same thing is called). Distinct (from SQL) is another name that occurred to me, but never nub... it's not even a synonym for unique: http://thesaurus.reference.com/browse/unique
See the redefinition of nub as uniq here (which I assume is because John didn't know about nub): http://hackage.haskell.org/packages/archive/MissingH/1.0.0/doc/html/Data-List-Utils.html
The folklore (such as it is) for uniq is that it is trivially defined like so (for lists):
uniq = map head . group . sort
and so perhaps is not worthy of library inclusion? BTW, is this a suitably performant definition, or would we benefit from a lower-level implementation?
Alistair
FWIW, when I tested out a number of nub/uniq definitions, the map head version performed the worst, much worse than the O(n^2) one everyone is complaining about. It's a neat definition, but not performant. -- gwern

Gwern Branwen wrote:
FWIW, when I tested out a number of nub/uniq definitions, the map head version performed the worst, much worse than the O(n^2) one everyone is complaining about. It's a neat definition, but not performant.
When did you try this? IIRC even the old sort was O(n^2), but someone had the sense to replace it a while ago.

With ghci now on my machine..

  length $ map head $ group $ sort [1..100000]

finishes "instantly", but..

  length $ nub [1..100000]

takes about 90 seconds.

Regards
--
Adrian Hey

On Tue, 2008-08-26 at 22:52 +0100, Adrian Hey wrote:
Gwern Branwen wrote:
FWIW, when I tested out a number of nub/uniq definitions, the map head version performed the worst, much worse than the O(n^2) one everyone is complaining about. It's a neat definition, but not performant.
When did you try this? IIRC even the old sort was O(n^2), but someone had the sense to replace it a while ago.
With ghci now on my machine.. length $ map head $ group $ sort [1..100000] finishes "instantly", but.. length $ nub [1..100000] takes about 90 seconds.
Also, sorting followed by grouping is unnecessary extra work. Sorts that discard duplicates are usually simple modifications of the sort algorithm. Though as people have pointed out, nub is nice because it is lazy, so sorting is out. An ord based nub should accumulate previously seen values so that it can operate lazily too. Duncan
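
To illustrate the kind of modification Duncan mentions (a sketch only; sortNub is a name invented here): a bottom-up merge sort whose merge step emits equal elements once, so duplicates disappear during sorting rather than in a separate grouping pass:

  -- Merge sort whose merge drops duplicates; every intermediate list is
  -- strictly increasing, so the result is sorted and duplicate-free.
  sortNub :: Ord a => [a] -> [a]
  sortNub = mergeAll . map (:[])
    where
      mergeAll []   = []
      mergeAll [xs] = xs
      mergeAll xss  = mergeAll (mergePairs xss)

      mergePairs (xs:ys:rest) = merge xs ys : mergePairs rest
      mergePairs xss          = xss

      merge [] ys = ys
      merge xs [] = xs
      merge (x:xs) (y:ys) = case compare x y of
        LT -> x : merge xs (y:ys)
        EQ -> x : merge xs ys      -- equal heads: keep one, drop the other
        GT -> y : merge (x:xs) ys

As Duncan notes, any sort-based approach gives up the laziness property: nothing is produced until the whole input has been consumed.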

On 2008.08.26 22:52:57 +0100, Adrian Hey wrote:
Gwern Branwen wrote:
FWIW, when I tested out a number of nub/uniq definitions, the map head version performed the worst, much worse than the O(n^2) one everyone is complaining about. It's a neat definition, but not performant.
When did you try this? IIRC even the old sort was O(n^2), but someone had the sense to replace it a while ago.
With ghci now on my machine.. length $ map head $ group $ sort [1..100000] finishes "instantly", but.. length $ nub [1..100000] takes about 90 seconds.
Regards -- Adrian Hey
Sure, but suppose the list isn't nicely ascending? Suppose it's even repeated? Try out

  Prelude Data.List> length $ map head $ group $ sort $ take 10000000 $ repeat [1]

versus

  Prelude> length $ nub $ take 10000000 $ repeat [1]

Then the situation is the opposite: the nub version is instant and the sort-based one takes a long time.

-- gwern

IMHO the right way to think about nub is that it filters the given list with a "stop list". This code captures that intuition, and allows new kinds of nubs to be quickly constructed. Don't know if anyone wants it, but I thought I'd show it off, anyhow.

  nubWrt :: (StopList e s) => s -> [e] -> [e]
  nubWrt _ [] = []
  nubWrt s (e : es)
    | memberS e s = nubWrt s es
    | otherwise   = e : nubWrt (insertS e s) es

  nub :: (Eq e) => [e] -> [e]
  nub = nubWrt []

  nubOrd :: (Ord e) => [e] -> [e]
  nubOrd = nubWrt S.empty

  nubInt :: [Int] -> [Int]
  nubInt = nubWrt IS.empty

I've omitted the StopList class and instances, which are of course crucial, but are kind of obvious. Get the whole code at http://svcs.cs.pdx.edu/gitweb?p=nub.git;f=Nub.hs;hb=master to see the details.

A day of painful fiddling with how to do benchmark timing (there are three packages in Hackage and I couldn't figure out how to use any of them) tells me that this version of nub is about the same speed as the nub in Prelude (unsurprisingly). This nubOrd is capable of running a list of 10^7 elements, all different, in about 23 seconds on my box; nubInt runs that list in about 5 seconds. Plain old nub runs a list of 10^5 elements, all different, in about 70 seconds.

For small lists, or lists where many of the elements are the same, everything's good: all the functions are fast and about the same speed. They all run lists of 10^8 identical elements in around 5s, although it turns out that the IntSet member test is close to twice as fast as list elem.

I haven't QuickCheck tested anything, but my little tests with ghci make me think the code is probably correct. Hope this helps.

Bart Massey
bart@cs.pdx.edu
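
The omitted class might look something like the following — a guess for readers who don't want to chase the link; the authoritative version is in Bart's repository:

  {-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
  import qualified Data.Set as S
  import qualified Data.IntSet as IS

  -- A "stop list" s of elements e: anything supporting a membership
  -- test and insertion will do.
  class StopList e s where
    memberS :: e -> s -> Bool
    insertS :: e -> s -> s

  instance Eq e => StopList e [e] where         -- plain nub: O(n) lookups
    memberS = elem
    insertS = (:)

  instance Ord e => StopList e (S.Set e) where  -- nubOrd: O(log n) lookups
    memberS = S.member
    insertS = S.insert

  instance StopList Int IS.IntSet where         -- nubInt: patricia-tree lookups
    memberS = IS.member
    insertS = IS.insert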

On 2008-08-26, Bayley, Alistair wrote:
The name is... well, pessimal might be a bit strong, but few programmers would think to look for something called "nub". Personally, when I first looked for it I expected uniq or unique (because that's what the unix utility that does the same thing is called). Distinct (from SQL) is another name that occurred to me, but never nub... it's not even a synonym for unique: http://thesaurus.reference.com/browse/unique
Right. It helps a bit to remember it if you think of it as acting on the list, rather than the elements -- it finds the heart of a list, as removing duplicate points in an argument reduces it to its core. -- Aaron Denney -><-

Neil Mitchell wrote:
There are 12 instances in Hoogle, for example. If profiling later shows it to be a problem, I'd fix it - but I can't ever actually remember that being the case. Premature optimisation is the root of all evil.
I'm sure most of the community would agree with you, but I have to say that if the consequence of this philosophy is horrors like this in the standard libs, it should come as no surprise that Haskell has a reputation for being "slow". What else is lurking there I wonder? What's really bad is that the terrible performance isn't even documented. It may be obvious, but it should still be documented. Has anybody even the remotest clue why darcs is (apparently) so slow? Maybe the community itself should share some of the blame for this. Like it wasn't obvious to me that the uses of nub I saw in darcs could rely on very short lists (<=1 element :-)
Having nubOrd is a great idea, I even proposed it previously, but people disagreed with me. Propose it and add it by all means.
Like I said, I'm not proposing it, as it doesn't seem to be possible to implement it efficiently using anything else in the standard libs. You could do nubOrd (but not nubOrdBy) using Data.Set. But there are 2 problems..

1- This not only introduces a cyclic dependency between modules, but also packages. I'm not sure how well various compilers and cabal would deal with this between them, but I'm not optimistic.

2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation; this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)

So I still think they should be deprecated. It seems like the least bad option if we can agree that their use should be strongly discouraged.
So could nub and nubBy be deprecated and an appropriate health warning added to the Haddock?
No. They should say O(n^2) in the haddock documentation, but O(n^2) /= useless.
I would say it is if there are many obvious O(n.(log n)) or better implementations that can be used in 99%+ of cases. I mean so obvious that users can quite easily roll their own in 3 lines of code or less.

Regards
--
Adrian Hey

Hi
Has anybody even the remotest clue why darcs is (apparently) so slow? Maybe the community itself should share some of the blame for this. Like it wasn't obvious to me that the uses of nub I saw in darcs could rely on very short lists (<=1 element :-)
I'd be very surprised if it has anything to do with nub!
You could do nubOrd (but not nubOrdBy) using Data.Set.
You can do nubOrdBy, just use the data type:

  data Elem a = Elem (a -> a -> Ordering) a
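
Spelled out, the trick might look like this (a sketch; it assumes the supplied comparison really is a total order, since Data.Set's invariants depend on it):

  import qualified Data.Set as Set

  -- Pair each element with the comparison function, so Data.Set can
  -- order values that have no Ord instance of their own.
  data Elem a = Elem (a -> a -> Ordering) a

  instance Eq (Elem a) where
    x == y = compare x y == EQ

  instance Ord (Elem a) where
    compare (Elem cmp x) (Elem _ y) = cmp x y

  nubOrdBy :: (a -> a -> Ordering) -> [a] -> [a]
  nubOrdBy cmp = go Set.empty
    where
      go _ [] = []
      go seen (x:xs)
        | Elem cmp x `Set.member` seen = go seen xs
        | otherwise = x : go (Set.insert (Elem cmp x) seen) xs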
1- This not only introduces a cyclic dependency between modules, but also packages. I'm not sure how well various compilers and cabal would deal with this between them, but I'm not optimistic.
Yep, that would be a pain.
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)
Data.Set is an interface, which you seem to think is not the fastest implementation of that interface. If you can expose the same interface, but improve the performance, and demonstrate that the performance is better, then go for it! I'd support such a library change proposal.
So I still think they should be deprecated. It seems like the least bad option if we can agree that their use should be strongly discouraged.
Nope. I love nub. It's a beautiful function, which does exactly what it says on the tin (albeit the tin is labelled in Latin).
I would say it is if there are many obvious O(n.(log n)) or better implementations that can be used in in 99%+ of cases. I mean so obvious that users can quite easily roll their own in 3 lines of code or less.
None of them are Eq a => [a] -> [a]. If designing things from scratch I'd say it's fine to have a debate about whether you make nub require Eq or Ord, and then provide the other as an option. But that time has passed. Thanks Neil

On Tue, Aug 26, 2008 at 11:11 AM, Neil Mitchell wrote:
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)
Data.Set is an interface, which you seem to think is not the fastest implementation of that interface. If you can expose the same interface, but improve the performance, and demonstrate that the performance is better, then go for it! I'd support such a library change proposal.
The problem with Data.Set and other collection types is that they're *not* defined as interfaces. If Data.Set was defined using a type class and maybe some associated data types it would be easier to provide an alternative implementation. Cheers, Johan
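
Something like the following is presumably the shape of interface Johan means (a sketch of the idea, not a concrete proposal): with an associated data family, each element type can pick its own representation, e.g. an IntSet-style patricia tree for Int.

  {-# LANGUAGE TypeFamilies #-}

  class SetLike a where
    data Set a
    empty  :: Set a
    insert :: a -> Set a -> Set a
    member :: a -> Set a -> Bool

  -- A toy instance: a set of Bool needs only two flags.
  instance SetLike Bool where
    data Set Bool = BoolSet !Bool !Bool   -- (contains False, contains True)
    empty = BoolSet False False
    insert False (BoolSet _ t) = BoolSet True t
    insert True  (BoolSet f _) = BoolSet f True
    member False (BoolSet f _) = f
    member True  (BoolSet _ t) = t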

On Tue, Aug 26, 2008 at 06:32:40PM +0100, Adrian Hey wrote:
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)
Do you have benchmarks to quantify how bad Data.Set and Data.Map are?

ross:
On Tue, Aug 26, 2008 at 06:32:40PM +0100, Adrian Hey wrote:
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)
Do you have benchmarks to quantify how bad Data.Set and Data.Map are?
Yes, a clear replacement for just Data.Map, with numbers, even if prototyped, would be rather compelling. And not bogged down in too much other code. Numbers please, or at least a direction to work in! -- Don

Hi
Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)
Do you have benchmarks to quantify how bad Data.Set and Data.Map are?
I'm curious about the badly documented, unreliable or unstable comments too - if any of those are true (and in my experience they aren't) then they should also be fixed. Inefficient is the least of the problems on that list. Thanks Neil

Don Stewart wrote:
ross:
On Tue, Aug 26, 2008 at 06:32:40PM +0100, Adrian Hey wrote:
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)

Do you have benchmarks to quantify how bad Data.Set and Data.Map are?
Yes, a clear replacement for just Data.Map , with numbers, even if prototyped, would be rather compelling. And not bogged down in too much other code.
Numbers please, or at least a direction to work in!
Not sure what you're asking for. I have Data.Map & Data.Set clones but decided not to publish them, as I think folk should be migrating to Jamie Brandon's GMap API (hopefully the first release of that will be in Hackage soon). Jamie is also doing some benchmarking of various implementations of that API (one of them being AVL trees) and Data.Map. Regards -- Adrian Hey

ahey:
Don Stewart wrote:
ross:
On Tue, Aug 26, 2008 at 06:32:40PM +0100, Adrian Hey wrote:
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)

Do you have benchmarks to quantify how bad Data.Set and Data.Map are?
Yes, a clear replacement for just Data.Map , with numbers, even if prototyped, would be rather compelling. And not bogged down in too much other code.
Numbers please, or at least a direction to work in!
Not sure what you're asking for. I have Data.Map & Data.Set clones but decided not to publish them, as I think folk should be migrating to Jamie Brandon's GMap API (hopefully the first release of that will be in Hackage soon). Jamie is also doing some benchmarking of various implementations of that API (one of them being AVL trees) and Data.Map.
Ok. That's good to know. Benchmark numbers would be great to see! -- Don

Ross Paterson wrote:
On Tue, Aug 26, 2008 at 06:32:40PM +0100, Adrian Hey wrote:
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)
Do you have benchmarks to quantify how bad Data.Set and Data.Map are?
I was just trying to "yank Neil's chain" a bit re. something he said a while ago. I don't think they're that bad really, as long as you don't use union, intersection etc. (the hedge algorithm seems bad). No, I don't have any recent benchmarks, but I posted old ones a long time ago if you can find them. I can't :-) Regards -- Adrian Hey

ahey:
Ross Paterson wrote:
On Tue, Aug 26, 2008 at 06:32:40PM +0100, Adrian Hey wrote:
2- Data.Set is not obviously the best underlying implementation (in fact it is obviously not the best underlying implementation, this and Data.Map etc really should be pensioned off to hackage along with the rest of the badly documented, unreliable, inefficient and unstable junk :-)
Do you have benchmarks to quantify how bad Data.Set and Data.Map are?
I was just trying to "yank Neil's chain" a bit re. something he said a while ago. I don't think they're that bad really, as long as you don't use union, intersection etc. (the hedge algorithm seems bad).
No, I don't have any recent benchmarks but I posted old ones a long time ago if you can find them. I can't :-)
Given all the different experiments over the last couple of years, could you perhaps summarise your thoughts on where we should be looking for the next great Data.Map? -- Don

At Tue, 26 Aug 2008 08:38:45 +0100, Adrian Hey wrote:
I was looking at the definitions of nub (and hence nubBy) recently in connection with a trac ticket and realised that this is O(n^2) complexity! Ouch!
Can we modify Data.List to include the big-O notation for all the functions similar to Data.Set, Data.Map, and bytestring? j.

Jeremy Shaw wrote:
At Tue, 26 Aug 2008 08:38:45 +0100, Adrian Hey wrote:
I was looking at the definitions of nub (and hence nubBy) recently in connection with a trac ticket and realised that this is O(n^2) complexity! Ouch!
Can we modify Data.List to include the big-O notation for all the functions similar to Data.Set, Data.Map, and bytestring?
I'm sure this is possible. I have no idea how to do this though. Perhaps someone else can explain. But patches submitted by ordinary users (even bug fixes) tend to languish in obscurity for some reason. I don't even know where the source code is, or if we're using darcs or git or what these days (but I dare say I could find out if I was sufficiently motivated :-). Regards -- Adrian Hey

nub has a couple of advantages over the n log n version, the main one being that it only requires an 'Eq' constraint, not an 'Ord' one. Another being that it is fully lazy, it produces results even for an infinite list. A third is that the results come out in the order they went in. That said, I have a 'snub' (sorted nub) routine I use pretty commonly as well, defined in the standard way.

If you have something like

  setnub xs = f Set.empty xs
    where f _ [] = []
          f s (x:xs)
            | x `Set.member` s = f s xs
            | otherwise        = x : f (Set.insert x s) xs

then you can use 'RULES' pragmas to replace nub with setnub when it is allowed. Though, I have reservations about using RULES to change the order of complexity.

John
--
John Meacham - ⑆repetae.net⑆john⑈
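
For the record, such a pragma can't be guarded by a class context ("rewrite whenever Ord is available"); it has to be stated at specific monomorphic types where both Eq and Ord hold, and it only fires if nub isn't inlined first. A sketch, assuming the setnub above is in scope alongside import Data.List (nub):

  {-# RULES
  "nub/Int"    nub = setnub :: [Int] -> [Int]
  "nub/String" nub = setnub :: [String] -> [String]
    #-}

Both sides preserve order and laziness, so the rewrite should be semantics-preserving provided the type's Eq and Ord instances agree.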

John Meacham wrote:
nub has a couple advantages over the n log n version, the main one being that it only requires an 'Eq' constraint, not an 'Ord' one.
This is only an advantage for a tiny minority of types that have Eq but no Ord instances, such as TypeRep. But there's no good reason why even this cannot have an Ord instance AFAICS (though not a derived one, obviously).
another being that it is fully lazy, it produces results even for an infinite list.
As does the O(n*(log n)) AVL based nub I just wrote.
a third is that the results come out in the order they went in.
As does the O(n*(log n)) AVL based nub I just wrote :-) Regards -- Adrian Hey

2008/8/27 Adrian Hey wrote:
As does the O(n*(log n)) AVL based nub I just wrote.
Care to share? WDYT about using RULES to rewrite to nubOrd if an Ord context is available, as John Meacham mentioned? John: you said you were uneasy about changing the complexity of an algorithm using RULES, but this is exactly what list fusion does (albeit for space, not time). -- -David

On Wed, Aug 27, 2008 at 04:48:41PM +0100, David House wrote:
2008/8/27 Adrian Hey wrote:
As does the O(n*(log n)) AVL based nub I just wrote.
Care to share?
WDYT about using RULES to rewrite to nubOrd if an Ord context is available, as John Meacham mentioned?
John: you said you were uneasy about changing the complexity of an algorithm using RULES, but this is exactly what list fusion does (albeit for space, not time).
Indeed, and I am a little uneasy about that too :) not that I don't think it is an excellent idea to reap the benefits of fast list operations.. O(n^2) to O(n log n) just feels like a bigger jump to me for some reason. I think in the end the RULES are a good idea, but I personally try to make which one I am using explicit when possible, hence making my ideas as to the time/space requirements for an operation clear to both the compiler and a future reader of the code. John -- John Meacham - ⑆repetae.net⑆john⑈

WDYT about using RULES to rewrite to nubOrd if an Ord context is available, as John Meacham mentioned?
John: you said you were uneasy about changing the complexity of an algorithm using RULES, but this is exactly what list fusion does (albeit for space, not time).
Indeed, and I am a little uneasy about that too :) not that I don't think it is an excellent idea to reap the benefits of fast list operations.. O(n^2) to O(n log n) just feels like a bigger jump to me for some reason. I think in the end the RULES are a good idea, but I personally try to make which one I am using explicit when possible, hence making my ideas as to the time/space requirements for an operation clear to both the compiler and a future reader of the code.
Using RULES in this way could be a pessimization. I can't imagine a nubOrd that would ever be as performant as nub on very small data sets (two or three elements). --Lane

Christopher Lane Hinson wrote:
WDYT about using RULES to rewrite to nubOrd if an Ord context is available, as John Meacham mentioned?
John: you said you were uneasy about changing the complexity of an algorithm using RULES, but this is exactly what list fusion does (albeit for space, not time).
Indeed, and I am a little uneasy about that too :) not that I don't think it is an excellent idea to reap the benefits of fast list operations.. O(n^2) to O(n log n) just feels like a bigger jump to me for some reason. I think in the end the RULES are a good idea, but I personally try to make which one I am using explicit when possible, hence making my ideas as to the time/space requirements for an operation clear to both the compiler and a future reader of the code.
Using RULES in this way could be a pessimization. I can't imagine a nubOrd that would ever be as performant as nub on very small data sets (two or three elements).
Why not? Comparison doesn't cost any more than equality testing, and I think we can safely say that nubOrd will never require *more* comparisons than equality tests needed by nub. Regards -- Adrian Hey

On Wed, 27 Aug 2008, Adrian Hey wrote:
Using RULES in this way could be a pessimization. I can't imagine a nubOrd that would ever be as performant as nub on very small data sets (two or three elements).
Why not? Comparison doesn't cost any more than equality testing, and I think we can safely say that nubOrd will never require *more* comparisons than equality tests needed by nub.
I think that you will have to do extra comparisons to build whatever ordered data structure you use, and extra work to keep it balanced or you will risk falling back on O(n^2) behavior, and feed the garbage collector a little bit more with discarded intermediate nodes. Whether this is a real problem or not is for empirical testing, assuming anyone cares that much. I actually really like the RULES idea. There seemed to be an intuition that changing the complexity of a function using RULES is a bad idea, and I was putting forward a possible basis for that intuition. A rule of thumb is that when you switch to an algorithm with a better complexity class, you pay at least a small price in the constant factor. --Lane

On 2008 Aug 27, at 17:09, Christopher Lane Hinson wrote:
WDYT about using RULES to rewrite to nubOrd if an Ord context is available, as John Meacham mentioned?
John: you said you were uneasy about changing the complexity of an algorithm using RULES, but this is exactly what list fusion does (albeit for space, not time).
Indeed, and I am a little uneasy about that too :) not that I don't think it is an excellent idea to reap the benefits of fast list operations.. O(n^2) to O(n log n) just feels like a bigger jump to me
Using RULES in this way could be a pessimization. I can't imagine a nubOrd that would ever be as performant as nub on very small data sets (two or three elements).
I think if you're in a situation where the performance of nub on such a small dataset matters, you're already well into the realm of controlling everything manually to get performance.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com
system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH

David House wrote:
2008/8/27 Adrian Hey wrote:
As does the O(n*(log n)) AVL based nub I just wrote.
Care to share?
Picking up on this thread again.. :-) I put an updated AvlTree package on hackage. A bit premature, but what the heck. There's no test in the test suite for it, but it seemed to work fine with a few tests in ghci.
WDYT about using RULES to rewrite to nubOrd if an Ord context is available, as John Meacham mentioned?
Probably won't work with mine as I called it nub :-) Other than that it seems like a reasonable idea. Regards -- Adrian Hey

2008/8/26 Adrian Hey wrote:
Seeing as practically all Eq instances are also Ord instances, at the very least we could have O(n*(log n)) definitions for ..
nub :: Ord a => [a] -> [a] nub = nubBy compare
nubBy :: (a -> a -> Ordering) -> [a] -> [a] nubBy cmp xs = -- something using an AVL tree perhaps.
While I agree that it would be handy to have Ord-specific versions of these, I'd just like to reiterate/expand on something which Neil mentioned: map head . group . sort has the additional effect of:

1) Demanding the entire input list before it can produce even a single element, and
2) Sorting the result rather than keeping things in the order they first occurred in the input.

A correct implementation of nub which made use of Ord would maintain (say) a Data.Set of elements already seen as it traversed the list lazily, producing elements in the output as soon as new elements were seen in the input, and no later. This of course guarantees that you return them in the order that they're first seen as well. This is still O(n log n), but reduces correctly to O(k log k) when only k elements of the input are needed to get the desired number of elements of the resulting list. Please don't make nub any stricter!

On a side note related to the request for inclusion of complexities, since Haskell is a lazy language, we really ought to have complexities written in terms of the demanded portion of the result. Of course, Data.Set and Data.Map are (structure) strict, so it doesn't affect them so much, but it would certainly be nice for the functions in Data.List to know the answer to "if the input is size n and I only demand k of the elements of the result, then what is the complexity?", especially for things like sort, where a lazy implementation can, for instance, make the head of the list available in just O(n) time.

- Cale
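
A small self-contained illustration of that last point (lazySort is a name invented here; GHC's actual Data.List.sort is a more refined bottom-up merge sort with, I believe, the same property): with a lazy bottom-up merge sort, demanding only the head forces one comparison per merge node, n-1 in total, so head . lazySort costs O(n) rather than the full O(n log n).

  -- Lazy bottom-up merge sort: same shape as the sortNub sketch earlier
  -- in the thread, but the merge keeps duplicates.
  lazySort :: Ord a => [a] -> [a]
  lazySort = mergeAll . map (:[])
    where
      mergeAll []   = []
      mergeAll [xs] = xs
      mergeAll xss  = mergeAll (mergePairs xss)

      mergePairs (xs:ys:rest) = merge xs ys : mergePairs rest
      mergePairs xss          = xss

      merge [] ys = ys
      merge xs [] = xs
      merge (x:xs) (y:ys)
        | x <= y    = x : merge xs (y:ys)
        | otherwise = y : merge (x:xs) ys

  -- head (lazySort xs) compares each pair of sub-results exactly once on
  -- the way to the minimum: O(n) comparisons, not O(n log n).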

On Fri, 2008-08-29 at 02:56 -0400, Cale Gibbard wrote:
On a side note related to the request for inclusion of complexities, since Haskell is a lazy language, we really ought to have complexities written in terms of the demanded portion of the result. Of course, Data.Set and Data.Map are (structure) strict, so it doesn't affect them so much, but it would certainly be nice for the functions in Data.List to know the answer to "if the input is size n and I only demand k of the elements of the result, then what is the complexity?", especially for things like sort, where a lazy implementation can, for instance, make the head of the list available in just O(n) time.
Yeah, that's fairly important, though it's quite subtle. For example we'd probably say if we demand k elements of the result of map then it costs proportional to k (though we're not accounting for the cost of the function application to each element), but technically that's true for cons or tail too, even though they only do O(1) work and do not change the 'potential'/'cost' of the remainder of the list. Probably we can gloss over the details for a simple indication in the docs, but just for fun here's a spanner:

  foldr (\x xs -> x `seq` (x:xs)) []

It's only O(1) to eval to whnf. It adds O(n) work to the whole structure, but that property doesn't distinguish it from

  foldr (\x xs -> (x:xs)) []

The difference of course is that it identifies costs within the structure. So when a function demands the spine it also has to pay the cost of the elements up front.

Duncan
participants (16)

- Aaron Denney
- Adrian Hey
- Bart Massey
- Bayley, Alistair
- Brandon S. Allbery KF8NH
- Cale Gibbard
- Christopher Lane Hinson
- David House
- Don Stewart
- Duncan Coutts
- Gwern Branwen
- Jeremy Shaw
- Johan Tibell
- John Meacham
- Neil Mitchell
- Ross Paterson