
I have a list of text strings: ["Alice", "Bob", "Cindy", "Bob", "Bob", "Dave", "Cindy"] As you can see, some of the strings occur only once; others appear two or more times. I would like to end up with a new list, according to the following rules: 1) If a string occurs only once in the original list, then that string appears unchanged in the new list. 2) If a string occurs more than once in the original list, then each occurrences of that string is "decorated" with a numerical suffix indicating its relative position among its peers. In the case of the above list, the result of applying these rules would be: ["Alice", "Bob:1", "Cindy:1", "Bob:2", "Bob:3", "Dave", "Cindy:2"] So, how do we do this? The straightforward approach is to traverse the list twice, once to build a table of repeat counts for the different strings, and a second time to generate the new list, based on the old list plus the table of repeat counts. Of course, it's possible to streamline this into a single traversal by building up a list of thunks at the same time as the table of repeat counts is generated, and then to resolve those thunks by applying them in one fell swoop to the table. But while this does _conceptually_ streamline the problem, it doesn't offer any practical improvement; it takes at least as much processing effort to resolve the thunks as it would to make a second pass through the list. Can we do better? I can't think of a way to do better, and my gut feeling is that there isn't one (in general, it's impossible to know whether the first element of the new list will be "Alice" or "Alice:1" until the entire original list has been traversed), but I thought I'd put the question out to the community to see if anyone had any brilliant insights. Steve Schafer Fenestra Technologies Corp. http://www.fenestra.com/

I have a list of text strings:
["Alice", "Bob", "Cindy", "Bob", "Bob", "Dave", "Cindy"]
As you can see, some of the strings occur only once; others appear two or more times.
I would like to end up with a new list, according to the following rules:
1) If a string occurs only once in the original list, then that string appears unchanged in the new list.
2) If a string occurs more than once in the original list, then each occurrences of that string is "decorated" with a numerical suffix indicating its relative position among its peers.
In the case of the above list, the result of applying these rules would be:
["Alice", "Bob:1", "Cindy:1", "Bob:2", "Bob:3", "Dave", "Cindy:2"]
So, how do we do this? The straightforward approach is to traverse the list twice, once to build a table of repeat counts for the different strings, and a second time to generate the new list, based on the old list plus the table of repeat counts. It seems you only need a table of counts of the number of times
Steve Schafer wrote: the string has occurred previously, which you can have available on the first pass. e.g, import Data.List import Data.Map as Map mark strings = snd $ mapAccumL (\counts str -> (Map.insertWith (+) str 1 counts, str++maybe "" (\n -> ':':show n) (Map.lookup str counts))) Map.empty strings this works on your example, and also handles ["Alice", "Bob", "Cindy", "Bob", "Bob", "Dave", "Cindy"] ++ repeat "Tim" giving ["Alice","Bob","Cindy","Bob:1","Bob:2","Dave","Cindy:1","Tim","Tim:1","Tim:2",...]
Of course, it's possible to streamline this into a single traversal by building up a list of thunks at the same time as the table of repeat counts is generated, and then to resolve those thunks by applying them in one fell swoop to the table. But while this does _conceptually_ streamline the problem, it doesn't offer any practical improvement; it takes at least as much processing effort to resolve the thunks as it would to make a second pass through the list.
If your result data structure is sufficiently lazy, you can start forcing some thunks before the list has been completely processed. For example, suppose you want to instead what to tag all but the last string in a list with an asterisk, like ["Alice", "Bob", "Cindy", "Bob", "Bob", "Dave", "Cindy"] ==> ["Alice", "Bob*", "Cindy*", "Bob*", "Bob", "Dave", "Cindy"] You can start to get results before the list has been processed, if you have a set implementation with the laziness property member x set --> Set.member x (set `union` undefined) (you can get this if you use a trie) Then you can use define something like tag = foldr (\(x (xs, xsAsSet) -> ((x++if member x xsAsSet then "*" else ""):xs, insert x xsAsSet)) empty Then, you can force the results of examples like repeat "Ted" before the (infinite) list has been completely scanned, as long as no value is the last. Maybe there's a more interesting example? Brandon

On Thu, 02 Nov 2006 12:20:55 -0800, you wrote:
It seems you only need a table of counts of the number of times the string has occurred previously, which you can have available on the first pass.
e.g, import Data.List import Data.Map as Map mark strings = snd $ mapAccumL (\counts str -> (Map.insertWith (+) str 1 counts, str++maybe "" (\n -> ':':show n) (Map.lookup str counts))) Map.empty strings
this works on your example, and also handles
["Alice", "Bob", "Cindy", "Bob", "Bob", "Dave", "Cindy"] ++ repeat "Tim" giving
["Alice","Bob","Cindy","Bob:1","Bob:2","Dave","Cindy:1","Tim","Tim:1","Tim:2",...]
No, you missed a part of the second rule: If there are multiple occurrences of a string, the _first_ occurrence has to be tagged, too, not just the second through n'th occurrences. The result of the above would have to be: ["Alice","Bob:1","Cindy:1","Bob:2","Bob:3","Dave","Cindy:2","Tim:1","Tim:2","Tim:3",...] Steve Schafer Fenestra Technologies Corp. http://www.fenestra.com/

Steve Schafer wrote:
I have a list of text strings:
["Alice", "Bob", "Cindy", "Bob", "Bob", "Dave", "Cindy"]
As you can see, some of the strings occur only once; others appear two or more times.
I would like to end up with a new list, according to the following rules:
1) If a string occurs only once in the original list, then that string appears unchanged in the new list.
2) If a string occurs more than once in the original list, then each occurrences of that string is "decorated" with a numerical suffix indicating its relative position among its peers.
In the case of the above list, the result of applying these rules would be:
["Alice", "Bob:1", "Cindy:1", "Bob:2", "Bob:3", "Dave", "Cindy:2"]
So, how do we do this? The straightforward approach is to traverse the list twice, once to build a table of repeat counts for the different strings, and a second time to generate the new list, based on the old list plus the table of repeat counts. Of course, it's possible to streamline this into a single traversal by building up a list of thunks at the same time as the table of repeat counts is generated, and then to resolve those thunks by applying them in one fell swoop to the table. But while this does _conceptually_ streamline the problem, it doesn't offer any practical improvement; it takes at least as much processing effort to resolve the thunks as it would to make a second pass through the list.
The processing effort is a matter of optimization. It is entirely possible that for small lists that fit in the CPU cache that a double traversal is very cheap, where for a large lists that spills into swap space on disk that a second traversal is very expensive. So the obvious thing to do is just write the easiest to maintain version and if the performance benchmarks are bad then try the other method.
Can we do better?
no.
participants (3)
-
Brandon Moore
-
Chris Kuklewicz
-
Steve Schafer