Optimization with Strings ?

Hello In my futur program, it use a lot of binary trees with strings (words) as leaf. There is just arround 1000 words and they will appear a lot of times. The program will possibly consume a lot of process and memory (it is a mathematics proover). I began this program in C++ but haskell has a prety good speed and memory footprint and is easier. But I don't know if it worth to do this optimization: having a dictionary to translate string words in Int. The answer depends on the automatic optimizations in GHC, because GHC could compare quickely two strings if it is the same object, so it depends if program generated by GHC have a dictionary (tree) of strings internaly. Someone knows this ? (Sorry for my english, I'm french) thank you -- Emmanuel Chantréau

On Thu, Dec 3, 2009 at 1:03 PM, Emmanuel CHANTREAU
In my futur program, it use a lot of binary trees with strings (words) as leaf. There is just arround 1000 words and they will appear a lot of times. The program will possibly consume a lot of process and memory (it is a mathematics proover).
I began this program in C++ but haskell has a prety good speed and memory footprint and is easier. But I don't know if it worth to do this optimization: having a dictionary to translate string words in Int.
The answer depends on the automatic optimizations in GHC, because GHC could compare quickely two strings if it is the same object, so it depends if program generated by GHC have a dictionary (tree) of strings internaly. Someone knows this ?
It doesn't work this way : Strings are just lists of Chars. Comparison is made recursively, Char by Char. You can have a look at the source to make sure : instance (Eq a) => Eq [a] where [] == [] = True (x:xs) == (y:ys) = x == y && xs == ys _xs == _ys = False So you will have to code your own optimisation. David. P.S. In French if you didn't understand: Ca ne marche pas comme ça. Les chaines de caractères ne sont que des listes de caractères. La comparaison sur les listes est faite récursivement, caractère par caractère, il suffit pour s'en assurer de regarder au source : Donc il vaut mieux que tu implémente ton propre dictionnaire.

Le Thu, 3 Dec 2009 13:20:31 +0100,
David Virebayre
It doesn't work this way : Strings are just lists of Chars. Comparison is made recursively, Char by Char. You can have a look at the source to make sure :
instance (Eq a) => Eq [a] where [] == [] = True (x:xs) == (y:ys) = x == y && xs == ys _xs == _ys = False
Hello Thank you David and Bulat for your answers. I don't see the proof you see. Because GHC could store two sames objects juste once and by the definition of == on lists it could deduce that "forall x; List x => x==x". GHC have all informations to do this optimization job, because haskell functions definitions are mathematics definitions. Bulat says that this optimization is not done, so I will do it by my hands (ho my poor lazy hands). -- Emmanuel Chantréau

Emmanuel CHANTREAU wrote:
Le Thu, 3 Dec 2009 13:20:31 +0100, David Virebayre
a écrit : It doesn't work this way : Strings are just lists of Chars. Comparison is made recursively, Char by Char. You can have a look at the source to make sure :
instance (Eq a) => Eq [a] where [] == [] = True (x:xs) == (y:ys) = x == y && xs == ys _xs == _ys = False
Hello
Thank you David and Bulat for your answers.
I don't see the proof you see. Because GHC could store two sames objects juste once and by the definition of == on lists it could deduce that "forall x; List x => x==x". GHC have all informations to do this optimization job, because haskell functions definitions are mathematics definitions.
Besides any other reasons, Haskell has the error function, and infinite lists. Consider: p :: String p = error "Haha!" q :: String q = repeat 'a' pEqualsP :: Bool pEqualsP = p == p qEqualsQ :: Bool qEqualsQ = q == q By your rule, pEqualsP and qEqualsQ should be True. In fact, the correct answer is that pEqualsP should produce an error and qEqualsQ should never terminate. Since Strings can contain such errors and infinite lists, you can't know for certain that an object equals itself without checking its entire length, which is what the original definition for equals did anyway. There may be strict data structures for which your optimisation might be applicable, though. Neil.

Am Donnerstag 03 Dezember 2009 16:31:56 schrieb Neil Brown:
Emmanuel CHANTREAU wrote:
Le Thu, 3 Dec 2009 13:20:31 +0100,
David Virebayre
a écrit : It doesn't work this way : Strings are just lists of Chars. Comparison is made recursively, Char by Char. You can have a look at the source to make sure :
instance (Eq a) => Eq [a] where [] == [] = True (x:xs) == (y:ys) = x == y && xs == ys _xs == _ys = False
Hello
Thank you David and Bulat for your answers.
I don't see the proof you see. Because GHC could store two sames objects juste once and by the definition of == on lists it could deduce that "forall x; List x => x==x". GHC have all informations to do this optimization job, because haskell functions definitions are mathematics definitions.
Besides any other reasons, Haskell has the error function, and infinite lists. Consider:
p :: String p = error "Haha!"
q :: String q = repeat 'a'
pEqualsP :: Bool pEqualsP = p == p
qEqualsQ :: Bool qEqualsQ = q == q
By your rule, pEqualsP and qEqualsQ should be True. In fact, the correct answer is that pEqualsP should produce an error and qEqualsQ should never terminate. Since Strings can contain such errors and infinite lists, you can't know for certain that an object equals itself without checking its entire length, which is what the original definition for equals did anyway. There may be strict data structures for which your optimisation might be applicable, though.
Neil.
However, GHC offers a *really unsafe* function that allows to quickly check whether two values refer to the same heap-object. But it won't help Emmanuel, because any indirection causes it to say no and "let x = expression in x == x" should never appear in code anyway.

" In fact, the correct answer is that pEqualsP should produce an error and
qEqualsQ should never terminate"
¿¿???
should? or you want to say "actually do that so the optimization does is not
done"?
The correct amswer is not the sould you mention, but True (IMHO). So the
optimization can be done anyway.
2009/12/3 Neil Brown
Emmanuel CHANTREAU wrote:
Le Thu, 3 Dec 2009 13:20:31 +0100, David Virebayre
> a écrit : It doesn't work this way : Strings are just lists of Chars. Comparison is made recursively, Char by Char. You can have a look at the source to make sure :
instance (Eq a) => Eq [a] where [] == [] = True (x:xs) == (y:ys) = x == y && xs == ys _xs == _ys = False
Hello
Thank you David and Bulat for your answers.
I don't see the proof you see. Because GHC could store two sames objects juste once and by the definition of == on lists it could deduce that "forall x; List x => x==x". GHC have all informations to do this optimization job, because haskell functions definitions are mathematics definitions.
Besides any other reasons, Haskell has the error function, and infinite lists. Consider:
p :: String p = error "Haha!"
q :: String q = repeat 'a'
pEqualsP :: Bool pEqualsP = p == p
qEqualsQ :: Bool qEqualsQ = q == q
By your rule, pEqualsP and qEqualsQ should be True. In fact, the correct answer is that pEqualsP should produce an error and qEqualsQ should never terminate. Since Strings can contain such errors and infinite lists, you can't know for certain that an object equals itself without checking its entire length, which is what the original definition for equals did anyway. There may be strict data structures for which your optimisation might be applicable, though.
Neil.
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Thu, Dec 3, 2009 at 12:21 PM, Alberto G. Corona
" In fact, the correct answer is that pEqualsP should produce an error and qEqualsQ should never terminate"
¿¿??? should? or you want to say "actually do that so the optimization does is not done"? The correct amswer is not the sould you mention, but True (IMHO). So the optimization can be done anyway.
No, the correct answer is actually _|_. Returning True would violate referential transparency: p = [1..] Now suppose p == p returned True. Then we can substitute any name for its value, by ref trans. So p == [1..] should also return True. This is reasonable. However, it is possible to prove (outside Haskell, but still rigoroously in terms of its semantics) that [1..] = fix (\xs -> 1 : map (+1) xs). So p == fix (1 : map (+1) xs) should also return True. This is getting unreasonable. Such an expectation would seem to require (==) to do a comprehensive proof search. Luke
2009/12/3 Neil Brown
Emmanuel CHANTREAU wrote:
Le Thu, 3 Dec 2009 13:20:31 +0100, David Virebayre
a écrit : It doesn't work this way : Strings are just lists of Chars. Comparison is made recursively, Char by Char. You can have a look at the source to make sure :
instance (Eq a) => Eq [a] where [] == [] = True (x:xs) == (y:ys) = x == y && xs == ys _xs == _ys = False
Hello
Thank you David and Bulat for your answers.
I don't see the proof you see. Because GHC could store two sames objects juste once and by the definition of == on lists it could deduce that "forall x; List x => x==x". GHC have all informations to do this optimization job, because haskell functions definitions are mathematics definitions.
Besides any other reasons, Haskell has the error function, and infinite lists. Consider:
p :: String p = error "Haha!"
q :: String q = repeat 'a'
pEqualsP :: Bool pEqualsP = p == p
qEqualsQ :: Bool qEqualsQ = q == q
By your rule, pEqualsP and qEqualsQ should be True. In fact, the correct answer is that pEqualsP should produce an error and qEqualsQ should never terminate. Since Strings can contain such errors and infinite lists, you can't know for certain that an object equals itself without checking its entire length, which is what the original definition for equals did anyway. There may be strict data structures for which your optimisation might be applicable, though.
Neil. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Am Donnerstag, den 03.12.2009, 16:23 +0100 schrieb Emmanuel CHANTREAU:
Le Thu, 3 Dec 2009 13:20:31 +0100, David Virebayre
a écrit : It doesn't work this way : Strings are just lists of Chars. Comparison is made recursively, Char by Char. You can have a look at the source to make sure :
instance (Eq a) => Eq [a] where [] == [] = True (x:xs) == (y:ys) = x == y && xs == ys _xs == _ys = False
Hello
Thank you David and Bulat for your answers.
I don't see the proof you see. Because GHC could store two sames objects juste once and by the definition of == on lists it could deduce that "forall x; List x => x==x". GHC have all informations to do this optimization job, because haskell functions definitions are mathematics definitions.
This does not always hold, because the equivalence known from mathematics differs from Haskell's strict equality when it comes to infinite or undefined values. Consider let x = repeat () in x == x and let x = error "oops" in x == x In both cases your optimized program would return True, although the value is undefined (bottom).

Hello Emmanuel, Thursday, December 3, 2009, 6:23:56 PM, you wrote:
that "forall x; List x => x==x". GHC have all informations to do this optimization job, because haskell functions definitions are mathematics definitions.
GHC doesn't make ALL possible optimizations, isn't it obvious? ;) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hello Emmanuel, Thursday, December 3, 2009, 3:03:02 PM, you wrote:
memory footprint and is easier. But I don't know if it worth to do this optimization: having a dictionary to translate string words in Int.
GHC compiler already has this optimization. unfortunately it's not in the code it generates but in compiler itself (GHC is written in Haskell and compiled by itself). so it is definitely worth an implementation if you handle lots of strings as compiler does -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Emmanuel CHANTREAU on 2009-12-03 13:03:02 +0100:
In my futur program, it use a lot of binary trees with strings (words) as leaf. There is just arround 1000 words and they will appear a lot of times. The program will possibly consume a lot of process and memory (it is a mathematics proover).
I began this program in C++ but haskell has a prety good speed and memory footprint and is easier. But I don't know if it worth to do this optimization: having a dictionary to translate string words in Int.
The answer depends on the automatic optimizations in GHC, because GHC could compare quickely two strings if it is the same object, so it depends if program generated by GHC have a dictionary (tree) of strings internaly. Someone knows this ?
I don't know of a library to intern strings, but it's not too hard to implement. I couldn't find the code I wrote to do this, but I looked around a bit and this is about what I remember doing: http://www.haskell.org/pipermail/haskell-cafe/2005-June/010335.html For the application I was using, interning strings did provide a significant reduction in memory, but for whatever reason didn't help with speed.

On Thu, Dec 3, 2009 at 12:32 PM, Alec Berryman
Emmanuel CHANTREAU on 2009-12-03 13:03:02 +0100:
In my futur program, it use a lot of binary trees with strings (words) as leaf. There is just arround 1000 words and they will appear a lot of times. The program will possibly consume a lot of process and memory (it is a mathematics proover).
I began this program in C++ but haskell has a prety good speed and memory footprint and is easier. But I don't know if it worth to do this optimization: having a dictionary to translate string words in Int.
The answer depends on the automatic optimizations in GHC, because GHC could compare quickely two strings if it is the same object, so it depends if program generated by GHC have a dictionary (tree) of strings internaly. Someone knows this ?
I don't know of a library to intern strings, but it's not too hard to implement. I couldn't find the code I wrote to do this, but I looked around a bit and this is about what I remember doing:
http://www.haskell.org/pipermail/haskell-cafe/2005-June/010335.html
For the application I was using, interning strings did provide a significant reduction in memory, but for whatever reason didn't help with speed.
I'd use a trie. Edison provides Data.Edison.Assoc.TernaryTrie, and
there are a few other trie packages at hackage.
--
Dave Menendez

David Menendez wrote:
Alec Berryman wrote:
I don't know of a library to intern strings, but it's not too hard to implement. I couldn't find the code I wrote to do this, but I looked around a bit and this is about what I remember doing:
http://www.haskell.org/pipermail/haskell-cafe/2005-June/010335.html
For the application I was using, interning strings did provide a significant reduction in memory, but for whatever reason didn't help with speed.
I'd use a trie. Edison provides Data.Edison.Assoc.TernaryTrie, and there are a few other trie packages at hackage.
One of those: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-trie also uses ByteStrings instead of Strings. This will further reduce your memory footprint and will improve performance (because of cache consistency, less pointer chasing, and using C's fast string comparison function). I've been meaning to write a ByteString interning library on top of it, but haven't found the time just yet. -- Live well, ~wren

You might want to check out the "stringtable-atom" and "bytestring-trie" packages; these are the packages to which I turn when I want to see if I can speed up my code by using a different data structure to map String's to values. Cheers, Greg On Dec 3, 2009, at 4:03 AM, Emmanuel CHANTREAU wrote:
Hello
In my futur program, it use a lot of binary trees with strings (words) as leaf. There is just arround 1000 words and they will appear a lot of times. The program will possibly consume a lot of process and memory (it is a mathematics proover).
I began this program in C++ but haskell has a prety good speed and memory footprint and is easier. But I don't know if it worth to do this optimization: having a dictionary to translate string words in Int.
The answer depends on the automatic optimizations in GHC, because GHC could compare quickely two strings if it is the same object, so it depends if program generated by GHC have a dictionary (tree) of strings internaly. Someone knows this ?
(Sorry for my english, I'm french)
thank you
-- Emmanuel Chantréau _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Emmanuel CHANTREAU wrote:
In my futur program, it use a lot of binary trees with strings (words) as leaf. There is just arround 1000 words and they will appear a lot of times. The program will possibly consume a lot of process and memory (it is a mathematics proover).
If your strings are indeed leaves, then chances are that you share them a lot and there's nothing much to worry about. For instance, the following complete binary tree data Tree a = Leaf a | Branch (Tree a) (Tree a) example a = tree 100 where tree 0 = Leaf a tree n = let t = tree (n-1) in Branch t t uses only linear space and the argument a is stored exactly once. I'm not sure whether this answers your question, though, also because it depends on what exactly you want. I suggest writing a tiny prototype of your idea and asking on the mailing list again when there are any performance problems. Also, knowing Haskell's evaluation model helps a lot http://en.wikibooks.org/wiki/Haskell/Graph_reduction Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

On Thu, 2009-12-03 at 13:03 +0100, Emmanuel CHANTREAU wrote:
Hello
In my futur program, it use a lot of binary trees with strings (words) as leaf. There is just arround 1000 words and they will appear a lot of times. The program will possibly consume a lot of process and memory (it is a mathematics proover).
I began this program in C++ but haskell has a prety good speed and memory footprint and is easier. But I don't know if it worth to do this optimization: having a dictionary to translate string words in Int.
The answer depends on the automatic optimizations in GHC, because GHC could compare quickely two strings if it is the same object, so it depends if program generated by GHC have a dictionary (tree) of strings internaly. Someone knows this ?
There's nothing automatic about it. It depends on the implementation of the type of string you are using. For the String type there is no equality short-cut, for ByteString there is. Duncan
participants (14)
-
Alberto G. Corona
-
Alec Berryman
-
Bulat Ziganshin
-
Daniel Fischer
-
David Menendez
-
David Virebayre
-
Duncan Coutts
-
Emmanuel CHANTREAU
-
Gregory Crosswhite
-
Heinrich Apfelmus
-
Holger Siegel
-
Luke Palmer
-
Neil Brown
-
wren ng thornton