
Ketil Malde wrote:
[LOC vs gz as a program complexity metric]
Do either of those make sense as a "program /complexity/ metric"? Seems to me that's reading a lot more into those measurements than we should. It's slightly interesting that, while we're happily opining about LOCs and gz, no one has even tried to show that switching from LOCs to gz made a big difference in those "program bulk" rankings, or even provided a specific example that they feel shows how gz is misrepresentative - all opinion, no data. (Incidentally, LOC measures source code "shape" as much as anything else - programs in statement-heavy languages tend to be longer and thinner, and programs in expression-heavy languages tend to be shorter and wider.)
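(To make the "shape" point concrete, a minimal Haskell sketch - not from the thread - that reports a source file's line count and mean line width; a statement-heavy program scores high on the first number, an expression-heavy one on the second:)

    -- Report a crude "shape" for a source string: how long (lines)
    -- and how wide (mean characters per line) the program is.
    shape :: String -> (Int, Double)
    shape src = (length ls, avg (map length ls))
      where
        ls     = lines src
        avg xs = fromIntegral (sum xs) / fromIntegral (max 1 (length xs))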

On Friday 02 November 2007 19:03, Isaac Gouy wrote:
It's slightly interesting that, while we're happily opining about LOCs and gz, no one has even tried to show that switching from LOCs to gz made a big difference in those "program bulk" rankings, or even provided a specific example that they feel shows how gz is misrepresentative - all opinion, no data.
Why gzip and not run-length encoding, Huffman coding, arithmetic coding, block sorting, PPM, etc.? Choosing gzip is completely subjective and there is no logical reason to think that gzipped byte count reflects anything of interest. Why waste any time studying results in such an insanely stupid metric? Best case you'll end up concluding that the added complexity had no adverse effect on the results. In contrast, LOC has obvious objective merits: it reflects the amount of code the developer wrote and the amount of code the developer can see whilst reading code. -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?e

--- Jon Harrop
In contrast, LOC has obvious objective merits: it reflects the amount of code the developer wrote and the amount of code the developer can see whilst reading code.
How strange that you've snipped out the source code shape comment that would undermine what you say - obviously LOC doesn't tell you anything about how much stuff is on each line, so it doesn't tell you about the amount of code that was written or the amount of code the developer can see whilst reading code.

On 11/2/07, Isaac Gouy
How strange that you've snipped out the source code shape comment that would undermine what you say - obviously LOC doesn't tell you anything about how much stuff is on each line, so it doesn't tell you about the amount of code that was written or the amount of code the developer can see whilst reading code.
It still tells you how much content you can see in a given amount of vertical space. I think the point, however, is that while LOC is not perfect, gzip is worse. It's completely arbitrary and favours languages which require you to write tons of bookkeeping (semantic noise), as it will compress all that redundancy down quite a bit (while the programmer still has to write it, and maintain it). So gzip is even less useful than LOC, as it actively *hides* the very thing you're trying to measure! You might as well remove it altogether. Or, as has been suggested, count the number of words in the program. Again, not perfect (it's possible in some languages to write things which have no whitespace but are still lots of tokens). -- Sebastian Sylvan +44(0)7857-300802 UIN: 44640862
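(For concreteness, a minimal Haskell sketch of the three bulk metrics being debated - non-empty lines, whitespace-separated words, and gzipped bytes - assuming the zlib package; the input filename is hypothetical:)

    import qualified Data.ByteString.Lazy.Char8 as L
    import Codec.Compression.GZip (compress)
    import Data.Int (Int64)

    -- Non-empty lines and whitespace-separated words in a source file.
    loc, wc :: L.ByteString -> Int
    loc = length . filter (not . L.null) . L.lines
    wc  = length . L.words

    -- Size of the gzip-compressed source, in bytes.
    gz :: L.ByteString -> Int64
    gz = L.length . compress

    main :: IO ()
    main = do
      src <- L.readFile "Program.hs"  -- hypothetical input file
      print (loc src, wc src, gz src)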

--- Sebastian Sylvan
It still tells you how much content you can see in a given amount of vertical space.
And why would we care about that? :-)
I think the point, however, is that while LOC is not perfect, gzip is worse.
How do you know?
Best case you'll end up concluding that the added complexity had no adverse effect on the results.
Best case would be seeing that the results were corrected against bias in favour of long lines, and ranked programs in a way that looks right when we look at the program source code side-by-side.
It's completely arbitrary and favours languages which require you to write tons of bookkeeping (semantic noise), as it will compress all that redundancy down quite a bit (while the programmer still has to write it, and maintain it). So gzip is even less useful than LOC, as it actively *hides* the very thing you're trying to measure! You might as well remove it altogether.
I don't think you've looked at any of the gz rankings, or compared the source code for any of the programs :-)
Or, as has been suggested, count the number of words in the program. Again, not perfect (it's possible in some languages to write things which have no whitespace but are still lots of tokens).
Wouldn't that be "completely arbitrary"?

igouy2:
I think the point, however, is that while LOC is not perfect, gzip is worse.
How do you know?
I follow the shootout changes fairly often, and the gzip change didn't significantly alter the rankings, though IIRC it did cause Perl to drop a few places. Really, it's a fine heuristic, given its power-to-weight ratio. -- Don

while LOC is not perfect, gzip is worse.
the gzip change didn't significantly alter the rankings
Currently the gzip ratio of C++ to Python is 2.0, which, at a glance, wouldn't sell me on a "less code" argument. Although the rank stayed the same, did the change reduce the magnitude of the victory? Thanks, Greg

--- Greg Fitzgerald
while LOC is not perfect, gzip is worse.
the gzip change didn't significantly alter the rankings
Currently the gzip ratio of C++ to Python is 2.0, which, at a glance, wouldn't sell me on a "less code" argument.
a) You're looking at an average; instead try http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=python&lang2=gpp
b) We're not trying to sell you on a "less code" argument - it's whatever it is.
Although the rank stayed the same, did the change reduce the magnitude of the victory?
c) That will have varied from program to program - and do you care which way "the magnitude of victory" moved, or do you care that where it moved to makes more sense? For fun, two meteor-contest programs, as ratios to the python-2 program:

            LOC    GZ    WC
    ghc-3  0.98  1.40  1.51
    gpp-4  3.76  4.14  4.22

Look at the python-2 and ghc-3 source and tell us if LOC gave a reasonable indication of relative program size - is ghc-3 really the smaller program? :-) http://shootout.alioth.debian.org/gp4/benchmark.php?test=meteor&lang=all&sort=gz
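(A hedged sketch of how a ratio table like the one above is derived - the triples here are hypothetical (loc, gz, wc) measurements with python-2 as the baseline, not the shootout's actual numbers:)

    -- Ratios of one program's (loc, gz, wc) metrics to a baseline's.
    ratios :: (Int, Int, Int) -> (Int, Int, Int) -> (Double, Double, Double)
    ratios (l, g, w) (bl, bg, bw) = (l // bl, g // bg, w // bw)
      where x // y = fromIntegral x / fromIntegral y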

As I understand it, the question is what you want to measure. gzip is actually pretty good at this, precisely because it removes boilerplate, reducing programs to something approximating their complexity. So a higher gzipped size means, at some level, a more complicated algorithm (in the case of lower-level languages, perhaps, because there's complexity that's not lifted to the compiler). LOC per language, as I understand it, has been somewhat called into question as a measure of productivity, but there's still a correlation between programmer output and LOC across languages, even if it isn't as strong as once thought -- on the other hand, bugs per LOC seems to have been fairly strongly debunked as something constant across languages. If you want a measure of the language as a language, I guess LOC/gzipped is a good ratio for how much "noise" it introduces -- but if you want to measure just pure speed across similar algorithmic implementations, which, as I understand it, is what the shootout is all about, then gzipped actually tends to make some sense. --S
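(One way to read that LOC/gzipped suggestion as code - a minimal Haskell sketch assuming the zlib package; the interpretation of "noise" as lines per gzipped byte is ours, not the shootout's:)

    import qualified Data.ByteString.Lazy.Char8 as L
    import Codec.Compression.GZip (compress)

    -- Lines of code per gzipped byte: boilerplate-heavy source
    -- compresses away more, so a higher ratio suggests more
    -- redundancy per line of what the programmer actually typed.
    noiseRatio :: L.ByteString -> Double
    noiseRatio src = loc / gzBytes
      where
        loc     = fromIntegral (length (L.lines src))
        gzBytes = fromIntegral (L.length (compress src))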

On 11/2/07, Sterling Clover
gzip is actually pretty good at this, precisely because it removes boilerplate, reducing programs to something approximating their complexity.
Lossless file compression, a.k.a. entropy coding, attempts to maximize the amount of information per bit (or byte), getting as close to the entropy as possible. Basically, gzip is measuring (approximating) the amount of "information" contained in the code. I think it would be interesting to compare the ratio between a file's raw size and its entropy (we can come up with a precise metric later). This would show us how concise the language and code actually are. --ryan
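(A sketch of that comparison, assuming a zeroth-order byte model; the function names are ours, and a precise metric would need a better model than raw byte frequencies:)

    import qualified Data.ByteString as B
    import qualified Data.Map.Strict as M

    -- Zeroth-order Shannon entropy of the file, in bits per byte.
    entropyPerByte :: B.ByteString -> Double
    entropyPerByte bs
      | B.null bs = 0
      | otherwise = negate (sum [ p * logBase 2 p | p <- probs ])
      where
        n     = fromIntegral (B.length bs)
        freqs = M.fromListWith (+) [ (w, 1 :: Int) | w <- B.unpack bs ]
        probs = [ fromIntegral c / n | c <- M.elems freqs ]

    -- Raw bits over the entropy lower bound: higher means the raw
    -- text is further from its (byte-model) information content.
    concisenessRatio :: B.ByteString -> Double
    concisenessRatio bs = rawBits / idealBits
      where
        rawBits   = 8 * fromIntegral (B.length bs)
        idealBits = entropyPerByte bs * fromIntegral (B.length bs)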

On Nov 3, 2007 5:00 AM, Ryan Dickie
Lossless file compression, a.k.a. entropy coding, attempts to maximize the amount of information per bit (or byte), getting as close to the entropy as possible. Basically, gzip is measuring (approximating) the amount of "information" contained in the code.
Hmmm, interesting idea.
I think it would be interesting to compare the ratio between a file's raw size and its entropy (we can come up with a precise metric later). This would show us how concise the language and code actually are.
Yeah, let's all write in bytecode using a hex editor :-D

On Friday 02 November 2007 23:53, Isaac Gouy wrote:
Best case you'll end up concluding that the added complexity had no adverse effect on the results.
Best case would be seeing that the results were corrected against bias in favour of long lines, and ranked programs in a way that looks right when we look at the program source code side-by-side.
Why would you want to subjectively "correct" for "bias" in favour of long lines?
Or, as has been suggested, count the number of words in the program. Again, not perfect (it's possible in some languages to write things which have no whitespace but are still lots of tokens).
Wouldn't that be "completely arbitrary"?
That is not an argument in favour of needlessly adding extra complexity and adopting a practically-irrelevant metric. Why not use the byte count of a PNG encoding of a photograph of the source code written out by hand in blue ballpoint pen? -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?e

On Friday 02 November 2007 20:29, Isaac Gouy wrote:
...obviously LOC doesn't tell you anything about how much stuff is on each line, so it doesn't tell you about the amount of code that was written or the amount of code the developer can see whilst reading code.
Code is almost ubiquitously visualized as a long vertical strip. The width is limited by your screen. Code is then read by scrolling vertically. This is why LOC is a relevant measure: because the area of the code is given by LOC * screen width and is largely unrelated to the subjective "amount of stuff on each line". As you say, imperative languages like C are often formatted such that a lot of right-hand screen real estate is wasted. LOC penalizes such wastage. The same cannot be said for gzipped bytes, which is an entirely irrelevant metric... -- Dr Jon D Harrop, Flying Frog Consultancy Ltd. http://www.ffconsultancy.com/products/?e

On 11/2/07, Isaac Gouy
Ketil Malde wrote:
[LOC vs gz as a program complexity metric]
Do either of those make sense as a "program /complexity/ metric"?
You're right! We should be using Kolmogorov complexity instead! I'll go write a program to calculate it for the shootout. Oh wait... Luke

Isaac Gouy
Ketil Malde wrote:
[LOC vs gz as a program complexity metric]
Do either of those make sense as a "program /complexity/ metric"?
Sorry, bad choice of words on my part. -k -- If I haven't seen further, it is by standing in the footprints of giants
participants (10)
- Don Stewart
- Greg Fitzgerald
- Hugh Perkins
- Isaac Gouy
- Jon Harrop
- Ketil Malde
- Luke Palmer
- Ryan Dickie
- Sebastian Sylvan
- Sterling Clover