
On 03.10.2016 at 01:20, Richard A. O'Keefe wrote:
The Java *compiler* prefers StringBuilder: when you write a string concatenation expression in Java the compiler creates a StringBuilder behind the scenes. I'm counting a class as "preferred" if the compiler *has* to know about it and generates code involving it without the programmer explicitly mentioning it.
By that logic, Haskell's preferred representation of additive types would be the updatable record. Or machine integers would preferably be stored in registers, because that's where every new integer is created, and RAM would be second class... I think that's stretching things too far.
There are more indicators against your theory:
1) During the lifetime of a program, the vast majority of textual data is stored in String objects. StringBuilders are just temporary and are discarded once the String object is built. (That's quantitative, not qualitative.)
2) The compiler does NOT have to know. Straight from the Java spec:
15.18.1. [...]

    To increase the performance of repeated string concatenation, a Java
    compiler may use the StringBuffer class or a similar technique to
    reduce the number of intermediate String objects that are created by
    evaluation of an expression.

Moreover, the entire paragraph is a non-authoritative remark.
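For illustration, the two forms below behave the same, and a concatenation-optimizing compiler is allowed (but not required) to turn something like the first into something like the second. This is a hand-written sketch, not the exact code any particular javac emits (newer JDKs use invokedynamic rather than an explicit StringBuilder):

    class ConcatSketch {
        public static void main(String[] args) {
            String name = "world";
            int count = 3;

            // What the programmer writes:
            String a = "Hello, " + name + "! You have " + count + " messages.";

            // Roughly what a compiler may generate behind the scenes:
            String b = new StringBuilder()
                    .append("Hello, ")
                    .append(name)
                    .append("! You have ")
                    .append(count)
                    .append(" messages.")
                    .toString();

            System.out.println(a.equals(b));   // true
        }
    }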
Even then, Java has its preferred string representation nailed down pretty strongly: a hidden array of 16-bit UTF-16 code units, referenced by a descriptor object (the actual String), immutable.
As already noted, that representation changed internally.
Yes, Java 7 changed that to prevent memory leaks from happening.
And that change is actually relevant to this thread.
I have been thinking about that argument and do not think it is valid in a Java context. Java programmers are used to unexpected performance changes, mostly due to changes in the garbage collector. It's also just a single function that changed behaviour, and definitely not the most common one even if it's pretty important.
The representation that _used_ to be used was (char[] array, offset, length, hash). Amongst other things,
Not really...
this meant that taking a substring cost O(1) time and O(1) space, because you just had to allocate and initialise a new "descriptor object" sharing the underlying array.
"You" never had. This all happened behind the scenes, an implementation detail.
If you are working in a loop like

    while (there is more input) {
        read a chunk of input
        split it into substrings
        process some of the substrings
    }

the pre-Java-1.7 representation is perfect. If you *retain* some of the substrings, however, you retain the whole chunk. That was easy to fix by doing

    retain(new String(someSubstring))

instead of

    retain(someSubstring)

but you had to *know* to do it.
Okay, now I get the point. It's a pretty specialized kind of code, though. Usually you don't care much about how much of the input you retain, because more than 50% of the input strings are retained anyway (if you retain strings at all). It did have the potential for a memory leak, but now we're getting into a pretty special corner case. Plus, it still does not change the fact that String is the standard representation in Java, not StringBuffer nor byte[]. The programmer(!) isn't confused about selecting which one, and that was the point originally made. Diving into implementation details just to prove that wrong isn't going to change the fact that the impression "Java's string representations are confusing" was just the result of first impressions without actual practice.
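To make the corner case concrete, here is a minimal sketch of the pattern under discussion; the retained list, the process() helper, and the input format are made up for illustration. Under the pre-1.7 representation, each retained substring shared the whole chunk's char[] and so kept the entire chunk alive; copying via new String(...) was the workaround (on current JDKs substring() already copies, so the extra copy is redundant):

    import java.util.ArrayList;
    import java.util.List;

    class RetentionSketch {
        // Hypothetical long-lived store, standing in for whatever the program keeps.
        static final List<String> retained = new ArrayList<>();

        static void process(String chunk) {
            for (String field : chunk.split(",")) {
                if (field.startsWith("keep:")) {
                    // Pre-Java-1.7: 'field' shared chunk's whole backing array,
                    // so retaining it pinned the entire chunk in memory.
                    // The fix: copy just the characters actually needed.
                    retained.add(new String(field));   // instead of retained.add(field)
                }
            }
        }

        public static void main(String[] args) {
            process("keep:a,drop:xxxxxxxxxxxxxxxxxxxxxxxx,keep:b");
            System.out.println(retained);   // [keep:a, keep:b]
        }
    }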
(Another solution would be to have a smarter garbage collector that knew about string sharing and could compact strings. I wrote such a collector for XPL many years ago. It's quite easy to do a stop-and-copy garbage collector that does that. But that's not the state of the art in Java garbage collection,
Agreed.
and I'm not sure how well string compaction would fit into a more advanced collector.)
Since Java's standard use case is long-running server programs, most if not all Java GCs are copying collectors nowadays. So, this would be a good fit in principle. It might have unfavorable trade-offs with other use cases though. It's quite possible that they implemented this, benchmarked it, and found they couldn't get it up to competitive speed.
The point is that there is no one-size-fits-all string representation; being given only one forces you to either write your own additional representation(s) or to use a representation which is not really suited to your particular purpose.
I haven't read anybody complaining about Java's string representation yet. That does not mean that nobody does (I'm pretty sure there are complaints), it just doesn't concern people much in practice. Most Java programmers don't deal with this; they use a library like JAXML or Jackson for parsing (XML and JSON, respectively), get good-enough performance, and move on.

Some people used to complain that 16-bit characters are a waste of memory, but even that isn't considered a big problem; essentially, the alternatives are out of sight and out of mind.

(It would be interesting to see what happens in a language where the standard string representation is UTF-8. Given that covering all of Unicode now takes at least three bytes per code point, the UTF-16 advantage of "character count = storage cell count" has vanished anyway.)
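To illustrate that last point with a toy example: a single code point outside the Basic Multilingual Plane already occupies two of Java's 16-bit chars (a surrogate pair), so length() is no longer a character count:

    import java.nio.charset.StandardCharsets;

    class CodePointSketch {
        public static void main(String[] args) {
            String s = "\uD834\uDD1E";   // U+1D11E MUSICAL SYMBOL G CLEF, one code point
            System.out.println(s.length());                        // 2 (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length()));   // 1 (code point)
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);   // 4 (bytes in UTF-8)
        }
    }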