UPDATES: thanks a lot to Gabe and Glenn for the detailed explanation. The test is wrote not for language comparison benchmark, just for my studying on VM optimization technologies.
I did a simple test to understand the performance of string concatenation between Java and Python.
The test is target for the default immutable String object/type in both languages. So I don't use StringBuilder/StringBuffer in Java test.
The test simply adds strings for 100k times. Java consumes ~32 seconds to finish, while Python only uses ~13 seconds for Unicode string and 0.042 seconds for non Unicode string.
I'm a bit surprise about the results. I thought Java should be faster than Python. What optimization technology does Python leverage to achieve better performance? Or String object is designed too heavy in Java?
OS: Ubuntu 10.04 x64 JDK: Sun 1.6.0_21 Python: 2.6.5
Java test did use -Xms1024m to minimize GC activities.
Java code:
public class StringConcateTest {
public static void test(int n) {
long start = System.currentTimeMillis();
String a = "";
for (int i = 0; i < n; i++) {
a = a.concat(String.valueOf(i));
}
long end = System.currentTimeMillis();
System.out.println(a.length() + ", time:" + (end - start));
}
public static void main(String[] args) {
for (int i = 0; i < 10; i++) {
test(1000 * 100);
}
}
}
Python code:
import time
def f(n):
start = time.time()
a = u'' #remove u to use non Unicode string
for i in xrange(n):
a = a + str(i)
print len(a), 'time', (time.time() - start)*1000.0
for j in xrange(10):
f(1000 * 100)
@Gabe's answer is correct, but needs to be shown clearly rather than hypothesized.
CPython (and probably only CPython) does an in-place string append when it can. There are limitations on when it can do this.
First, it can't do it for interned strings. That's why you'll never see this if you test with a = "testing"; a = a + "testing"
, because assigning a string literal results in an interned string. You have to create the string dynamically, as this code does with str(12345)
. (This isn't much of a limitation; once you do an append this way once, the result is an uninterned string, so if you append string literals in a loop this will only happen the first time.)
Second, Python 2.x only does this for str
, not unicode
. Python 3.x does do this for Unicode strings. This is strange: it's a major performance difference--a difference in complexity. This discourages using Unicode strings in 2.x, when they should be encouraging it to help the transition to 3.x.
And finally, there can be no other references to the string.
>>> a = str(12345)
>>> id(a)
3082418720
>>> a += str(67890)
>>> id(a)
3082418720
This explains why the non-Unicode version is so much faster in your test than the Unicode version.
The actual code for this is string_concatenate
in Python/ceval.c
, and works for both s1 = s1 + s2
and s1 += s2
. The function _PyString_Resize
in Objects/stringobject.c
also says explicitly: The following function breaks the notion that strings are immutable. See also http://bugs.python.org/issue980695.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With