Performance comparison of immutable string concatenation between Java and Python

Question

UPDATE: Thanks a lot to Gabe and Glenn for the detailed explanations. The test was not written as a language comparison benchmark, only for my own study of VM optimization techniques.

I ran a simple test to compare the performance of string concatenation in Java and Python.

The test targets the default immutable String object/type in both languages, so I don't use StringBuilder/StringBuffer in the Java test.

The test simply appends to a string 100k times. Java takes ~32 seconds to finish, while Python takes only ~13 seconds for a Unicode string and 0.042 seconds for a non-Unicode string.

I'm a bit surprised by the results. I thought Java would be faster than Python. What optimization does Python use to achieve better performance? Or is the String object in Java designed to be too heavy?

OS: Ubuntu 10.04 x64; JDK: Sun 1.6.0_21; Python: 2.6.5

The Java test was run with -Xms1024m to minimize GC activity.

Java code:

public class StringConcateTest {
    public static void test(int n) {
        long start = System.currentTimeMillis();
        String a = "";
        for (int i = 0; i < n; i++) {
            a = a.concat(String.valueOf(i));
        }
        long end = System.currentTimeMillis();
        System.out.println(a.length() + ", time:" + (end - start));
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            test(1000 * 100);
        }
    }
}

Python code:

import time
def f(n):
    start = time.time()
    a = u'' #remove u to use non Unicode string
    for i in xrange(n):
        a = a + str(i)
    print len(a), 'time', (time.time() - start)*1000.0
for j in xrange(10):
    f(1000 * 100)

Glenn Maynard · Accepted Answer

@Gabe's answer is correct, but needs to be shown clearly rather than hypothesized.

CPython (and probably only CPython) does an in-place string append when it can. There are limitations on when it can do this.

First, it can't do it for interned strings. That's why you'll never see this if you test with a = "testing"; a = a + "testing", because assigning a string literal results in an interned string. You have to create the string dynamically, as this code does with str(12345). (This isn't much of a limitation; once you have done one append, the result is an uninterned string, so if you append string literals in a loop, only the first append is affected.)

Second, Python 2.x only does this for str, not unicode. Python 3.x does do this for Unicode strings. This is strange: it's a major performance difference, a difference in complexity. It discourages using Unicode strings in 2.x, when their use should be encouraged to help the transition to 3.x.

And finally, there can be no other references to the string.

>>> a = str(12345)
>>> id(a)
3082418720
>>> a += str(67890)
>>> id(a)
3082418720

This explains why the non-Unicode version is so much faster in your test than the Unicode version.
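The "no other references" condition can be seen from Python itself. Here is a minimal sketch, assuming CPython; the helper name append_with_alias is made up for illustration. It keeps a second reference alive across the append and checks that the alias still holds the old value, which is exactly why the interpreter cannot resize the string in place in that situation.

```python
def append_with_alias(prefix, suffix):
    """Append suffix to prefix while a second reference to prefix is alive."""
    a = prefix
    alias = a      # second reference: an in-place resize would be visible here
    a += suffix    # so this has to build a new string object
    return a, alias, a is alias


# Build the operands dynamically, as in the question, so they are not literals.
result, alias, same_object = append_with_alias(str(12345), str(67890))
print("%s %s %s" % (result, alias, same_object))
```

The alias keeps the value "12345" and a is alias is False, so immutability is preserved as far as any code holding a reference can tell. The in-place path is only taken when nobody else could observe the change.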

The actual code for this is string_concatenate in Python/ceval.c, and it works for both s1 = s1 + s2 and s1 += s2. The comment on the function _PyString_Resize in Objects/stringobject.c also says explicitly: "The following function breaks the notion that strings are immutable." See also http://bugs.python.org/issue980695.
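To get a rough feel for how often the append really happens in place, you can count how many times id() changes over a loop of appends. This is only a sketch, assuming CPython, where id() is the object's address; count_moves is a hypothetical helper. The count you get depends on the interpreter version and the memory allocator, so treat it as an indication rather than a measurement.

```python
def count_moves(n):
    """Append n - 1 short strings and count how often id() of the result changes."""
    a = str(0)
    moves = 0
    last_id = id(a)
    for i in range(1, n):
        a += str(i)              # candidate for the in-place append
        current_id = id(a)
        if current_id != last_id:
            moves += 1           # the string ended up at a different address
            last_id = current_id
    return len(a), moves


length, moves = count_moves(1000)
print("length=%d moves=%d of %d appends" % (length, moves, 999))
```

When the optimization applies, the string should change address on only a fraction of the appends, namely when the resize has to relocate the buffer. Without it, every append allocates a new object while the old one is still alive, so you would expect the count to equal the number of appends. In Python 2.x, changing str(0) to unicode(0) would be expected to show the latter behavior.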

Tags:

java

performance

python

string

concatenation

Cuper Hector

1 Answers

Glenn Maynard
