
Python string concatenation Idiom. Need Clarification.

From http://jaynes.colorado.edu/PythonIdioms.html

"Build strings as a list and use ''.join at the end. join is a string method called on the separator, not the list. Calling it from the empty string concatenates the pieces with no separator, which is a Python quirk and rather surprising at first. This is important: string building with + is quadratic time instead of linear! If you learn one idiom, learn this one.

Wrong: for s in strings: result += s

Right: result = ''.join(strings)"

I'm not sure why this is true. If I have some strings I want to join, it isn't intuitively better to me to put them in a list and then call ''.join. Doesn't putting them into a list create some overhead? To clarify:

Python Command Line:

>>> str1 = 'Not'
>>> str2 = 'Cool'
>>> str3 = ''.join([str1, ' ', str2])  # The more efficient way **A**
>>> print(str3)
Not Cool
>>> str3 = str1 + ' ' + str2  # The bad way **B**
>>> print(str3)
Not Cool

Is A really linear time and B quadratic time?

Derek Litz asked Nov 29 '22 04:11



1 Answer

Yes. For the example you chose, the importance isn't clear, because you only have a few very short strings, so plain concatenation would probably be faster.

But every time you do a + b with strings in Python, it allocates a new string and then copies all the bytes from a and b into it. If you do this in a loop with lots of strings, the bytes accumulated so far have to be copied again on every iteration, and each time the amount to copy gets longer. This gives the quadratic behaviour. (CPython can sometimes avoid the copy by resizing the string in place, but that is an implementation detail you shouldn't rely on.)
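To put numbers on that, here is a small sketch that only counts characters copied; it doesn't time anything. It assumes the simple model described above, where every + builds a fresh string, and the function names are made up for illustration:

```python
def chars_copied_by_repeated_concat(pieces):
    """Count characters copied by `result += piece` in a loop,
    assuming each concatenation builds a brand new string."""
    copied = 0
    length = 0
    for piece in pieces:
        length += len(piece)   # size of the new string
        copied += length       # old contents plus the new piece are copied into it
    return copied


def chars_copied_by_join(pieces):
    """Count characters copied by ''.join(pieces): each piece is copied once."""
    return sum(len(piece) for piece in pieces)


pieces = ['a'] * 1000
concat_cost = chars_copied_by_repeated_concat(pieces)
join_cost = chars_copied_by_join(pieces)
print(concat_cost)
print(join_cost)
```

For 1000 one-character pieces the loop copies 1 + 2 + ... + 1000 = 500500 characters under this model, while join copies 1000. Ten times as many pieces makes the first number roughly a hundred times bigger and the second only ten times bigger.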

On the other hand, creating a list of strings doesn't copy the contents of the strings - it just copies the references. This is incredibly fast, and runs in linear time. The join method then makes just one memory allocation and copies each string into the correct position only once. This also takes only linear time.
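A quick way to convince yourself that the list holds references rather than copies. This is only a sketch: sys.getsizeof reports the size of the list object itself, and the exact number depends on the interpreter.

```python
import sys

big = 'x' * 1000000           # one large string
pieces = [big, big, big]      # three references to it, not three copies

# Every element is the very same object as `big`.
same_object = all(p is big for p in pieces)

# Size of the list object alone (its slots for references), in bytes.
list_overhead = sys.getsizeof(pieces)

print(same_object)
print(list_overhead < len(big))
```

So the overhead of the list is a handful of pointers per string, no matter how long the strings are.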

So yes, do use the ''.join idiom if you are potentially dealing with a large number of strings. For just two strings it doesn't matter.
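In practice the strings usually come out of a loop rather than sitting in a ready-made list. A minimal sketch of the idiom for that case; render_rows and its input format are hypothetical, just for illustration:

```python
def render_rows(rows):
    """Build 'name=value' lines from (name, value) pairs."""
    parts = []                      # collect the pieces here
    for name, value in rows:
        parts.append(name)
        parts.append('=')
        parts.append(str(value))
        parts.append('\n')
    return ''.join(parts)           # one allocation, each piece copied once


report = render_rows([('a', 1), ('b', 2)])
print(report)
```

Appending to the list is cheap, and the only string that ever gets built is the final one.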

If you need more convincing, try it for yourself by building a string from 10 million characters:

>>> chars = ['a'] * 10000000
>>> r = ''
>>> for c in chars: r += c
>>> print(len(r))

Compared with:

>>> chars = ['a'] * 10000000
>>> r = ''.join(chars)
>>> print(len(r))

The first method takes about 10 seconds. The second takes under 1 second.
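If you'd rather measure than take those numbers on trust, something like the following should work with the standard timeit module. Treat it as a sketch: the input is much smaller than 10M so it finishes quickly, the helper names are my own, and the results depend heavily on your machine and interpreter (CPython in particular can optimise += in some cases, so the gap may be smaller than the figures quoted above).

```python
import timeit


def concat_loop(chars):
    """Build the string with repeated +=."""
    r = ''
    for c in chars:
        r += c
    return r


def join_once(chars):
    """Build the string with a single join."""
    return ''.join(chars)


chars = ['a'] * 100000   # deliberately smaller than the 10M example

# Total time for 5 runs of each approach, in seconds.
loop_seconds = timeit.timeit(lambda: concat_loop(chars), number=5)
join_seconds = timeit.timeit(lambda: join_once(chars), number=5)

print('loop: %.4f s' % loop_seconds)
print('join: %.4f s' % join_seconds)
```

Increase the length of chars to see how each approach scales.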

Mark Byers answered Dec 10 '22 05:12
