Python string 'join' is faster (?) than '+', but what's wrong here?

Tags:

performance

python

string

In an earlier post I asked for the most efficient method for mass dynamic string concatenation, and I was advised to use the join method as the best, simplest and fastest way to do so (everyone said that). But while I was playing with string concatenation, I found some weird(?) results. I'm sure something is going on, but I can't quite figure it out. Here is what I did:

I defined these functions:

import timeit
def x():
    s=[]
    for i in range(100):
        # Other codes here...
        s.append("abcdefg"[i%7])
    return ''.join(s)

def y():
    s=''
    for i in range(100):
        # Other codes here...
        s+="abcdefg"[i%7]
    return s

def z():
    s=''
    for i in range(100):
        # Other codes here...
        s=s+"abcdefg"[i%7]
    return s

def p():
    s=[]
    for i in range(100):
        # Other codes here...
        s+="abcdefg"[i%7]
    return ''.join(s)

def q():
    s=[]
    for i in range(100):
        # Other codes here...
        s = s + ["abcdefg"[i%7]]
    return ''.join(s)

I have tried to keep everything other than the concatenation almost the same across the functions. Then I timed them as follows, with the results in comments (using Python 3.1.1 IDLE on a 32-bit Windows machine):

timeit.timeit(x) # 31.54912480500002
timeit.timeit(y) # 23.533029429999942 
timeit.timeit(z) # 22.116181330000018
timeit.timeit(p) # 37.718607439999914
timeit.timeit(q) # 108.60377576499991

This suggests that strng = strng + dyn_strng is the fastest. The differences in times are not that significant (except for the last one), but I want to know why this is happening. Is it because I am using Python 3.1.1, which makes '+' the most efficient? Should I use '+' as an alternative to join? Or have I done something extremely silly? Please explain clearly.

asked Aug 28 '09 20:08

mshsayem

3 Answers

Some of us Python committers, I believe mostly Rigo and Hettinger, went out of their way (on the way to 2.5 I believe) to optimize some special cases of the alas-far-too-common s += something blight, arguing that it was proven that beginners will never be convinced that ''.join is the right way to go and the horrible slowness of the += might be giving Python a bad name. Others of us weren't that hot, because they just couldn't possibly optimize every occurrence (or even just a majority of them) to decent performance; but we didn't feel hotly enough on the issue to try and actively block them.

I believe this thread proves we should have opposed them more sternly. As it is now, they optimized += in a certain hard-to-predict subset of cases to where it can be maybe 20% faster for particular stupid cases than the proper way (which IS still ''.join) -- just a perfect way to trap beginners into pursuing those irrelevant 20% gains by using the wrong idiom... at the cost, once in a while and from their POV out of the blue, of being hit with a performance loss of 200% (or more, since non-linear behavior IS still lurking there just outside of the corners that Hettinger and Rigo prettied up and put flowers in;-) -- one that MATTERS, one that WILL make them miserable. This goes against the grain of Python's "ideally only one obvious way to do it" and it feels to me like we, collectively, have laid a trap for beginners -- the best kind, too... those who don't just accept what they're told by their "betters", but inquisitively go and question and explore.
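A minimal sketch of how fragile that special case can be, assuming CPython: the in-place growth of a string on += is an implementation detail, and as far as I know it only applies when nothing else references the string being extended. The function name `concat` and the `keep_alias` flag are hypothetical, made up for this illustration. Timings depend on machine and interpreter version, so none are quoted here.

```python
import timeit

def concat(parts, keep_alias=False):
    """Build one string from parts using += in a loop."""
    s = ""
    alias = None
    for part in parts:
        if keep_alias:
            # A second reference to the current string. Assumption: on
            # CPython this prevents the in-place resize, forcing a copy.
            alias = s
        s += part
    return s

parts = [str(n) * 99 for n in range(100, 1000)]

# Both variants produce the same string; only the running time may differ.
assert concat(parts) == concat(parts, keep_alias=True) == "".join(parts)

if __name__ == "__main__":
    for flag in (False, True):
        t = timeit.timeit(lambda: concat(parts, keep_alias=flag), number=20)
        print(f"keep_alias={flag}: {t:.4f} s for 20 runs")
```

If the aliased variant comes out markedly slower on your interpreter, that is the kind of out-of-the-blue slowdown described above: an innocent-looking extra reference changes how much copying each += does.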

Ah well -- I give up. OP, @mshsayem, go ahead, use += everywhere, enjoy your irrelevant 20% speedups in trivial, tiny, irrelevant cases, and you'd better enjoy them to the hilt -- because one day, when you can't see it coming, on an IMPORTANT, LARGE operation, you'll be hit smack in the midriff by the oncoming trailer truck of a 200% slowdown (unless you get unlucky and it's a 2000% one;-). Just remember: if you ever feel that "Python is horribly slow", REMEMBER, more likely than not it's one of your beloved loops of += turning around and biting the hand that feeds it.

For the rest of us -- those who understand what it means to say "We should forget about small efficiencies, say about 97% of the time" -- I'll keep heartily recommending ''.join, so we all can sleep in all tranquility and KNOW we won't be hit with a superlinear slowdown when we least expect and least can afford it. But for you, Armin Rigo, and Raymond Hettinger (the last two, dear personal friends of mine, BTW, not just co-committers;-) -- may your += be smooth and your big-O's never worse than N!-)

So, for the rest of us, here's a more meaningful and interesting set of measurements:

$ python -mtimeit -s'r=[str(x)*99 for x in xrange(100,1000)]' 's="".join(r)'
1000 loops, best of 3: 319 usec per loop

That's 900 strings of 297 chars each. Joining the list directly is of course fastest, but the OP is terrified about having to do appends before then. But:

$ python -mtimeit -s'r=[str(x)*99 for x in xrange(100,1000)]' 's=""' 'for x in r: s+=x'
1000 loops, best of 3: 779 usec per loop

$ python -mtimeit -s'r=[str(x)*99 for x in xrange(100,1000)]' 'z=[]' 'for x in r: z.append(x)' '"".join(z)'
1000 loops, best of 3: 538 usec per loop

...with a semi-important amount of data (a very few 100's of KB -- taking a measurable fraction of a millisecond every which way), even plain good old .append is already superior. In addition, it's obviously and trivially easy to optimize:

$ python -mtimeit -s'r=[str(x)*99 for x in xrange(100,1000)]' 'z=[]; zap=z.append' 'for x in r: zap(x)' '"".join(z)'
1000 loops, best of 3: 438 usec per loop

shaving another tenth of a millisecond off the average looping time. Everybody (at least everybody who's totally obsessed about performance) obviously knows that HOISTING (taking OUT of the inner loop a repetitive computation that would be otherwise performed over and over) is a crucial technique in optimization -- Python doesn't hoist on your behalf, so you have to do your own hoisting in those rare occasions where every microsecond matters.
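For readers on Python 3, where xrange no longer exists, the same comparison can be sketched with the timeit module instead of the command line. The function names and the NUMBER constant below are hypothetical. Absolute numbers will differ from the ones quoted above, so no expected output is shown.

```python
import timeit

# Same data as in the measurements above: 900 strings of 297 characters each.
r = [str(x) * 99 for x in range(100, 1000)]

def join_direct():
    return "".join(r)

def plus_equals():
    s = ""
    for x in r:
        s += x
    return s

def append_then_join():
    z = []
    for x in r:
        z.append(x)
    return "".join(z)

def hoisted_append_then_join():
    z = []
    zap = z.append  # look the bound method up once, outside the loop
    for x in r:
        zap(x)
    return "".join(z)

NUMBER = 200  # calls per timing run; kept small so the script finishes quickly

if __name__ == "__main__":
    for fn in (join_direct, plus_equals, append_then_join, hoisted_append_then_join):
        # Best of 3 repeats, mirroring the "best of 3" output quoted above.
        best = min(timeit.repeat(fn, number=NUMBER, repeat=3))
        print(f"{fn.__name__}: {best / NUMBER * 1e6:.0f} usec per loop")
```

All four functions return the same string, so any difference in the printed figures comes from how the string is assembled.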

answered Oct 05 '22 00:10

Alex Martelli

As to why q is a lot slower: when you say

l += "a"

you are appending the string "a" to the end of l, but when you say

l = l + ["a"]

you are creating a new list with the contents of l and ["a"] and then reassigning the results back to l. Thus new lists are constantly being generated.
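The difference can be observed directly by checking object identity. A small sketch (the variable names are hypothetical): += on a list extends the existing object in place, while l + [...] builds a new list each time, copying every existing element. Note that l += "a" extends the list with each character of the string, which matches an append here only because the string is one character long.

```python
l = []
original = l

l += "a"                   # in-place extend: still the same list object
assert l is original
assert l == ["a"]

l = l + ["b"]              # builds a brand-new list and rebinds the name
assert l is not original
assert l == ["a", "b"]
assert original == ["a"]   # the old list is left untouched

m = []
m += "xyz"                 # a longer string is added character by character
assert m == ["x", "y", "z"]
```

Because q() copies the whole list on every iteration, its cost grows with the square of the number of items, which is consistent with it being by far the slowest in the question's timings.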

answered Oct 05 '22 02:10

Kathy Van Stone

I assume x() is slower because you're first building the list and then joining it. So you're not only measuring the time that join takes, but also the time that you take to build the list.

In a scenario where you already have a list and you want to create a string out of its elements, join should be faster than iterating through the list and building the string step by step.
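One way to check this assumption is to time the two parts of x() separately. The sketch below uses hypothetical helper names and the same data as the question. It prints whatever your machine measures, so no figures are claimed here.

```python
import timeit

def build_list():
    # The list-building part of x() from the question.
    s = []
    for i in range(100):
        s.append("abcdefg"[i % 7])
    return s

prebuilt = build_list()

def join_only():
    # The join part alone, on a list that already exists.
    return "".join(prebuilt)

def build_and_join():
    # Equivalent to x() from the question.
    return "".join(build_list())

if __name__ == "__main__":
    for fn in (build_list, join_only, build_and_join):
        t = timeit.timeit(fn, number=100_000)
        print(f"{fn.__name__}: {t:.3f} s per 100,000 calls")
```

Comparing the join_only figure with the build_and_join figure shows how much of x()'s time is spent on appends rather than on the join itself.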

answered Oct 05 '22 02:10

sepp2k
