Why is 'new_file += line + string' so much faster than 'new_file = new_file + line + string'? [duplicate]

Our code takes 10 minutes to siphon through 68,000 records when we use:

new_file = new_file + line + string

However when we do the following it takes just 1 second:

new_file += line + string

Here is the code:

import time
import cmdbre

fname = "STAGE050.csv"
regions = cmdbre.regions
start_time = time.time()
with open(fname) as f:
        content = f.readlines()
        new_file_content = ""
        new_file = open("CMDB_STAGE060.csv", "w")
        row_region = ""
        i = 0
        for line in content:
                if (i==0):
                        new_file_content = line.strip() + "~region" + "\n"
                else:
                        country = line.split("~")[13]
                        try:
                                row_region = regions[country]
                        except KeyError:
                                row_region = "Undetermined"
                        new_file_content += line.strip() + "~" + row_region + "\n"
                print (row_region)
                i = i + 1
        new_file.write(new_file_content)
        new_file.close()
        end_time = time.time()
        print("total time: " + str(end_time - start_time))

All the code I've ever written in Python uses the first option. This is just basic string manipulation: we are reading input from a file, processing it, and writing it out to the new file. I am 100% certain that the first method takes roughly 600 times longer to run than the second, but why?

The file being processed is a CSV, but it uses ~ instead of a comma. All we are doing here is taking this CSV, which has a column for country, and adding a column for the country's region, e.g. LAC, EMEA, NA, etc. cmdbre.regions is just a dictionary, with all ~200 countries as keys and each region as the value.
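It looks something like this (representative entries only, the real module has the full mapping):

regions = {
    "United States": "NA",
    "Brazil": "LAC",
    "Germany": "EMEA",
    # ... representative entries only; the real module maps roughly 200 countries
}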

Once I changed to the append string operation... the loop completed in 1 second instead of 10 minutes... 68,000 records in the csv.

asked Dec 06 '16 by gunslingor

2 Answers

CPython (the reference interpreter) has an optimization for in-place string concatenation (when the string being appended to has no other references). It can't apply this optimization as reliably when doing +, only += (+ involves two live references, the assignment target and the operand, and the former isn't involved in the + operation, so it's harder to optimize it).

You shouldn't rely on this though, per PEP 8:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).

For example, do not rely on CPython's efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn't present at all in implementations that don't use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

Update based on question edits: Yeah, you broke the optimization. You concatenated many strings, not just one, and Python evaluates left-to-right, so it must do the left-most concatenation first. Thus:

new_file_content += line.strip() + "~" + row_region + "\n"

is not at all the same as:

new_file_content = new_file_content + line.strip() + "~" + row_region + "\n"

because the former concatenates all the new pieces together first, then appends them to the accumulator string all at once, while the latter must evaluate each addition from left to right: the intermediate results are temporaries that never get stored back into new_file_content, so the in-place optimization can't apply, and the very first addition copies all of new_file_content every time. Adding parens for clarity, it's like you did:

new_file_content = (((new_file_content + line.strip()) + "~") + row_region) + "\n"

Because it doesn't actually know the types until it reaches them, it can't assume all of those are strings, so the optimization doesn't kick in.

If you changed the second bit of code to:

new_file_content = new_file_content + (line.strip() + "~" + row_region + "\n")

or slightly slower, but still many times faster than your slow code because it keeps the CPython optimization:

new_file_content = new_file_content + line.strip()
new_file_content = new_file_content + "~"
new_file_content = new_file_content + row_region
new_file_content = new_file_content + "\n"

so that the accumulation was obvious to CPython, you'd fix the performance problem. But frankly, you should just be using += any time you're performing a logical append operation like this; += exists for a reason, and it provides useful information to both the maintainer and the interpreter. Beyond that, it's good practice as far as DRY goes: why name the variable twice when you don't need to?
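If you want to see the difference for yourself, here's a quick timeit sketch (the data is synthetic, purely for illustration, and the absolute numbers will vary by machine and CPython version, but the chained + version should be dramatically slower than the other two):

import timeit

# Synthetic data purely for illustration; not the asker's real workload.
setup = "pieces = [('x' * 30, 'y' * 10) for _ in range(10000)]"

slow = """
s = ''
for a, b in pieces:
    s = s + a + b      # chained +: the leftmost addition copies the whole accumulator every time
"""

fast = """
s = ''
for a, b in pieces:
    s += a + b         # single append to the accumulator: CPython can usually resize it in place
"""

joined = """
out = []
for a, b in pieces:
    out += (a, b)
s = ''.join(out)       # portable and reliably linear
"""

for label, stmt in (("s = s + a + b", slow), ("s += a + b", fast), ("''.join", joined)):
    print(label, timeit.timeit(stmt, setup=setup, number=3))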

Of course, per the PEP 8 guidelines, even using += here is bad form. In most languages with immutable strings (including most non-CPython Python interpreters), repeated string concatenation is a form of Schlemiel the Painter's Algorithm, which causes serious performance problems. The correct solution is to build a list of strings, then join them all at once, e.g.:

    new_file_content = []
    for i, line in enumerate(content):
        if i == 0:
            # In local tests, += with an anonymous tuple runs faster than
            # concatenating the short strings and then calling append.
            # Python caches small tuples, so creating them is cheap, and
            # using syntax over method calls is also optimized more heavily.
            new_file_content += (line.strip(), "~region\n")
        else:
            country = line.split("~")[13]
            try:
                row_region = regions[country]
            except KeyError:
                row_region = "Undetermined"
            new_file_content += (line.strip(), "~", row_region, "\n")

    # Finished accumulating; make the final string all at once
    new_file_content = "".join(new_file_content)

which is usually faster even when the CPython string concatenation optimizations are available, and will be reliably fast on non-CPython Python interpreters as well, because it uses a mutable list to accumulate results efficiently, then lets ''.join precompute the total length of the string, allocate the final string all at once (instead of incremental resizes along the way), and populate it exactly once.

Side-note: For your specific case, you shouldn't be accumulating or concatenating at all. You've got an input file and an output file, and can process line by line. Every time you would append or accumulate file contents, just write them out instead (I've cleaned up the code a bit for PEP8 compliance and other minor style improvements while I was at it):

start_time = time.monotonic()  # You're on Py3, monotonic is more reliable for timing

# Use with statements for both input and output files
with open(fname) as f, open("CMDB_STAGE060.csv", "w") as new_file:
    # Iterate input file directly; readlines just means higher peak memory use
    # Maintaining your own counter is silly when enumerate exists
    for i, line in enumerate(f):
        if not i:
            # Write to file directly, don't store
            new_file.write(line.strip() + "~region\n")
        else:
            country = line.split("~")[13]
            # .get exists to avoid try/except when you have a simple, constant default
            row_region = regions.get(country, "Undetermined")
            # Write to file directly, don't store
            new_file.write(line.strip() + "~" + row_region + "\n")
end_time = time.monotonic()
# Print will stringify arguments and separate by spaces for you
print("total time:", end_time - start_time)

Implementation details deep dive

For those curious about implementation details, the CPython string concat optimization is implemented in the bytecode interpreter, not on the str type itself (technically, PyUnicode_Append does the mutation optimization, but it requires help from the interpreter to fix up reference counts so it knows it can use the optimization safely; without interpreter help, only C extension modules would ever benefit from that optimization).

When the interpreter detects that both operands are the Python level str type (at C layer, in Python 3, it's still referred to as PyUnicode, a legacy of 2.x days that wasn't worth changing), it calls a special unicode_concatenate function, which checks whether the very next instruction is one of three basic STORE_* instructions. If it is, and the target is the same as the left operand, it clears the target reference so PyUnicode_Append will see only a single reference to the operand, allowing it to invoke the optimized code for a str with a single reference.

This means that not only can you break the optimization by doing

a = a + b + c

you can also break it any time the variable in question is not a top-level (global, nested, or local) name. If you're operating on an object attribute, a list index, a dict value, etc., even += won't help you: the interpreter won't see a "simple STORE", so it doesn't clear the target reference, and all of these get the ultra-slow, not-in-place behavior:

foo.x += mystr
foo[0] += mystr
foo['x'] += mystr
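You can see the distinction with dis if you're curious; a minimal sketch (exact opcode names vary by CPython version, e.g. 3.11+ shows BINARY_OP instead of INPLACE_ADD, but the store that follows is what matters):

import dis

def simple_name(a, b):
    a += b        # the in-place add is immediately followed by a plain STORE_FAST 'a'
    return a

def attr_target(obj, b):
    obj.x += b    # the result is stored via STORE_ATTR, not a simple name store, so no help
    return obj

dis.dis(simple_name)
dis.dis(attr_target)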

It's also specific to the str type; in Python 2, the optimization doesn't help with unicode objects, and in Python 3, it doesn't help with bytes objects, and in neither version will it optimize for subclasses of str; those always take the slow path.

Basically, the optimization is there to be as nice as possible in the simplest common cases for people new to Python, but it's not going to go to serious trouble for even moderately more complex cases. This just reinforces the PEP 8 recommendation: Depending on implementation details of your interpreter is a bad idea when you could run faster on every interpreter, for any store target, by doing the right thing and using str.join.

answered Nov 12 '22 by ShadowRanger


Actually, both could be equally slow; the difference comes down to an optimization that is an implementation detail of the official Python runtime (CPython).

Strings in Python are immutable - that means that when you do "str1 + str2", Python has to create a third string object and copy all the contents of both str1 and str2 into it, no matter how large either part is.
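A tiny sketch of what that means in practice - as soon as anything else holds a reference to the string, even += has to build a brand-new object:

s1 = "hello "
s2 = s1            # a second reference to the same string object
s1 += "world"
print(s1 is s2)    # False: += had to create a new string
print(s2)          # still "hello " - the original object was never mutated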

The in-place operator allows Python to use some internal optimizations so that the data in str1 does not necessarily have to be copied again - and it probably even reserves some buffer space for further concatenation operations.

Once you get a feel for how the language works, the way to build a large body of text from small strings is to collect all the strings in a Python list and, after the loop is over, make a single call to the str.join method, passing in all the components. That will be consistently fast across Python implementations, and it does not depend on any optimization being triggered.

output = []
for ...:
    output.append(line)

new_file = "\n".join(output)
answered Nov 12 '22 by jsbueno