
Python function slows down with presence of large list

I was testing the speeds of a few different ways to do complex iterations over some of my data, and I found something weird. It seems that having a large list local to some function slows down that function considerably, even if it is not touching that list. For example, creating 2 independent lists via 2 instances of the same generator function is about 2.5x slower the second time. If the first list is removed prior to creating the second, both iterators go at the same speed.

def f():  
    l1, l2 = [], []  
    for c1, c2 in generatorFxn():  
        l1.append((c1, c2))  
    # destroying l1 here fixes the problem 
    for c3, c4 in generatorFxn():  
        l2.append((c3, c4))

The lists end up about 3.1 million items long each, but I saw the same effect with smaller lists too. The first for loop takes about 4.5 seconds to run, the second takes 10.5. If I insert l1 = [] or l1 = len(l1) at the comment position, both for loops take 4.5 seconds.
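For reference, the timings were taken with a plain wall-clock measurement around each loop. The sketch below shows the setup; the generator is only a stand-in for the real generatorFxn, which isn't shown here:

import time

def generatorFxn(n=3100000):
    # stand-in for the real generator: yields n pairs
    for i in range(n):
        yield i, i

def timed_run():
    l1, l2 = [], []

    start = time.time()
    for c1, c2 in generatorFxn():
        l1.append((c1, c2))
    print("first loop:  %.1f s" % (time.time() - start))

    start = time.time()
    for c3, c4 in generatorFxn():
        l2.append((c3, c4))
    print("second loop: %.1f s" % (time.time() - start))

timed_run()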

Why does the speed of local memory allocation in a function have anything to do with the current size of that function's variables?

EDIT: Disabling the garbage collector fixes everything, so it must have been due to the collector running constantly. Case closed!

asked Apr 28 '11 by DaveTheScientist

3 Answers

When you create that many new objects (3 million tuples), the garbage collector gets bogged down. If you turn off garbage collection with gc.disable(), the issue goes away (and the program runs 4x faster to boot).
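A minimal sketch of that workaround, using the shape of f() from the question (generatorFxn is the question's own generator); re-enabling collection in a finally block keeps the rest of the program behaving normally:

import gc

def f():
    l1, l2 = [], []
    gc.disable()      # stop the cyclic collector from repeatedly scanning the growing lists
    try:
        for c1, c2 in generatorFxn():
            l1.append((c1, c2))
        for c3, c4 in generatorFxn():
            l2.append((c3, c4))
    finally:
        gc.enable()   # restore normal collection once the lists are built
    return l1, l2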

answered Oct 20 '22 by Luke


It's impossible to say without more detailed instrumentation.

As a very, very preliminary step, check your main memory usage. If your RAM is all filled up and your OS is paging to disk, your performance will be quite dreadful. In such a case, you may be best off taking your intermediate products and putting them somewhere other than in memory. If you only need sequential reads of your data, consider writing to a plain file; if your data follows a strict structure, consider persisting into a relational database.
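As a rough illustration of the database suggestion, the pairs could be streamed straight into SQLite instead of a list. The pairs.db filename and table layout are placeholders, and generatorFxn is the question's generator:

import sqlite3

def persist_pairs(db_path="pairs.db"):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS pairs (c1, c2)")
    # executemany consumes the generator row by row, so the ~3 million pairs
    # never have to sit in a Python list all at once
    conn.executemany("INSERT INTO pairs VALUES (?, ?)", generatorFxn())
    conn.commit()
    conn.close()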

answered Oct 20 '22 by Wesley


My guess is that when the first list is made, there is more memory available, meaning less chance that the list needs to be reallocated as it grows.

After you take up a decent chunk of memory with the first list, your second list has a higher chance of needing to be reallocated as it grows, since Python lists are dynamically sized.
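That growth pattern is easy to observe: CPython over-allocates a list's backing array and only reallocates when the spare capacity runs out, which sys.getsizeof makes visible (the exact sizes vary by Python version and platform):

import sys

l = []
last = sys.getsizeof(l)
print("len %2d: %d bytes" % (len(l), last))
for i in range(32):
    l.append(i)
    size = sys.getsizeof(l)
    if size != last:
        # a jump in size means the backing array was reallocated
        print("len %2d: %d bytes" % (len(l), size))
        last = size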

answered Oct 20 '22 by amillerrhodes