Python: Memory usage and optimization when modifying lists

The problem

My concern is the following: I am storing a relatively large dataset in a plain Python list, and in order to process the data I must iterate over the list several times, perform some operations on the elements, and often pop an item out of the list.

It seems that deleting an item from a Python list costs O(N), since Python has to shift every element after the deleted one down a place. Furthermore, since the number of items to delete is roughly proportional to the number of elements in the list, this results in an O(N^2) algorithm.
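
For the record, a quick micro-benchmark makes the quadratic blow-up visible. This is only a sketch: the helper name and the odd/even test are illustrative stand-ins for dosomestuff/somecondition, and absolute times will vary by machine.

import timeit

def delete_in_place(n):
    lst = list(range(n))
    i = 0
    while i < len(lst):
        if lst[i] % 2:        # stand-in for somecondition(item)
            del lst[i]        # shifts every later element down: O(N)
        else:
            i += 1

for n in (10000, 20000, 40000):
    t = timeit.timeit(lambda: delete_in_place(n), number=1)
    print("%6d items: %.3f s" % (n, t))   # roughly quadruples as n doubles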

I am hoping to find a solution that is cost-effective, both time- and memory-wise. I have studied what I could find on the internet and have summarized my different options below. Which one is the best candidate?

Keeping a local index:

while processingdata:
    index = 0
    while index < len(somelist):
        item = somelist[index]
        dosomestuff(item)
        if somecondition(item):
            del somelist[index]  # O(N): shifts the whole tail down
        else:
            index += 1  # only advance when nothing was deleted

This is the original solution I came up with. Not only is it not very elegant, but I am hoping there is a better way to do it that remains time and memory efficient.

Walking the list backwards:

while processingdata:
    for i in xrange(len(somelist) - 1, -1, -1):
        item = somelist[i]  # this line was missing from my first draft
        dosomestuff(item)
        if somecondition(item):
            somelist.pop(i)  # safe: indices below i are unaffected

This avoids incrementing an index variable but ultimately has the same cost as the original version. It also breaks the logic of dosomestuff(item), which expects to process items in the same order as they appear in the original list.

Making a new list:

while processingdata:
    for item in somelist:
        dosomestuff(item)
    newlist = []
    for item in somelist:
        if not somecondition(item):  # keep only the survivors
            newlist.append(item)
    somelist = newlist
    gc.collect()  # the old list becomes garbage at this point

This is a very naive strategy for eliminating elements from a list, and it requires lots of memory since an almost-full copy of the list must be made.

Using list comprehensions:

while processingdata:
    for item in somelist:
        dosomestuff(item)
    somelist[:] = [x for x in somelist if not somecondition(x)]

This is very elegant, but under the covers it walks the whole list one more time and must copy most of the elements in it. My intuition is that this probably costs more than the original del statement, at least memory-wise. Keep in mind that somelist can be huge, and any solution that iterates through it only once per run will probably always win.
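
One subtlety worth noting: the slice assignment somelist[:] = ... mutates the existing list object, so any other references to it see the update, whereas plain somelist = ... would only rebind the name. Either way the comprehension builds a full temporary list first. A tiny demo with made-up values:

a = [1, 2, 3, 4]
b = a                            # second reference to the same list object
a[:] = [x for x in a if x % 2]   # in-place: the object's contents change
print(b)                         # [1, 3] -- b sees the update
a = [x for x in a if x > 1]      # rebinding: b still holds the old list
print(b)                         # [1, 3]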

Using the filter function:

while processingdata:
    for item in somelist:
        dosomestuff(item)
    somelist = filter(lambda x: not somecondition(x), somelist)

This also creates a new list occupying lots of RAM.

Using itertools' ifilterfalse function:

from itertools import ifilterfalse

while processingdata:
    for item in ifilterfalse(somecondition, somelist):
        dosomestuff(item)

This version of the filter call does not create a new list, but it will not call dosomestuff on every item, breaking the logic of the algorithm. I am including this example only for the sake of an exhaustive list.

Moving items up the list while walking:

while processingdata:
    index = 0
    for item in somelist:
        dosomestuff(item)
        if not somecondition(item):
            somelist[index] = item  # compact survivors toward the front
            index += 1
    del somelist[index:]  # truncate the leftover tail in one operation

This is a subtle method that seems cost-effective. I think it will move each item (or rather the pointer to each item) exactly once, resulting in an O(N) algorithm. Finally, I hope Python is intelligent enough to resize the list at the end without allocating memory for a new copy of the list. I am not sure, though.
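
For what it is worth, CPython does shrink the underlying allocation when a del-slice drops the length far enough below the allocated capacity, and realloc can usually do that without copying; this is an implementation detail, though, not a language guarantee. A quick, hedged way to observe it:

import sys

lst = list(range(1000000))
print(sys.getsizeof(lst))   # e.g. about 8 MB of pointer slots on 64-bit
del lst[1000:]
print(sys.getsizeof(lst))   # far smaller: the allocation was shrunk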

Abandoning Python lists:

class Doubly_Linked_List:
    def __init__(self):
        self.first = None
        self.last = None
        self.n = 0
    def __len__(self):
        return self.n
    def __iter__(self):
        return DLLIter(self)
    def iterator(self):
        return self.__iter__()
    def append(self, x):
        x = DLLElement(x)
        x.next = None
        if self.last is None:
            x.prev = None
            self.last = x
            self.first = x
            self.n = 1
        else:
            x.prev = self.last
            x.prev.next = x
            self.last = x
            self.n += 1

class DLLElement:
    def __init__(self, x):
        self.next = None
        self.data = x
        self.prev = None

class DLLIter:
    etc...

This type of object resembles a Python list in a limited way, but deletion of an element is guaranteed O(1). I would rather not go here, since this would require massive amounts of code refactoring almost everywhere.
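
For completeness, here is a hedged sketch of the pieces elided above: the DLLIter class and an O(1) unlink. The helper name dll_remove is my own invention, not part of the original code.

class DLLIter:
    """Forward iterator over a Doubly_Linked_List."""
    def __init__(self, dll):
        self.node = dll.first
    def __iter__(self):
        return self
    def next(self):                     # spelled __next__ under Python 3
        if self.node is None:
            raise StopIteration
        data = self.node.data
        self.node = self.node.next
        return data
    __next__ = next

def dll_remove(dll, node):
    # Unlink node in O(1): no neighbours are moved, only relinked.
    if node.prev is None:
        dll.first = node.next
    else:
        node.prev.next = node.next
    if node.next is None:
        dll.last = node.prev
    else:
        node.next.prev = node.prev
    dll.n -= 1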

asked Apr 13 '10 by xApple



2 Answers

Without knowing the specifics of what you're doing with this list, it's hard to know exactly what would be best in this case. If your processing stage depends on the current index of the list element, this won't work, but if not, it appears you've left off the most Pythonic (and in many ways, easiest) approach: generators.

If all you're doing is iterating over each element, processing it in some way, then either including that element in the list or not, use a generator. Then you never need to store the entire iterable in memory.

def process_and_generate_data(source_iterable):
    for item in source_iterable:
        dosomestuff(item)
        if not somecondition(item):
            yield item  # lazily emit survivors; nothing is stored

You would need a processing loop that deals with persisting the processed iterable (writing it back to a file, or whatever). If you have multiple processing stages that you would prefer to separate into different generators, you can have your processing loop pass one generator to the next, as in the sketch below.
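
A minimal sketch of such a loop, reusing process_and_generate_data from above; stage_two, domorestuff, someothercondition and the output file name are illustrative inventions, not anything the question defines:

def stage_two(source_iterable):
    for item in source_iterable:
        domorestuff(item)                # hypothetical second pass
        if not someothercondition(item):
            yield item

with open('survivors.txt', 'w') as out:
    pipeline = stage_two(process_and_generate_data(somelist))
    for item in pipeline:                # only one item is in flight at a time
        out.write('%s\n' % item)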

answered Oct 01 '22 by Jeffrey Harris


From your description it sounds like a deque ("deck") would be exactly what you are looking for:

http://docs.python.org/library/collections.html#deque-objects

"Iterate" across it by repeatedly calling pop() and then, if you want to keep the popped item in the deque, returning that item to the front with appendleft(item). To keep up with when you're done iterating and have seen everything in the deque, either put in a marker object like None that you watch for, or just ask for the deque's len() when you start a particular loop and use range() to pop() exactly that many items.

I believe you will find all of the operations you need are then O(1).
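
A minimal sketch of the scheme just described, borrowing processingdata, dosomestuff and somecondition from the question, and using the len()/range() bookkeeping rather than a sentinel:

from collections import deque

work = deque(somelist)
while processingdata:
    for _ in range(len(work)):      # exactly one full pass over the deque
        item = work.pop()           # O(1) from the right end
        dosomestuff(item)
        if not somecondition(item):
            work.appendleft(item)   # O(1): survivor goes back on the left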

answered Oct 01 '22 by Brandon Rhodes