My concern is the following: I am storing a relativity large dataset in a classical python list and in order to process the data I must iterate over the list several times, perform some operations on the elements, and often pop an item out of the list.
It seems that deleting one item out of a Python list costs O(N) since Python has to copy all the items above the element at hand down one place. Furthermore, since the number of items to delete is approximately proportional to the number of elements in the list this results in an O(N^2) algorithm.
I am hoping to find a solution that is cost effective (time and memory-wise). I have studied what I could find on the internet and have summarized my different options below. Which one is the best candidate ?
while processingdata:
index = 0
while index < len(somelist):
item = somelist[index]
dosomestuff(item)
if somecondition(item):
del somelist[index]
else:
index += 1
This is the original solution I came up with. Not only is this not very elegant, but I am hoping there is better way to do it that remains time and memory efficient.
while processingdata:
for i in xrange(len(somelist) - 1, -1, -1):
dosomestuff(item)
if somecondition(somelist, i):
somelist.pop(i)
This avoids incrementing an index variable but ultimately has the same cost as the original version. It also breaks the logic of dosomestuff(item) that wishes to process them in the same order as they appear in the original list.
while processingdata:
for i, item in enumerate(somelist):
dosomestuff(item)
newlist = []
for item in somelist:
if somecondition(item):
newlist.append(item)
somelist = newlist
gc.collect()
This is a very naive strategy for eliminating elements from a list and requires lots of memory since an almost full copy of the list must be made.
while processingdata:
for i, item in enumerate(somelist):
dosomestuff(item)
somelist[:] = [x for x in somelist if somecondition(x)]
This is very elegant but under-the-cover it walks the whole list one more time and must copy most of the elements in it. My intuition is that this operation probably costs more than the original del statement at least memory wise. Keep in mind that somelist can be huge and that any solution that will iterate through it only once per run will probably always win.
while processingdata:
for i, item in enumerate(somelist):
dosomestuff(item)
somelist = filter(lambda x: not subtle_condition(x), somelist)
This also creates a new list occupying lots of RAM.
from itertools import ifilterfalse
while processingdata:
for item in itertools.ifilterfalse(somecondtion, somelist):
dosomestuff(item)
This version of the filter call does not create a new list but will not call dosomestuff on every item breaking the logic of the algorithm. I am including this example only for the purpose of creating an exhaustive list.
while processingdata:
index = 0
for item in somelist:
dosomestuff(item)
if not somecondition(item):
somelist[index] = item
index += 1
del somelist[index:]
This is a subtle method that seems cost effective. I think it will move each item (or the pointer to each item ?) exactly once resulting in an O(N) algorithm. Finally, I hope Python will be intelligent enough to resize the list at the end without allocating memory for a new copy of the list. Not sure though.
class Doubly_Linked_List:
def __init__(self):
self.first = None
self.last = None
self.n = 0
def __len__(self):
return self.n
def __iter__(self):
return DLLIter(self)
def iterator(self):
return self.__iter__()
def append(self, x):
x = DLLElement(x)
x.next = None
if self.last is None:
x.prev = None
self.last = x
self.first = x
self.n = 1
else:
x.prev = self.last
x.prev.next = x
self.last = x
self.n += 1
class DLLElement:
def __init__(self, x):
self.next = None
self.data = x
self.prev = None
class DLLIter:
etc...
This type of object resembles a python list in a limited way. However, deletion of an element is guaranteed O(1). I would not like to go here since this would require massive amounts of code refactoring almost everywhere.
When you create a list object, the list object by itself takes 64 bytes of memory, and each item adds 8 bytes of memory to the size of the list because of references to other objects.
As we know, Python uses the dynamic memory allocation which is managed by the Heap data structure. Memory Heap holds the objects and other data structures that will be used in the program. Python memory manager manages the allocation or de-allocation of the heap memory space through the API functions.
The programmer has to manually allocate memory before it can be used by the program and release it when the program no longer needs it. In Python, memory management is automatic! Python automatically handles the allocation and deallocation of memory.
Without knowing the specifics of what you're doing with this list, it's hard to know exactly what would be best in this case. If your processing stage depends on the current index of the list element, this won't work, but if not, it appears you've left off the most Pythonic (and in many ways, easiest) approach: generators.
If all you're doing is iterating over each element, processing it in some way, then either including that element in the list or not, use a generator. Then you never need to store the entire iterable in memory.
def process_and_generate_data(source_iterable):
for item in source_iterable:
dosomestuff(item)
if not somecondition(item):
yield item
You would need to have a processing loop that dealt with persisting the processed iterable (writing it back to a file, or whatever), or if you have multiple processing stages you'd prefer to separate into different generators you could have your processing loop pass one generator to the next.
From your description it sounds like a deque ("deck") would be exactly what you are looking for:
http://docs.python.org/library/collections.html#deque-objects
"Iterate" across it by repeatedly calling pop() and then, if you want to keep the popped item in the deque, returning that item to the front with appendleft(item). To keep up with when you're done iterating and have seen everything in the deque, either put in a marker object like None that you watch for, or just ask for the deque's len() when you start a particular loop and use range() to pop() exactly that many items.
I believe you will find all of the operations you need are then O(1).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With