
Why does a generator expression need a lot of memory?

Problem

Let's assume that I want to find n**2 for all numbers smaller than 20000000.

General setup for all three variants that I test:

import time, psutil, gc

gc.collect()
mem_before = psutil.virtual_memory()[3]
time1 = time.time()

# (comprehension, generator, function)-code comes here

time2 = time.time()
mem_after = psutil.virtual_memory()[3]

print "Used Mem = ", (mem_after - mem_before)/(1024**2)  # convert bytes to megabytes
print "Calculation time = ", time2 - time1

Three options to calculate these numbers:

1. Creating a list via a comprehension:

x = [i**2 for i in range(20000000)] 

It is slow and consumes a lot of memory:

Used Mem =  1270  # Megabytes
Calculation time =  33.9309999943  # Seconds

2. Creating a generator using '()':

x = (i**2 for i in range(20000000)) 

It is much faster than option 1, but still uses a lot of memory:

Used Mem =  611  # Megabytes
Calculation time =  0.278000116348  # Seconds

3. Defining a generator function (most efficient):

def f(n):
    i = 0
    while i < n:
        yield i**2
        i += 1

x = f(20000000)

Its consumption:

Used Mem =  0  # Megabytes
Calculation time =  0.0  # Seconds
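As a cross-check, the list-vs-generator contrast above can be reproduced on Python 3 (where range is already lazy) using the standard tracemalloc module instead of psutil. This is a sketch, not the original harness; the element count is deliberately small to keep the run fast:

```python
import tracemalloc

def peak_allocated(make):
    """Return (result, peak bytes allocated) while calling make()."""
    tracemalloc.start()
    result = make()
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    return result, peak

n = 200000  # far smaller than the question's 20000000, same effect

squares, list_peak = peak_allocated(lambda: [i**2 for i in range(n)])
gen, gen_peak = peak_allocated(lambda: (i**2 for i in range(n)))

print(list_peak)  # megabytes: every square is stored up front
print(gen_peak)   # a few hundred bytes: only the generator object itself
```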

The questions are:

  1. What's the difference between the first and second solutions? Using () creates a generator, so why does it need a lot of memory?
  2. Is there any built-in function equivalent to my third option?
Asked by EbraHim on May 11 '16.

People also ask

Why are generators memory efficient?

Unlike a list, a generator does not store values. Instead, it knows the current value and how to get the next one. This makes a generator memory-efficient. The syntax of looping through a generator is the same as looping through a list.
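The "knows the current value and how to get the next one" behavior is easy to observe with next() (shown here in Python 3 syntax):

```python
g = (i**2 for i in range(3))  # nothing is computed yet

print(next(g))   # first value, computed on demand
print(next(g))   # second value
print(list(g))   # only the values not yet consumed remain
```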

Do generators consume more memory than lists?

A generator yields one item at a time and produces items only on demand. In a list comprehension, by contrast, Python reserves memory for the whole list. Thus generator expressions are more memory-efficient than lists.
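This size difference can be verified directly with sys.getsizeof (a Python 3 sketch; note it measures only the container object, not the ints inside the list):

```python
import sys

n = 100000
squares_list = [i**2 for i in range(n)]   # all n results stored up front
squares_gen = (i**2 for i in range(n))    # only a small generator object

print(sys.getsizeof(squares_list))  # grows with n (hundreds of KB here)
print(sys.getsizeof(squares_gen))   # small and independent of n
```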

Is a generator more memory efficient?

Generators are memory-efficient ways of processing huge datasets. They process the data incrementally and do not allocate memory to all the results at the same time. They really come in handy when implementing data science pipelines for huge datasets in a resource-constrained environment (in terms of RAM).

Which is more memory efficient iterator or generator?

Implementing our own iterator requires writing a class, as shown earlier; generators don't need classes in Python. Generators are faster than iterators, but iterators are more memory-efficient.
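To illustrate the class-vs-generator comparison, here is a hypothetical Squares iterator class next to an equivalent generator function (Python 3 sketch; both produce the same values):

```python
class Squares:
    """Iterator over i**2 for i < n, written as a class."""
    def __init__(self, n):
        self.i = 0
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.i >= self.n:
            raise StopIteration
        value = self.i ** 2
        self.i += 1
        return value

def squares(n):
    """The same iteration protocol, written as a generator function."""
    for i in range(n):
        yield i ** 2

print(list(Squares(5)) == list(squares(5)))  # both yield 0, 1, 4, 9, 16
```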


1 Answer

  1. As others have pointed out in the comments, range creates a list in Python 2. Hence, it is not the generator per se that uses up the memory, but the range that the generator uses:

    x = (i**2 for i in range(20000000))
    # builds a 2*10**7 element list, not for the squares, but for the bases

    >>> sys.getsizeof(range(100))
    872
    >>> sys.getsizeof(xrange(100))
    40
    >>> sys.getsizeof(range(1000))
    8720
    >>> sys.getsizeof(xrange(1000))
    40
    >>> sys.getsizeof(range(20000000))
    160000072
    >>> sys.getsizeof(xrange(20000000))
    40

    This also explains why your second version (the generator expression) uses around half the memory of the first version (the list comprehension) as the first one builds two lists (for the bases and the squares) while the second only builds one list for the bases.

  2. Using xrange(20000000) thus greatly improves memory usage, as it returns a lazy iterable. This is essentially the built-in, memory-efficient way to iterate over a range of numbers, mirroring your third version (with the added flexibility of start, stop and step arguments):

    x = (i**2 for i in xrange(20000000)) 

    In Python 3, range is essentially what xrange used to be in Python 2. However, the Python 3 range object has some nice features that Python 2's xrange doesn't have, like O(1) slicing, O(1) membership tests, etc.
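Those Python 3 range features are cheap to demonstrate; none of the operations below iterate over the twenty million elements:

```python
r = range(20000000)

print(19999999 in r)   # O(1) membership test, computed arithmetically
print(r[-1])           # O(1) indexing
print(r[10:20:2])      # O(1) slicing: returns another range object
```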

Some references:

  • Python2 xrange docs
  • Python3 range docs
  • Stack Overflow - "Should you always favor xrange() over range()?"
  • Martijn Pieters' excellent answer to "Why is 1000000000000000 in range(1000000000000001) so fast in Python 3?"
Answered by user2390182 on Sep 18 '22.