 

How do I loop through a large dataset in python without getting a MemoryError?

Tags: python, memory

I have a large series of raster datasets representing monthly rainfall over several decades. I've written a script in Python that loops over each raster and does the following:

  1. Converts the raster to a numpy masked array,
  2. Performs lots of array algebra to calculate a new water level,
  3. Writes the result to an output raster,
  4. Repeats.

The script is just a long list of array algebra equations enclosed by a loop statement.

Everything works well if I just run the script on a small part of my data (say 20 years' worth), but if I try to process the whole lot I get a MemoryError. The error doesn't give any more information than that (except it highlights the line in the code at which Python gave up).

Unfortunately, I can't easily process my data in chunks - I really need to be able to do the whole lot at once. This is because, at the end of each iteration, the output (water level) is fed back into the next iteration as the start point.
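
For context, here is a minimal, runnable sketch of the kind of loop described above. The read_raster/write_raster helpers, the file names, and the algebra are all hypothetical stand-ins for whatever the real script does:

import numpy as np
import numpy.ma as ma

# Hypothetical stand-ins for the real raster I/O (GDAL, rasterio, etc.)
def read_raster(month):
    # pretend each monthly raster is a small 3x3 grid of rainfall values
    return ma.masked_invalid(np.full((3, 3), float(month)))

def write_raster(name, array):
    print(name, array.mean())  # stand-in for writing the raster to disk

water_level = ma.zeros((3, 3))                            # starting state

for month in range(1, 13):                                # one pass per monthly raster
    rainfall = read_raster(month)                         # 1. raster -> masked array
    water_level = water_level + 0.5 * rainfall            # 2. array algebra (toy equation)
    write_raster("level_%02d.tif" % month, water_level)   # 3. write the result
    # 4. repeat: water_level carries over into the next iteration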

My understanding of programming is very basic at present, but I thought that all of my objects would just be overwritten on each loop. I (stupidly?) assumed that if the code managed to loop successfully once then it should be able to loop indefinitely without using up more and more memory.

I've tried reading various bits of documentation and have discovered something called the "Garbage Collector", but I feel like I'm getting out of my depth and my brain's melting! Can anyone offer some basic insight into what actually happens to objects in memory when my code loops? Is there a way of freeing up memory at the end of each loop, or is there some more "Pythonic" way of coding which avoids this problem altogether?

James asked Nov 04 '10


People also ask

How to avoid memory Error in Python?

One common fix is to install the 64-bit version of Python. A 64-bit system can address 2⁶⁴ different memory locations, roughly 18 quintillion bytes, whereas a 32-bit Python process is limited to a few gigabytes of address space and can run out long before the machine runs out of RAM. If you have a 64-bit computer, use the 64-bit build of Python to make full use of it.
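
If you're not sure which build you're running, a quick standard-library check from the interpreter is:

import struct
import sys

print(struct.calcsize("P") * 8)  # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)       # True on a 64-bit build of Python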

Can you loop through a number in Python?

To loop through a block of code a specified number of times, we can use the range() function. It returns a sequence of numbers that starts at 0 by default, increments by 1 by default, and stops just before the specified end value.
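
For example:

for i in range(5):          # 0, 1, 2, 3, 4 -- stops before 5
    print(i)

for i in range(2, 10, 3):   # start at 2, step by 3: 2, 5, 8
    print(i)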

What can you iterate over in Python?

These include the string, list, tuple, dict, set, and frozenset types. But these are by no means the only types that you can iterate over. Many objects that are built into Python or defined in modules are designed to be iterable.
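
For example, each of these loops directly over the container's items:

for ch in "abc":                # characters of a string
    print(ch)

for item in [1, 2, 3]:          # elements of a list
    print(item)

for key in {"a": 1, "b": 2}:    # keys of a dict
    print(key)

for value in {4, 5, 6}:         # elements of a set (unordered)
    print(value)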


2 Answers

You don't need to concern yourself with memory management, and especially not with the garbage collector, which has a very specific job (collecting reference cycles) that you most likely aren't even hitting here. Python will always collect the memory it can and reuse it.

There are really only two likely reasons for your problem: either the data you try to load is too big to fit into memory, or your calculations store data somewhere (a list, a dict, something that persists between iterations) and that storage grows and grows. A memory profiler can help you find which it is.
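
As a minimal illustration of the second case, here is a toy loop that keeps every result alive, plus one way to spot the growth with the standard-library tracemalloc module (the accumulating results list is just an example of the kind of hidden growth to look for):

import tracemalloc
import numpy as np

tracemalloc.start()

results = []                       # persists between iterations -- memory keeps climbing
for i in range(100):
    data = np.zeros((500, 500))    # temporary array, reclaimed on each pass
    results.append(data * i)       # ...but this keeps every result alive

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:3]:
    print(stat)                    # top allocation sites, grouped by line number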

Jochen Ritzel answered Oct 25 '22


A quick way to "force" the interpreter to release temporary, loop-only objects is the del statement:

for obj in list_of_obj:
    data = obj.getData()  # temporary object created inside the loop
    do_stuff(data)
    del data              # drop the reference so the memory can be reclaimed

This removes the name binding, so the interpreter can free the temporary object as soon as nothing else references it. NOTE: this does not guarantee the program doesn't leak or consume memory in other parts of the computation; it's just a quick check.
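
Building on the snippet above, you can also ask the collector to run explicitly after dropping the reference. gc.collect() is standard library, but whether it actually helps depends on where the memory is really going, since plain reference counting already frees most objects as soon as del removes the last reference:

import gc

for obj in list_of_obj:
    data = obj.getData()
    do_stuff(data)
    del data      # drop the reference; refcounting frees the object if nothing else holds it
    gc.collect()  # explicitly collect any reference cycles straight away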

uovobw answered Oct 25 '22