
How to destroy Python objects and free up memory

I am trying to iterate over 100,000 images, capture some image features and store the resulting DataFrame on disk as a pickle file.

Unfortunately, due to RAM constraints, I am forced to split the images into chunks of 20,000 and perform the operations on each chunk before saving the results to disk.

The code below is supposed to save the DataFrame of results for 20,000 images before starting the loop to process the next 20,000 images.

However, this does not seem to solve my problem, as the memory is not released from RAM at the end of the first for loop.

So somewhere around the 50,000th record, the program crashes with an out-of-memory error.

I tried deleting the objects after saving them to disk and invoking the garbage collector, but the RAM usage does not seem to go down.

What am I missing?

from multiprocessing.pool import ThreadPool
import gc
import pandas as pd

# file_list_1 contains 100,000 images
file_list_chunks = list(divide_chunks(file_list_1, 20000))
for count, f in enumerate(file_list_chunks):
    # make the Pool of workers
    pool = ThreadPool(64)
    results = pool.map(get_image_features, f)
    # close the pool and wait for the work to finish
    list_a, list_b = zip(*results)
    df = pd.DataFrame({'filename': list_a, 'image_features': list_b})
    df.to_pickle("PATH_TO_FILE" + str(count) + ".pickle")
    del list_a
    del list_b
    del df
    gc.collect()
    pool.close()
    pool.join()
    print("pool closed")
asked May 14 '19 by Thalish Sajeed


1 Answer

Now, it could be that something in the 50,000th record is very large, and that's causing the OOM, so to test this I'd first try skipping ahead:

file_list_chunks = list(divide_chunks(file_list_1[40000:], 20000))

If it fails again around 10,000 records in (i.e. at the original 50,000th image), that will confirm a single oversized record or that 20k is too big a chunk size; if it again only fails after roughly 50,000 records, there is an issue with the code...


Okay, onto the code...

Firstly, you don't need the explicit list constructor; it's much better in Python to iterate rather than materialize the entire list in memory.

file_list_chunks = list(divide_chunks(file_list_1,20000))
# becomes
file_list_chunks = divide_chunks(file_list_1,20000)
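
For reference, divide_chunks itself can be written as a generator. Your actual helper isn't shown, so this is only a sketch of an assumed implementation:

def divide_chunks(items, chunk_size):
    # hypothetical generator version: yields one chunk at a time
    # instead of building every chunk up front in memory
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]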

I think you might be misusing ThreadPool here. From the documentation for close():

Prevents any more tasks from being submitted to the pool. Once all the tasks have been completed the worker processes will exit.

This reads as if close might still have some tasks running. Although I guess this is safe, it feels a little un-pythonic; it's better to use ThreadPool as a context manager:

with ThreadPool(64) as pool:
    results = pool.map(get_image_features, f)
    # etc.

The explicit dels in Python aren't actually guaranteed to free memory: del only removes a name binding, the object is freed when its reference count drops to zero, and even then CPython may keep the memory in its own pools rather than return it to the operating system.

You should collect after the join/after the with:

with ThreadPool(64) as pool:
    results = pool.map(get_image_features, f)
    pool.close()
    pool.join()
gc.collect()

You could also try chunking this into smaller pieces, e.g. 10,000 or even smaller!
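
Putting those suggestions together, the loop might look something like this; it's only a sketch under the assumptions above (your get_image_features, divide_chunks and file_list_1 are unchanged):

import gc
from multiprocessing.pool import ThreadPool

import pandas as pd

for count, f in enumerate(divide_chunks(file_list_1, 10000)):  # no list(), smaller chunks
    with ThreadPool(64) as pool:
        results = pool.map(get_image_features, f)
        pool.close()
        pool.join()
    list_a, list_b = zip(*results)
    df = pd.DataFrame({'filename': list_a, 'image_features': list_b})
    df.to_pickle("PATH_TO_FILE" + str(count) + ".pickle")
    del results, list_a, list_b, df
    gc.collect()  # collect after the with block
    print("chunk", count, "saved")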


Hammer 1

One thing I would consider doing here, instead of using pandas DataFrames and large lists, is to use a SQL database; you can do this locally with sqlite3:

import sqlite3
conn = sqlite3.connect(':memory:', check_same_thread=False)  # or, use a file e.g. 'image-features.db'

and use the connection as a context manager:

with conn:
    conn.execute('''CREATE TABLE images
                    (filename text, features text)''')

with conn:
    # Insert a row of data
    conn.execute("INSERT INTO images VALUES ('my-image.png','feature1,feature2')")

That way, we won't have to handle the large list objects or DataFrame.

You can pass the connection to each of the threads... you might have to do something a little awkward like:

results = pool.map(get_image_features, zip(itertools.repeat(conn), f))

Then, after the calculation is complete, you can select everything out of the database into whichever format you like, e.g. with pandas.read_sql.
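
To make that concrete, here is a minimal sketch of the worker and the read-back, assuming a single shared connection guarded by a lock; the lock, the extract_features helper and the comma-joined feature string are my assumptions, not your code:

import itertools
import sqlite3
import threading
from multiprocessing.pool import ThreadPool

import pandas as pd

conn = sqlite3.connect(':memory:', check_same_thread=False)
write_lock = threading.Lock()  # serialize writes from the worker threads

with conn:
    conn.execute('CREATE TABLE images (filename text, features text)')

def get_image_features(args):
    conn, filename = args
    features = extract_features(filename)  # hypothetical: however you compute features today
    with write_lock, conn:  # lock the write, commit on exit
        conn.execute('INSERT INTO images VALUES (?, ?)',
                     (filename, ','.join(map(str, features))))

# f is one chunk of filenames, as in your loop
with ThreadPool(64) as pool:
    pool.map(get_image_features, zip(itertools.repeat(conn), f))

df = pd.read_sql('SELECT * FROM images', conn)  # read everything back at the end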


Hammer 2

Use a subprocess here: rather than running this in the same instance of Python, "shell out" to another one.

Since you can pass start and end to Python via sys.argv, you can slice on these:

# main.py
# a for loop to iterate over the chunk boundaries
subprocess.check_call(["python", "chunk.py", "0", "20000"])

# chunk.py start end
start, end = int(sys.argv[1]), int(sys.argv[2])
for count, f in enumerate(file_list_1):
    if count < start or count >= end:
        continue
    # do stuff

That way, the subprocess will properly clean up after Python (there's no way memory leaks can accumulate, since each process is terminated once its chunk is done).
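
For completeness, the driver loop could look like this sketch, assuming 100,000 images in chunks of 20,000 (the chunk.py name and the boundaries are placeholders):

# main.py (sketch)
import subprocess

for start in range(0, 100000, 20000):
    end = start + 20000
    # each chunk runs in a fresh Python process, so all of its memory is
    # released back to the OS when that process exits
    subprocess.check_call(["python", "chunk.py", str(start), str(end)])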


My bet is that Hammer 1 is the way to go. It feels like you're gluing together a lot of data and reading it into Python lists unnecessarily, and using sqlite3 (or some other database) avoids that completely.

answered Oct 01 '22 by Andy Hayden