How to free memory after opening a file in Python

I'm opening a 3 GB file in Python to read strings. I then store this data in a dictionary. My next goal is to build a graph using this dictionary so I'm closely monitoring memory usage.

It seems to me that Python loads the whole 3 GB file into memory and I can't get rid of it. My code looks like this:

import time

with open(filename) as data:

    accounts = dict()

    for line in data:
        username = line.split()[1]
        IP = line.split()[0]

        try:
            accounts[username].add(IP)
        except KeyError:
            accounts[username] = set()
            accounts[username].add(IP)

print "The accounts will be deleted from memory in 5 seconds"
time.sleep(5)
accounts.clear()

print "The accounts have been deleted from memory"
time.sleep(5)

print "End of script"

The last lines are there so that I could monitor memory usage. The script uses a bit more than 3 GB in memory. Clearing the dictionary frees around 300 MB. When the script ends, the rest of the memory is freed.

I'm using Ubuntu and I've monitored memory usage using both "System Monitor" and the "free" command in terminal.
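
For reference, here is a small Linux-only helper (just a sketch I could drop into the script, not something I'm using yet) that prints the process's own resident set size by parsing VmRSS from /proc/self/status, so the checkpoints can report their own numbers instead of relying on System Monitor:

def rss_mb():
    # Linux-only: parse the VmRSS line from /proc/self/status (value is in kB)
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024.0
    return 0.0

if __name__ == '__main__':
    print "current RSS: %.1f MB" % rss_mb()

Calling rss_mb() right before and after accounts.clear() would print the same values I've been watching externally.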

What I don't understand is why Python needs so much memory after I've cleared the dictionary. Is the file still stored in memory? If so, how can I get rid of it? Is it a problem with my OS not seeing the freed memory?

EDIT: I've tried to force a gc.collect() after clearing the dictionary, to no avail.

EDIT2: I'm running Python 2.7.3 on Ubuntu 12.04 LTS.

EDIT3: I realize I forgot to mention something quite important. My real problem is not that my OS does not "get back" the memory used by Python. It's that, later on, Python does not seem to reuse that memory (it just asks the OS for more).
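
One workaround I'm considering (only a sketch, untested; 'logfile.txt' and the reduced summary it returns are placeholders) is to do the parsing in a short-lived worker process, so that everything it allocated is handed back to the OS the moment it exits:

import multiprocessing

def summarize_accounts(filename):
    # Build the big dictionary inside the worker process...
    accounts = {}
    with open(filename) as data:
        for line in data:
            IP, username = line.split()[:2]
            accounts.setdefault(username, set()).add(IP)
    # ...and return only a reduced result, so that just this small object
    # gets pickled back to the parent (here: username -> distinct IP count).
    return dict((user, len(ips)) for user, ips in accounts.iteritems())

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=1)
    summary = pool.apply(summarize_accounts, ('logfile.txt',))
    pool.close()
    pool.join()
    # The worker has exited by now, so its memory is back with the OS,
    # while this parent process only holds the much smaller summary.
    print "distinct users:", len(summary)

This only helps if the graph can be built from such a summary; if the parent needs the full dictionary, pickling it back would defeat the purpose.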

asked Sep 13 '12 by Pierre Mourlanne



2 Answers

This really doesn't make sense to me either, and I wanted to figure out how/why it happens. (I thought that's how it should work too!) I replicated it on my machine, though with a smaller file.

I saw two separate problems here:

  1. Why is Python reading the file into memory? (With lazy line reading, it shouldn't, right?)
  2. Why isn't Python releasing the memory back to the system?

I'm not knowledgeable at all about Python internals, so I just did a lot of web searching. All of this could be completely off the mark. (I barely develop anymore; I've been on the business side of tech for the past few years.)

Lazy line reading...

I looked around and found this post:

http://www.peterbe.com/plog/blogitem-040312-1

It's from a much earlier version of Python, but this line resonated with me:

readlines() reads in the whole file at once and splits it by line.

Then I saw this (also old) effbot post:

http://effbot.org/zone/readline-performance.htm

The key takeaway was this:

For example, if you have enough memory, you can slurp the entire file into memory, using the readlines method.

and this:

In Python 2.2 and later, you can loop over the file object itself. This works pretty much like readlines(N) under the covers, but looks much better

Looking at Python's docs for xreadlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.xreadlines ]:

This method returns the same thing as iter(f). Deprecated since version 2.3: Use for line in file instead.

It made me think that perhaps some slurping is going on.

So if we look at readlines [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readlines ]...

Read until EOF using readline() and return a list containing the lines thus read.

...it sort of seems like that's what's happening here.

readline, however, looked like what we wanted [ http://docs.python.org/library/stdtypes.html?highlight=readline#file.readline ]:

Read one entire line from the file

So I tried switching this to readline, and the process never grew above 40 MB (it was growing to 200 MB, the size of the log file, before):

accounts = dict()
data = open(filename)
while True:
    line = data.readline()   # read exactly one line per call
    if not line:             # empty string signals EOF
        break
    info = line.split("LOG:")
    if len(info) == 2:
        (a, b) = info
        try:
            accounts[a].add(True)
        except KeyError:
            accounts[a] = set()
            accounts[a].add(True)
data.close()

My guess is that we're not really lazy-reading the file with the "for x in data" construct, although all the docs and Stack Overflow comments suggest that we are. readline() consumed significantly less memory for me, and readlines consumed approximately the same amount of memory as "for line in data".
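
To make that comparison concrete, here's a rough sketch (mine, untested against the OP's 3 GB file; the compare_readers.py name is just a placeholder) that measures the peak RSS of each reading style. It has to be run once per mode in a fresh process, because ru_maxrss only ever grows:

# usage: python compare_readers.py [readlines|iter|readline] <file>
import resource
import sys

def use_readlines(f):
    for line in f.readlines():   # materializes every line in a list up front
        pass

def use_iter(f):
    for line in f:               # iterates the file object with its read-ahead buffer
        pass

def use_readline(f):
    while True:                  # explicit one-line-at-a-time loop
        line = f.readline()
        if not line:
            break

if __name__ == '__main__':
    mode, filename = sys.argv[1], sys.argv[2]
    readers = {'readlines': use_readlines, 'iter': use_iter, 'readline': use_readline}
    with open(filename) as f:
        readers[mode](f)
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss   # kilobytes on Linux
    print "%s: peak RSS %.1f MB" % (mode, peak_kb / 1024.0)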

Freeing memory

In terms of freeing up memory, I'm not very familiar with Python's internals, but I recall from when I worked with mod_perl: if I opened up a file that was 500 MB, that Apache child grew to that size. If I freed up the memory, it would only be free within that child; garbage-collected memory was never returned to the OS until the process exited.

So I poked around on that idea and found a few links suggesting this might be happening:

http://effbot.org/pyfaq/why-doesnt-python-release-the-memory-when-i-delete-a-large-object.htm

If you create a large object and delete it again, Python has probably released the memory, but the memory allocators involved don’t necessarily return the memory to the operating system, so it may look as if the Python process uses a lot more virtual memory than it actually uses.

That was sort of old, and I found a number of (accepted) patches to Python from later on suggesting the behavior had changed and that memory could now be returned to the OS (most of those patches were submitted, and apparently approved, around 2005).

Then I found this posting http://objectmix.com/python/17293-python-memory-handling.html -- note comment #4:

"""- Patch #1123430: Python's small-object allocator now returns an arena to the system free() when all memory within an arena becomes unused again. Prior to Python 2.5, arenas (256KB chunks of memory) were never freed. Some applications will see a drop in virtual memory size now, especially long-running applications that, from time to time, temporarily use a large number of small objects. Note that when Python returns an arena to the platform C's free(), there's no guarantee that the platform C library will in turn return that memory to the operating system. The effect of the patch is to stop making that impossible, and in tests it appears to be effective at least on Microsoft C and gcc-based systems. Thanks to Evan Jones for hard work and patience.

So with 2.4 under linux (as you tested) you will indeed not always get the used memory back, with respect to lots of small objects being collected.

The difference therefore (I think) you see between doing an f.read() and an f.readlines() is that the former reads in the whole file as one large string object (i.e. not a small object), while the latter returns a list of lines where each line is a python object.

If the 'for line in data:' construct is essentially wrapping readlines and not readline, maybe this has something to do with it? Perhaps it's not a problem of having a single 3 GB object, but of having millions of small objects instead.
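
As a rough illustration of that last idea (my own sketch; what you actually observe will depend on the Python version and on how fragmented the small-object arenas end up), compare freeing one large object with freeing millions of small ones:

import gc

def rss_mb():
    # Linux-only helper: current VmRSS in MB, parsed from /proc/self/status
    with open('/proc/self/status') as status:
        for line in status:
            if line.startswith('VmRSS:'):
                return int(line.split()[1]) / 1024.0
    return 0.0

if __name__ == '__main__':
    print "baseline:                 %6.1f MB" % rss_mb()

    big = 'x' * (500 * 1024 * 1024)      # one 500 MB string: a single large object
    print "with one big string:      %6.1f MB" % rss_mb()
    del big
    gc.collect()
    print "after deleting it:        %6.1f MB" % rss_mb()

    small = ['%030d' % i for i in xrange(5000000)]   # millions of small string objects
    print "with many small strings:  %6.1f MB" % rss_mb()
    del small
    gc.collect()
    print "after deleting them:      %6.1f MB" % rss_mb()

On Python 2.5+ the small-object arenas can be handed back to the C allocator, but fragmentation can keep the reported RSS higher in the second case than in the first.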

answered by Jonathan Vanasco


Which version of Python are you trying this on?

I did a test on Python 2.7 / Win7, and it worked as expected: the memory was released.

Here I generate sample data like yours:

import random

fn = random.randint

with open('ips.txt', 'w') as f: 
    for i in xrange(9000000):
        f.write('{0}.{1}.{2}.{3} username-{4}\n'.format(
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0,255),
            fn(0, 9000000),
        ))

And then your script. I replaced the dict with a defaultdict, because throwing exceptions makes the code slower:

import time
from collections import defaultdict

def read_file(filename):
    with open(filename) as data:

        accounts = defaultdict(set)

        for line in data:
            IP, username = line.split()[:2]
            accounts[username].add(IP)

    print "The accounts will be deleted from memory in 5 seconds"
    time.sleep(5)
    accounts.clear()

    print "The accounts have been deleted from memory"
    time.sleep(5)

    print "End of script"

if __name__ == '__main__':
    read_file('ips.txt')

As you can see, memory reached 1.4 GB and was then released, leaving 36 MB:

[Image: memory usage over time with the defaultdict version]

Using your original script I got the same results, but a bit slower:

[Image: memory usage over time with the original script]

answered by Fernando Macedo