Python-requests not clearing memory when downloading with sessions

I have an application where I use requests to download .mp3 files from a server.

The code looks like this:

self.client = requests.session(headers={'User-Agent': self.useragent})

def download(self, url, name):
    request = self.client.get(url)

    with open(name, "wb") as code:
        code.write(request.content)

    print "done"

The problem is that when the download is finished, Python does not release the memory, so every time I download an mp3 the application's memory usage rises by the size of that mp3. The memory never gets released, so my app ends up using a lot of memory.

I assume this has to do with how I save the file, or how requests.session works.

Any suggestions?

Edit: Here is the code: https://github.com/Simon1988/VK-Downloader

The relevant part is in lib/vklib.py

scandinavian_ asked Jan 11 '13


1 Answer

I don't think there's an actual problem here, beyond you not understanding how memory allocation works.

When Python needs more memory, it asks the OS for more. When it's done with that memory, it generally does not return it to the OS; instead, it holds onto it for later objects.

So, when you open the first 10MB mp3, your memory use goes from, say, 3MB to 13MB. Then you free up that memory, but you're still at 13MB. Then you open a second 10MB mp3, but it reuses the same memory, so you're still at 13MB. And so on.

In your code, you're creating a thread for each download. If you have 5 threads at a time, all using 10MB, obviously that means you're using 50MB. And that 50MB won't be released. But if you wait for them to finish, then do another 5 downloads, it'll reuse the same 50MB again.

Since your code doesn't limit the number of threads in any way, there's nothing (short of CPU speed and context-switching costs) to stop you from kicking off hundreds of threads, each using 10MB, meaning gigabytes of RAM. But just switching to a thread pool, or not letting the user kick off more downloads if too many are already going on, etc., will solve that.
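
For example, a bounded thread pool keeps only a few downloads (and their buffers) alive at once. Here is a minimal sketch using concurrent.futures; the fetch and download_all names are made up for illustration and aren't taken from the linked repo:

    from concurrent.futures import ThreadPoolExecutor

    import requests

    def fetch(url, name):
        # One download; this still buffers the whole file in memory
        # (see option 2 below for a streaming version that avoids that).
        r = requests.get(url)
        with open(name, "wb") as f:
            f.write(r.content)

    def download_all(jobs, max_workers=5):
        # jobs is an iterable of (url, name) pairs; at most max_workers
        # downloads are in flight at any given moment.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(fetch, url, name) for url, name in jobs]
            for future in futures:
                future.result()  # re-raise any exception from a worker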

So, usually, this is not a problem. But if it is, there are two ways around it:

  1. Create a child process (e.g., via the multiprocessing module) to do the memory-hogging work. On any modern OS, when a process goes away, its memory is reclaimed. The problem here is that allocating and releasing 10MB over and over again is actually going to slow your system down, not speed it up, and the cost of process startup (especially on Windows) will make it even worse. So you'll probably want to spin a much larger batch of jobs off to a child process (a sketch follows this list).

  2. Don't read the whole thing into memory at once; use a streaming API instead of a whole-file API. With requests, this means setting stream=True in the initial request, and then usually calling r.raw.read(8192), r.iter_content(), or r.iter_lines() in a loop instead of accessing r.content (see the streaming sketch below).
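
A rough sketch of option 1, assuming you batch up a list of (url, name) jobs and hand it to a short-lived worker process (download_batch and download_in_child are hypothetical names, not from the original code):

    import multiprocessing

    import requests

    def download_batch(jobs):
        # Runs inside the child: handle a whole batch of downloads,
        # so the process-startup cost is paid once per batch.
        for url, name in jobs:
            r = requests.get(url)
            with open(name, "wb") as f:
                f.write(r.content)

    def download_in_child(jobs):
        # On Windows, call this from under an `if __name__ == "__main__":`
        # guard, since multiprocessing spawns a fresh interpreter there.
        p = multiprocessing.Process(target=download_batch, args=(jobs,))
        p.start()
        p.join()  # once the child exits, its memory goes back to the OS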
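
And a minimal sketch of option 2, streaming the body to disk in chunks rather than loading it all via r.content (the function name and the 8192-byte chunk size are arbitrary choices here):

    import requests

    def download_streaming(url, name, chunk_size=8192):
        r = requests.get(url, stream=True)  # body isn't read until we ask for it
        try:
            with open(name, "wb") as f:
                for chunk in r.iter_content(chunk_size):
                    if chunk:  # skip keep-alive chunks
                        f.write(chunk)
        finally:
            r.close()  # release the connection back to the pool

With this, peak memory per download stays around one chunk rather than the whole mp3.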

abarnert answered Sep 26 '22