 

Python Script with Gevent Pool, consumes a lot of memory, locks up

I have a very simple Python script that uses gevent.pool to download URLs (see below). The script runs fine for a couple of days and then locks up. When it does, memory usage is very high. Am I using gevent incorrectly?

import sys

from gevent import monkey
monkey.patch_all()
import urllib2

from gevent.pool import Pool

inputFile = open(sys.argv[1], 'r')
urls = []
counter = 0
for line in inputFile:
    counter += 1
    urls.append(line.strip())
inputFile.close()

outputDirectory = sys.argv[2]

def fetch(url):
    try:
        body = urllib2.urlopen("http://" + url, None, 5).read()
        if len(body) > 0:
            outputFile = open(outputDirectory + "/" + url, 'w')
            outputFile.write(body)
            outputFile.close()
            print "Success", url
    except:
        pass

pool = Pool(int(sys.argv[3]))
pool.map(fetch, urls)
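
For context, the script takes a file of URLs (one per line), an output directory, and a pool size on the command line, so it is run something like this (the script name here is just a placeholder):

python fetcher.py urls.txt downloads 100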

asked by Nikhil

1 Answer

        body = urllib2.urlopen("http://" + url, None, 5).read()

The line above reads the entire response body into memory as a single string. With many concurrent downloads, those strings accumulate and memory usage climbs. To avoid that, change fetch() to stream the body to disk in fixed-size chunks:

def fetch(url):
    try:
        u = urllib2.urlopen("http://" + url, None, 5)
        try:
            # Open in 'wb' so binary bodies survive intact, and copy
            # in 64 KiB chunks so the full response never sits in
            # memory at once.
            with open(outputDirectory + "/" + url, 'wb') as outputFile:
                while True:
                    chunk = u.read(65536)
                    if not chunk:
                        break
                    outputFile.write(chunk)
        finally:
            u.close()
        print "Success", url
    except Exception:
        print "Fail", url

answered by falsetru
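
As an aside, on Python 3 (where urllib2 no longer exists) the same chunked-streaming idea can be sketched with urllib.request and shutil.copyfileobj. This is a minimal illustration, not part of the original answer: the output directory and URL list are placeholders, and error handling is kept deliberately simple.

import shutil

from gevent import monkey
monkey.patch_all()  # patch sockets before any network I/O

import urllib.request
from gevent.pool import Pool

OUTPUT_DIR = "downloads"  # illustrative; must already exist

def fetch(url):
    try:
        # The response object is a file-like stream.
        with urllib.request.urlopen("http://" + url, timeout=5) as response:
            # 'wb' because the body is bytes; copyfileobj copies in
            # 64 KiB chunks, so the full body never sits in memory.
            with open(OUTPUT_DIR + "/" + url, 'wb') as output_file:
                shutil.copyfileobj(response, output_file, 65536)
        print("Success", url)
    except Exception:
        print("Fail", url)

pool = Pool(100)
pool.map(fetch, ["example.com", "python.org"])  # illustrative URLs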