I have a very simple Python script using gevent.pool to download URLs (see below). The script runs fine for a couple of days and then locks up. I noticed that the memory usage is very high at that time. Am I using gevent incorrectly?
import sys
from gevent import monkey
monkey.patch_all()
import urllib2
from gevent.pool import Pool

inputFile = open(sys.argv[1], 'r')
urls = []
counter = 0
for line in inputFile:
    counter += 1
    urls.append(line.strip())
inputFile.close()

outputDirectory = sys.argv[2]

def fetch(url):
    try:
        body = urllib2.urlopen("http://" + url, None, 5).read()
        if len(body) > 0:
            outputFile = open(outputDirectory + "/" + url, 'w')
            outputFile.write(body)
            outputFile.close()
            print "Success", url
    except:
        pass

pool = Pool(int(sys.argv[3]))
pool.map(fetch, urls)
body = urllib2.urlopen("http://" + url, None, 5).read()
The line above reads the entire response body into memory as a single string. To avoid that, change fetch() so it streams the response to disk in fixed-size chunks:
def fetch(url):
    try:
        u = urllib2.urlopen("http://" + url, None, 5)
        try:
            with open(outputDirectory + "/" + url, 'w') as outputFile:
                while True:
                    chunk = u.read(65536)
                    if not chunk:
                        break
                    outputFile.write(chunk)
        finally:
            u.close()
        print "Success", url
    except:
        print "Fail", url