I made a simple web crawler using urllib2 and BeautifulSoup to extract data from a table on a webpage. To speed up the data pull I tried threading, but I get the following error repeated many times: "internal buffer error : memory allocation failed : growing buffer", and eventually the program stops with "out of memory".
Thanks for the help.
from bs4 import BeautifulSoup
from datetime import datetime
import urllib2
import re
from threading import Thread

stockData = []

# Access the list of stocks to search for data
symbolfile = open("stocks.txt")
symbolslist = symbolfile.read()
newsymbolslist = symbolslist.split("\n")

# text file the stock data is stored in
myfile = open("webcrawldata.txt", "a")

# initializing data for extraction of web data
lineOfData = ""
i = 0

def th(ur):
    stockData = []
    lineOfData = ""
    dataline = ""
    stats = ""
    page = ""
    soup = ""
    i = 0
    # creates a timestamp for when the program was run
    timestamp = datetime.now()
    # Get Data ONLINE
    # bloomberg stock quotes
    url = "http://www.bloomberg.com/quote/" + ur + ":US"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    # Extract key stats table only
    stats = soup.find("table", {"class": "key_stat_data"})
    # iteration for <tr>
    j = 0
    try:
        for row in stats.findAll('tr'):
            stockData.append(row.find('td'))
            j += 1
    except AttributeError:
        print "Table handling error in HTML"
    k = 0
    for cell in stockData:
        # clean up text
        dataline = stockData[k]
        lineOfData = lineOfData + " " + str(dataline)
        k += 1
    g = str(timestamp) + " " + str(ur) + ' ' + str(lineOfData) + ' ' + ("\n\n\n")
    myfile.write(g)
    print (ur + "\n")
    del stockData[:]
    lineOfData = ""
    dataline = ""
    stats = None
    page = None
    soup = None
    i += 1

threadlist = []
for u in newsymbolslist:
    t = Thread(target=th, args=(u,))
    t.start()
    threadlist.append(t)
for b in threadlist:
    b.join()
Each thread you start gets its own thread stack, which is 8 MB by default on a Linux system (see ulimit -s). Since you spawn one thread per symbol, with roughly 2700 symbols in your list the total amount of memory needed just for the thread stacks would be more than 20 gigabytes (2700 × 8 MB ≈ 21 GB).
You can use a pool of threads instead, for example 10 of them: when one finishes its job, it picks up the next task, as sketched below.
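As a minimal sketch of the pool idea (assuming your th() function and newsymbolslist stay exactly as they are), the thread-backed Pool from the standard library's multiprocessing.dummy module does the bookkeeping for you:

from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but uses threads

pool = Pool(10)               # only 10 worker threads exist at any time
pool.map(th, newsymbolslist)  # each symbol is handed to the next free worker
pool.close()
pool.join()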
But: running more threads than CPU cores is nonsense, in general. So my advice is to stop using OS threads altogether. You can use a library like gevent to do the exact same thing without OS-level threads.
The nice thing about gevent is monkey-patching: you can tell gevent to alter the behaviour of the Python standard library, which transparently turns your threads into "greenlet" objects (see the gevent documentation for details). The kind of cooperative concurrency gevent offers is particularly well suited to the I/O-heavy work you are doing.
In your code, just add the following at the beginning:
from gevent import monkey; monkey.patch_all()
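Put together, a hedged sketch of what the gevent version could look like (again assuming your th() and newsymbolslist are unchanged; gevent.pool.Pool caps how many greenlets run at once):

from gevent import monkey; monkey.patch_all()  # patch first, before the rest of your imports
from gevent.pool import Pool

pool = Pool(10)                  # at most 10 greenlets fetch pages concurrently
for symbol in newsymbolslist:
    pool.spawn(th, symbol)
pool.join()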
You can't have more than 1024 file descriptors open at the same time on a Linux system by default (see ulimit -n), so you would have to raise this limit if you want 2700 file descriptors open at once.
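If you would rather check (or, within the hard limit, raise) that limit from Python instead of the shell, the standard resource module can do it; a small sketch:

import resource

# inspect the current per-process file descriptor limits
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print "fd limit: soft=%d, hard=%d" % (soft, hard)

# raise the soft limit (only works if the hard limit allows it, e.g. 4096 here)
# resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))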