
Python threading - internal buffer error - out of memory

I made a simple web crawler using urllib2 and BeautifulSoup to extract data from a table on a webpage. To speed up the data pull I attempted to use threading, but I get the following error repeated many times: "internal buffer error : memory allocation failed : growing buffer", and then finally: "out of memory".

Thanks for the help.

from bs4 import BeautifulSoup
from datetime import datetime
import urllib2
import re
from threading import Thread

stockData = []

#Access the list of stocks to search for data
symbolfile = open("stocks.txt")
symbolslist = symbolfile.read()
newsymbolslist = symbolslist.split("\n")

#text file stock data is stored in
myfile = open("webcrawldata.txt","a")

#initializing data for extraction of web data
lineOfData = ""
i=0

def th(ur):
    stockData = []
    lineOfData = ""
    dataline = ""
    stats = ""
    page = ""
    soup = ""
    i=0
    #creates a timestamp for when the program was run
    timestamp = datetime.now()
    #Get Data ONLINE
    #bloomberg stock quotes
    url= "http://www.bloomberg.com/quote/" + ur + ":US"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    #Extract key stats table only
    stats = soup.find("table", {"class": "key_stat_data" })
    #iteration for <tr>
    j = 0
    try:
        for row in stats.findAll('tr'):
            stockData.append(row.find('td'))
            j += 1
    except AttributeError:
        print "Table handling error in HTML"
    k=0
    for cell in stockData:
        #clean up text
        dataline = stockData[k]
        lineOfData = lineOfData + " " + str(dataline)
        k += 1
    g = str(timestamp) + " " + str(ur)+ ' ' + str(lineOfData) + ' ' +  ("\n\n\n")    
    myfile.write(g)
    print (ur + "\n")
    del stockData[:]
    lineOfData = ""
    dataline = ""
    stats = None
    page = None
    soup = None
    i += 1

threadlist = []

for u in newsymbolslist:
    t = Thread(target = th, args = (u,))
    t.start()
    threadlist.append(t)

for b in threadlist:
    b.join()
1 Answer

Each thread you start gets its own stack, which is 8 MB by default on a Linux system (see ulimit -s), so the total amount of memory needed for your threads would be more than 20 gigabytes.
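For scale: roughly 2700 symbols at one thread each, times 8 MB of stack, is about 21 GB of address space. If you do keep one thread per symbol, the stack reserved for each new thread can be shrunk before any thread is started; a sketch using the standard threading.stack_size call (the 512 KB value is an arbitrary illustration, not a recommendation):

import threading

# Must be called before the threads are created; the size must be at least
# 32 KB and, on some platforms, a multiple of 4 KB.
threading.stack_size(512 * 1024)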

You can use a pool of threads instead, for example 10 of them; when one finishes its job, it picks up the next task, as sketched below.
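A minimal sketch of such a pool, assuming the th() function and newsymbolslist from your code; multiprocessing.dummy provides a thread-backed Pool with the same interface as the process-based one:

from multiprocessing.dummy import Pool  # thread pool, despite the module name

# At most 10 worker threads; map() blocks until every symbol is processed.
pool = Pool(10)
pool.map(th, newsymbolslist)
pool.close()
pool.join()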

But running more threads than you have CPU cores is, in general, pointless. So my advice is to stop using threads. You can use a library like gevent to do exactly the same thing without OS-level threads.

The nice thing about gevent is monkey-patching: you can tell gevent to alter the behaviour of the Python standard library, which transparently turns your threads into "greenlet" objects (see the gevent documentation for details). The kind of concurrency gevent provides is particularly well suited to I/O-intensive work like yours.

In your code, just add the following at the beginning:

from gevent import monkey; monkey.patch_all()
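
Beyond the monkey-patch, gevent also has its own pool to cap how many downloads run at once; a minimal sketch, assuming the th() function and newsymbolslist from your code (the pool size of 50 is an arbitrary choice):

from gevent import monkey; monkey.patch_all()
import gevent
from gevent.pool import Pool

# Cap the number of simultaneous downloads; greenlets are cheap, but the
# remote server and your file descriptor limit are not unlimited.
pool = Pool(50)
jobs = [pool.spawn(th, u) for u in newsymbolslist]
gevent.joinall(jobs)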

By default on a Linux system you can't have more than 1024 file descriptors open at the same time (see ulimit -n), so you would have to raise this limit if you really want 2700 file descriptors open at once.
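
If you prefer to do it from within the program, the standard resource module can read the current limits and raise the soft limit up to the hard limit (a sketch; the actual values are system-dependent, and raising the hard limit itself requires root):

import resource

# Inspect the current per-process file descriptor limits.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print "soft limit:", soft, "hard limit:", hard

# Raise the soft limit as far as the hard limit allows.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))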


