I made a simple web crawler using urllib2 and BeautifulSoup to extract data from a table on a webpage. To speed up the data pull I tried threading, but I get the following error repeated many times: "internal buffer error : memory allocation failed : growing buffer", and eventually the program stops with "out of memory".
Thanks for the help.
from bs4 import BeautifulSoup
from datetime import datetime
import urllib2
import re
from threading import Thread

stockData = []

# Access the list of stocks to search for data
symbolfile = open("stocks.txt")
symbolslist = symbolfile.read()
newsymbolslist = symbolslist.split("\n")

# text file the stock data is stored in
myfile = open("webcrawldata.txt", "a")

# initializing data for extraction of web data
lineOfData = ""
i = 0

def th(ur):
    stockData = []
    lineOfData = ""
    dataline = ""
    stats = ""
    page = ""
    soup = ""
    i = 0
    # creates a timestamp for when the program was run
    timestamp = datetime.now()
    # Get Data ONLINE
    # bloomberg stock quotes
    url = "http://www.bloomberg.com/quote/" + ur + ":US"
    page = urllib2.urlopen(url)
    soup = BeautifulSoup(page.read())
    # Extract key stats table only
    stats = soup.find("table", {"class": "key_stat_data"})
    # iteration for <tr>
    j = 0
    try:
        for row in stats.findAll('tr'):
            stockData.append(row.find('td'))
            j += 1
    except AttributeError:
        print "Table handling error in HTML"
    k = 0
    for cell in stockData:
        # clean up text
        dataline = stockData[k]
        lineOfData = lineOfData + " " + str(dataline)
        k += 1
    g = str(timestamp) + " " + str(ur) + ' ' + str(lineOfData) + ' ' + ("\n\n\n")
    myfile.write(g)
    print (ur + "\n")
    del stockData[:]
    lineOfData = ""
    dataline = ""
    stats = None
    page = None
    soup = None
    i += 1

threadlist = []
for u in newsymbolslist:
    t = Thread(target=th, args=(u,))
    t.start()
    threadlist.append(t)
for b in threadlist:
    b.join()
Each thread you start gets its own thread stack, which is 8 MB by default on a Linux system (see ulimit -s). Since you spawn one thread per symbol, with roughly 2700 symbols in your list the total amount of memory needed just for the thread stacks would be more than 20 gigabytes (2700 × 8 MB ≈ 21 GB).
You can use a pool of threads instead, for example 10 of them: when one finishes its job, it picks up the next task, as sketched below.
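As a minimal sketch of the pool idea (assuming your th() function and newsymbolslist stay exactly as they are), the thread-backed Pool from the standard library's multiprocessing.dummy module does the bookkeeping for you:

from multiprocessing.dummy import Pool  # same API as multiprocessing.Pool, but uses threads

pool = Pool(10)               # only 10 worker threads exist at any time
pool.map(th, newsymbolslist)  # each symbol is handed to the next free worker
pool.close()
pool.join()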
But: running more threads than CPU cores is nonsense, in general. So my advice is to stop using OS threads altogether. You can use a library like gevent to do the exact same thing without OS-level threads.
The nice thing about gevent is monkey-patching: you can tell gevent to alter the behaviour of the Python standard library, which transparently turns your threads into "greenlet" objects (see the gevent documentation for details). The kind of cooperative concurrency gevent offers is particularly well suited to the I/O-heavy work you are doing.
In your code, just add the following at the beginning:
from gevent import monkey; monkey.patch_all()
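Put together, a hedged sketch of what the gevent version could look like (again assuming your th() and newsymbolslist are unchanged; gevent.pool.Pool caps how many greenlets run at once):

from gevent import monkey; monkey.patch_all()  # patch first, before the rest of your imports
from gevent.pool import Pool

pool = Pool(10)                  # at most 10 greenlets fetch pages concurrently
for symbol in newsymbolslist:
    pool.spawn(th, symbol)
pool.join()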
You can't have more than 1024 file descriptors open at the same time on a Linux system by default (see ulimit -n), so you would have to raise this limit if you want 2700 file descriptors open at once.
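If you would rather check (or, within the hard limit, raise) that limit from Python instead of the shell, the standard resource module can do it; a small sketch:

import resource

# inspect the current per-process file descriptor limits
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print "fd limit: soft=%d, hard=%d" % (soft, hard)

# raise the soft limit (only works if the hard limit allows it, e.g. 4096 here)
# resource.setrlimit(resource.RLIMIT_NOFILE, (4096, hard))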