
Recursion depth error when using BeautifulSoup with multiprocessing pool map

I have been using BeautifulSoup to parse HTML files. The scripts I write work, but slowly, so I am experimenting with a multiprocessing pool of workers alongside BeautifulSoup so my program can run faster (I have somewhere between 100,000 and 1,000,000 HTML files to open). The script I wrote is more complex, but I have written a small example below. When I try something like this, I keep getting the error

'RuntimeError: maximum recursion depth exceeded while pickling an object'

Edited Code

from bs4 import BeautifulSoup
from multiprocessing import Pool

def extraction(path):
    soup = BeautifulSoup(open(path), "lxml")
    return soup.title

pool = Pool(processes=4)
path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html',
        '/Volume3/2316/http/presszoom.com/story_164020.html']
print pool.map(extraction, path)
pool.close()
pool.join()

After doing some searching and digging through some posts, I learned that the error occurs because BeautifulSoup is exceeding the depth of the Python interpreter stack. I tried raising the limit and running the same program (I went up to 3000), but the error remained the same. I stopped raising the limit because the real problem is with the BeautifulSoup objects when opening the HTML files, not with the limit itself.

Using multiprocessing with BeautifulSoup would speed up my execution time, but I cannot figure out how to apply it to opening the files.

Does anyone have another approach for using BeautifulSoup with multiprocessing, or for getting past this kind of error?

Any help would be appreciated; I have been sitting for hours trying to fix this and to understand why I am getting the error.

Edit

I tested the above code with the files given in the paths and got the same RuntimeError as above.

The files can be accessed here (http://ec2-23-20-166-224.compute-1.amazonaws.com/sites/html_files/)

asked Apr 29 '12 at 17:04 by kich
1 Answer

I think the reason is the returning of the whole soup.title object. When that object is pickled to be sent back from the worker process, all of its children and parent elements (and their children and parents, and so on) are traversed, and this raises the recursion error.

If the content of the object is all you need, you can simply convert it to a string:

return str(soup.title)

Unfortunately, this means you no longer have access to all the other information provided by the bs4 library.
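The failure can be reproduced without BeautifulSoup at all. Below is a minimal, stdlib-only sketch (the `Node` class is a made-up stand-in, not part of bs4) showing why pickling an object that links to a whole tree blows the recursion limit, while a plain string pickles fine:

```python
import pickle

class Node(object):
    """Toy stand-in for a BeautifulSoup Tag: every node keeps a
    reference to its parent and its children, so pickling one node
    drags in the entire tree."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []

# Build a chain of nodes deeper than the default recursion limit (1000).
root = Node()
node = root
for _ in range(5000):
    child = Node(parent=node)
    node.children.append(child)
    node = child

try:
    pickle.dumps(root)
    failed = False
except RuntimeError:  # RecursionError in Python 3 is a RuntimeError subclass
    failed = True
print("pickling the linked tree failed:", failed)

# A plain string, like the result of str(soup.title), pickles trivially.
data = pickle.dumps("<title>Example</title>")
print("pickling a string worked:", isinstance(data, bytes))
```

This is why returning a string from the worker function avoids the error: `Pool.map` only ever has to pickle a flat string, never the linked parse tree.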

answered Oct 04 '22 at 05:10 by Sebastian Werk