I have been using BeautifulSoup for parsing HTML files. All the scripts I write work, but they are slow. So I am experimenting with a multiprocessing pool of workers together with BeautifulSoup so my program can run faster (I have around 100,000 - 1,000,000 HTML files to open). The script I wrote is more complex, but I have written a small example below. I am trying to do something like this and I keep getting the error
'RuntimeError: maximum recursion depth exceeded while pickling an object'
Edited Code
from bs4 import BeautifulSoup
from multiprocessing import Pool

def extraction(path):
    soup = BeautifulSoup(open(path), "lxml")
    return soup.title

pool = Pool(processes=4)
path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html', '/Volume3/2316/http/presszoom.com/story_164020.html']
print pool.map(extraction, path)
pool.close()
pool.join()
After doing some searching and digging through some posts, I learned that the error occurs because BeautifulSoup is exceeding the depth of the Python interpreter stack. I tried raising the limit and running the same program (I went up to 3000), but the error remained the same. I stopped raising the limit because the real problem is with BeautifulSoup when opening the HTML files.
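What I tried looks roughly like this (assuming the standard sys.setrecursionlimit call):

import sys
# Raise the interpreter's recursion limit; in my case the
# RuntimeError stayed the same even at 3000.
sys.setrecursionlimit(3000)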
Using multiprocessing with BeautifulSoup should speed up my execution time, but I cannot figure out how to apply it to opening the files.
Does anyone have another approach for using BeautifulSoup with multiprocessing, or for getting around this kind of error?
Any kind of help will be appreciated; I have been sitting for hours trying to fix this and understand why I am getting the error.
Edit
I tested the above code with the files given in the paths and got the same RuntimeError as above.
The files can be accessed here (http://ec2-23-20-166-224.compute-1.amazonaws.com/sites/html_files/)
I think the reason is returning the whole soup.title object. It seems that all of its children and parent elements, and their children and parents and so on, are traversed at that moment (when the result is pickled), and this raises the recursion error.
If the content of the object is all you need, you can simply call the str method:
return soup.title.__str__()
Unfortunately, this means you no longer have access to all the other information provided by the bs4 library.
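Applied to the example from the question, the worker then returns a plain string, which pickles without recursing through the parse tree. A sketch (str(soup.title) is equivalent to calling __str__ directly):

from bs4 import BeautifulSoup
from multiprocessing import Pool

def extraction(path):
    soup = BeautifulSoup(open(path), "lxml")
    # Convert the Tag to a plain string before returning, so only the
    # string (not the whole linked parse tree) is pickled and sent back.
    return str(soup.title)

pool = Pool(processes=4)
path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html', '/Volume3/2316/http/presszoom.com/story_164020.html']
print pool.map(extraction, path)
pool.close()
pool.join()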