I have been using BeautifulSoup for parsing HTML files. All the scripts I write work, but they are slow. So I am experimenting with a multiprocessing pool of workers together with BeautifulSoup so my program can run faster (I have around 100,000 - 1,000,000 HTML files to open). The script I wrote is more complex, but I have written a small example below. I am trying to do something like this and I keep getting the error
'RuntimeError: maximum recursion depth exceeded while pickling an object'
Edited Code
from bs4 import BeautifulSoup
from multiprocessing import Pool

def extraction(path):
    soup = BeautifulSoup(open(path), "lxml")
    return soup.title

pool = Pool(processes=4)
path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html', '/Volume3/2316/http/presszoom.com/story_164020.html']
print pool.map(extraction, path)
pool.close()
pool.join()
After doing some searching and digging through some posts, I learned that the error occurs because BeautifulSoup is exceeding the depth of the Python interpreter stack. I tried raising the limit and running the same program (I went up to 3000), but the error remained the same. I stopped raising the limit because the real problem is with BeautifulSoup when opening the HTML files.
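What I tried looks roughly like this (assuming the standard sys.setrecursionlimit call):

import sys
# Raise the interpreter's recursion limit; in my case the
# RuntimeError stayed the same even at 3000.
sys.setrecursionlimit(3000)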
Using multiprocessing with BeautifulSoup should speed up my execution time, but I cannot figure out how to apply it to opening the files.
Does anyone have another approach for using BeautifulSoup with multiprocessing, or for getting around this kind of error?
Any kind of help will be appreciated; I have been sitting for hours trying to fix this and understand why I am getting the error.
Edit
I tested the above code with the files given in the paths and got the same RuntimeError as above.
The files can be accessed here (http://ec2-23-20-166-224.compute-1.amazonaws.com/sites/html_files/)
I think the reason is returning the whole soup.title object. It seems that all of its children and parent elements, and their children and parents and so on, are traversed at that moment (when the result is pickled), and this raises the recursion error.
If the content of the object is all you need, you can simply call the str method:
return soup.title.__str__()
Unfortunately, this means you no longer have access to all the other information provided by the bs4 library.
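Applied to the example from the question, the worker then returns a plain string, which pickles without recursing through the parse tree. A sketch (str(soup.title) is equivalent to calling __str__ directly):

from bs4 import BeautifulSoup
from multiprocessing import Pool

def extraction(path):
    soup = BeautifulSoup(open(path), "lxml")
    # Convert the Tag to a plain string before returning, so only the
    # string (not the whole linked parse tree) is pickled and sent back.
    return str(soup.title)

pool = Pool(processes=4)
path = ['/Volume3/2316/http/www.metro.co.uk/news/852300-haiti-quake-victim-footballers-stage-special-tournament/crawlerdefault.html', '/Volume3/2316/http/presszoom.com/story_164020.html']
print pool.map(extraction, path)
pool.close()
pool.join()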