I can't avoid Python's maximum recursion depth RuntimeError when using BeautifulSoup.
I'm trying to recurse over nested sections of code and pull out the content. The prettified HTML looks like this (don't ask why it looks like this :)):
<div><code><code><code><code>Code in here</code></code></code></code></div>
The function I'm passing my soup object to is:
def _strip_descendent_code(self, soup):
    sys.setrecursionlimit(2000)
    # soup = BeautifulSoup(html, 'lxml')
    for code in soup.findAll('code'):
        s = ""
        for c in code.descendents:
            if not isinstance(c, NavigableString):
                if c.name != code.name:
                    continue
                elif c.name == code.name:
                    if isinstance(c, NavigableString):
                        s += str(c)
                    else:
                        continue
        code.append(s)
    return str(soup)
You can see I'm trying to increase the default recursion limit, but this is not a solution. I've increased it to the point where C hits my computer's memory limit, and the function above still never works.

Any help getting this to work, and pointing out the error(s), would be much appreciated.
The stack trace repeats this:
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 529, in _find_all
i = next(generator)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1269, in descendants
stopNode = self._last_descendant().next_element
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 284, in _last_descendant
if is_initialized and self.next_sibling:
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 997, in __getattr__
return self.find(tag)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
return self._find_all(name, attrs, text, limit, generator, **kwargs)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 512, in _find_all
strainer = SoupStrainer(name, attrs, text, **kwargs)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1548, in __init__
self.text = self._normalize_search_value(text)
File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1553, in _normalize_search_value
if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
RuntimeError: maximum recursion depth exceeded while calling a Python object
I encountered this problem too and browsed a lot of web pages. I'll summarize two methods to solve it.
However, first we should understand why it happens. Python limits recursion depth (the default is 1000); you can check the current value with sys.getrecursionlimit(). I guess that BeautifulSoup uses recursion to find child elements, and when the recursion goes deeper than the limit, RuntimeError: maximum recursion depth exceeded appears.
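As a quick illustration, the limit can be inspected and raised through the sys module (the value 5000 here is just an arbitrary example):

```python
import sys

# The default recursion limit is typically 1000.
print(sys.getrecursionlimit())

# Raising it is a workaround, not a fix: setting it far too high
# can crash the interpreter (segmentation fault) instead.
sys.setrecursionlimit(5000)
print(sys.getrecursionlimit())
```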
First method: use sys.setrecursionlimit() to raise the recursion limit. You can obviously set it to 1000000, but that may cause a segmentation fault.
Second method: use try-except. If maximum recursion depth exceeded appears, our algorithm might have problems. Generally speaking, we can use loops instead of recursion. In your question, we could preprocess the HTML with replace() or a regular expression in advance.
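As a sketch of that preprocessing idea, the redundantly nested tags from the question could be collapsed with a regular expression before the HTML ever reaches BeautifulSoup (this assumes plain <code> tags with no attributes, as in the question's snippet):

```python
import re

html = "<div><code><code><code><code>Code in here</code></code></code></code></div>"

# Collapse runs of identical adjacent tags before parsing,
# so the tree BeautifulSoup builds is only one level deep.
flattened = re.sub(r'(<code>)+', '<code>', html)
flattened = re.sub(r'(</code>)+', '</code>', flattened)
print(flattened)  # <div><code>Code in here</code></div>
```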
Finally, here is an example.

from bs4 import BeautifulSoup
import sys
#sys.setrecursionlimit(10000)

try:
    doc = ''.join(['<br>' for x in range(1000)])
    soup = BeautifulSoup(doc, 'html.parser')
    a = soup.find('br')
    for i in a:
        print(i)
except RuntimeError:
    print('failed')
If you remove the #, it prints the document instead of failing.

Hope this helps.
I'm unsure about why this works (I haven't examined the source), but adding .text or .get_text() seems to bypass the error for me.

For instance, changing

lambda x: BeautifulSoup(x, 'html.parser')

to

lambda x: BeautifulSoup(x, 'html.parser').get_text()

seems to work without throwing a recursion depth error.
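Applied to the HTML from the question, that workaround might look like this minimal sketch (assuming the goal is just the text content, not a modified tree):

```python
from bs4 import BeautifulSoup

html = "<div><code><code><code><code>Code in here</code></code></code></code></div>"

# Extracting the text right away, instead of keeping and walking
# the parsed tree, avoided the recursion error for me.
text = BeautifulSoup(html, 'html.parser').get_text()
print(text)  # Code in here
```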