
BeautifulSoup: RuntimeError: maximum recursion depth exceeded

I can't avoid the maximum recursion depth Python RuntimeError using BeautifulSoup.

I'm trying to recurse over nested sections of code and pull out the content. The prettified HTML looks like this (don't ask why it looks like this :)):

<div><code><code><code><code>Code in here</code></code></code></code></div>

The function I'm passing my soup object to is:

def _strip_descendent_code(self, soup):
    sys.setrecursionlimit(2000)
    # soup = BeautifulSoup(html, 'lxml')
    for code in soup.findAll('code'):
        s = ""
        for c in code.descendents:
            if not isinstance(c, NavigableString):
                if c.name != code.name:
                    continue
                elif c.name == code.name:
                    if isinstance(c, NavigableString):
                        s += str(c)
                    else:
                        continue
        code.append(s)
    return str(soup)

You can see I'm trying to increase the default recursion limit, but this is not a solution. I've increased it up to the point where C hits the memory limit on my computer, and the function above still never works.

Any help to get this to work and point out the error/s would be much appreciated.

The stack trace repeats this:

  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 529, in _find_all
    i = next(generator)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1269, in descendants
    stopNode = self._last_descendant().next_element
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 284, in _last_descendant
    if is_initialized and self.next_sibling:
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 997, in __getattr__
    return self.find(tag)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 529, in _find_all
    i = next(generator)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1269, in descendants
    stopNode = self._last_descendant().next_element
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 284, in _last_descendant
    if is_initialized and self.next_sibling:
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 997, in __getattr__
    return self.find(tag)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1234, in find
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1255, in find_all
    return self._find_all(name, attrs, text, limit, generator, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 512, in _find_all
    strainer = SoupStrainer(name, attrs, text, **kwargs)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1548, in __init__
    self.text = self._normalize_search_value(text)
  File "/Users/almccann/.virtualenvs/evernoteghost/lib/python3.4/site-packages/bs4/element.py", line 1553, in _normalize_search_value
    if (isinstance(value, str) or isinstance(value, collections.Callable) or hasattr(value, 'match')
RuntimeError: maximum recursion depth exceeded while calling a Python object
almccann asked Jul 21 '15 00:07


2 Answers

I ran into this problem too and read through a lot of web pages. I can summarize two ways to deal with it.

First, though, we should understand why it happens. Python limits the depth of recursion (the default limit is 1000). You can check the current limit with print(sys.getrecursionlimit()). My guess is that BeautifulSoup uses recursion to find child elements. When the recursion goes deeper than the limit, RuntimeError: maximum recursion depth exceeded is raised.
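To make the limit concrete, here is a minimal sketch that uses only the standard library, not BeautifulSoup. The helper names depth and exceeds_limit are made up for illustration. It shows that a recursive function works fine below the limit and raises the same kind of error once it needs more frames than the limit allows.

```python
import sys


def depth(n):
    """Recurse n levels deep and return n; every level adds one stack frame."""
    return 0 if n == 0 else 1 + depth(n - 1)


def exceeds_limit(n):
    """Return True if recursing n levels deep hits the recursion limit."""
    try:
        depth(n)
    except RuntimeError:  # RecursionError (Python 3.5+) is a subclass of RuntimeError
        return True
    return False


limit = sys.getrecursionlimit()  # 1000 by default in a stock CPython
print("recursion limit:", limit)
print("depth 50 fails:", exceeds_limit(50))
print("depth limit + 100 fails:", exceeds_limit(limit + 100))
```

Catching RuntimeError keeps the sketch working on both Python 3.4 (as in the question's traceback) and newer versions.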

First method: use sys.setrecursionlimit() to raise the recursion limit. You could obviously set it to 1000000, but that may cause a segmentation fault.
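If you do raise the limit, one option is to raise it only around the code that needs it and restore the old value afterwards. The sketch below assumes a hypothetical helper called recursion_limit (it is not part of sys or bs4) and uses a plain recursive function in place of the BeautifulSoup call.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily set the recursion limit, restoring the old value on exit."""
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


def depth(n):
    """Stand-in for any deeply recursive call."""
    return 0 if n == 0 else 1 + depth(n - 1)


before = sys.getrecursionlimit()
with recursion_limit(max(before, 3000)):
    print(depth(2000))  # deeper than the usual default limit of 1000

print(sys.getrecursionlimit() == before)  # the original limit is back
```

This keeps a very high limit from leaking into the rest of the program, but it does not protect against the segmentation fault mentioned above if the limit is set too high.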

Second method: use try-except. If maximum recursion depth exceeded appears, the algorithm itself may have a problem. Generally speaking, we can use loops instead of recursion. In your case, the HTML could be preprocessed with replace() or a regular expression before parsing.
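As a rough sketch of the regular expression idea, the function below collapses runs of directly nested <code> tags into a single pair before the HTML is handed to a parser. The name collapse_nested_code is made up for this example, and it assumes plain <code> tags without attributes, separated by nothing but whitespace. Regular expressions are fragile on general HTML, so treat this as a workaround for this specific shape of input.

```python
import re


def collapse_nested_code(html):
    """Collapse directly nested <code> tags into a single <code>...</code> pair."""
    # A run of opening tags (optionally separated by whitespace) becomes one tag.
    html = re.sub(r"<code>(?:\s*<code>)+", "<code>", html)
    # Same for the matching run of closing tags.
    html = re.sub(r"</code>(?:\s*</code>)+", "</code>", html)
    return html


sample = "<div><code><code><code><code>Code in here</code></code></code></code></div>"
print(collapse_nested_code(sample))
# <div><code>Code in here</code></div>
```

The flattened string can then be parsed as usual, with far less nesting to walk through.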

Finally, here is an example.

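from bs4 import BeautifulSoup
import sys

# Uncomment to raise the recursion limit before parsing.
# sys.setrecursionlimit(10000)

try:
    # Build a document of 1000 consecutive <br> tags.
    doc = ''.join(['<br>' for x in range(1000)])
    soup = BeautifulSoup(doc, 'html.parser')
    a = soup.find('br')
    for i in a:
        print(i)
except RuntimeError:
    # "maximum recursion depth exceeded" is a RuntimeError
    # (RecursionError, a subclass, on Python 3.5+).
    print('failed')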

If you remove the # (so that the recursion limit is raised), it can print doc.

Hope this helps.

Absinth answered Oct 18 '22 10:10


I'm unsure about why this works (I haven't examined the source), but adding .text or .get_text() seems to bypass the error for me.

For instance, changing

lambda x: BeautifulSoup(x, 'html.parser')

to

lambda x: BeautifulSoup(x, 'html.parser').get_text()

seems to work without throwing a recursion depth error.

ngopal answered Oct 18 '22 10:10