Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Maximum recursion depth exceeded. Multiprocessing and bs4

I'm trying to make a parser use beautifulSoup and multiprocessing. I have an error:

RecursionError: maximum recursion depth exceeded

My code is:

import bs4, requests, time
from multiprocessing.pool import Pool

html = requests.get('https://www.avito.ru/moskva/avtomobili/bmw/x6?sgtd=5&radius=0')
soup = bs4.BeautifulSoup(html.text, "html.parser")

divList = soup.find_all("div", {'class': 'item_table-header'})


def new_check():
    with Pool() as pool:
        pool.map(get_info, divList)

def get_info(each):
   pass

if __name__ == '__main__':
    new_check()

Why I get this error and how I can fix it?

UPDATE: All text of error is

Traceback (most recent call last):
  File "C:/Users/eugen/PycharmProjects/avito/main.py", line 73, in <module> new_check()
  File "C:/Users/eugen/PycharmProjects/avito/main.py", line 67, in new_check
    pool.map(get_info, divList)
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 644, in get
    raise self._value
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 424, in _handle_tasks
    put(task)
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "C:\Users\eugen\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
RecursionError: maximum recursion depth exceeded
like image 945
SBrain Avatar asked Aug 25 '18 21:08

SBrain


1 Answers

When you use multiprocessing, everything you pass to a worker has to be pickled.

Unfortunately, many BeautifulSoup trees can't be pickled.


There are a few different reasons for this. Some of them are bugs that have since been fixed, so you could try making sure you have the latest bs4 version, and some are specific to different parsers or tree builders… but there's a good chance nothing like this will help.

But the fundamental problem is that many elements in the tree contain references to the rest of the tree.

Occasionally, this leads to an actual infinite loop, because the circular references are too indirect for its circular reference detection. But that's usually a bug that gets fixed.

But, even more importantly, even when the loop isn't infinite, it can still drag in more than 1000 elements from all over the rest of the tree, and that's already enough to cause a RecursionError.

And I think the latter is what's happening here. If I take your code and try to pickle divList[0], it fails. (If I bump the recursion limit way up and count the frames, it needs a depth of 23080, which is way, way past the default of 1000.) But if I take that exact same div and parse it separately, it succeeds with no problem.


So, one possibility is to just do sys.setrecursionlimit(25000). That will solve the problem for this exact page, but a slightly different page might need even more than that. (Plus, it's usually not a great idea to set the recursion limit that high—not so much because of the wasted memory, but because it means actual infinite recursion takes 25x as long, and 25x as much wasted resources, to detect.)


Another trick is to write code that "prunes the tree", eliminating any upward links from the div before/as you pickle it. This is a great solution, except that it might be a lot of work, and requires diving into the internals of how BeautifulSoup works, which I doubt you want to do.


The easiest workaround is a bit clunky, but… you can convert the soup to a string, pass that to the child, and have the child re-parse it:

def new_check():
    divTexts = [str(div) for div in divList]
    with Pool() as pool:
        pool.map(get_info, divTexts)

def get_info(each):
    div = BeautifulSoup(each, 'html.parser')

if __name__ == '__main__':
    new_check()

The performance cost for doing this is probably not going to matter; the bigger worry is that if you had imperfect HTML, converting to a string and re-parsing it might not be a perfect round trip. So, I'd suggest that you do some tests without multiprocessing first to make sure this doesn't affect the results.

like image 118
abarnert Avatar answered Nov 03 '22 23:11

abarnert