
BeautifulSoup `find_all` generator

Is there any way to turn find_all into a more memory efficient generator? For example:

Given:

soup = BeautifulSoup(content, "html.parser")
return soup.find_all('item')

I would like to instead use:

soup = BeautifulSoup(content, "html.parser")
while True:
    yield soup.next_item_generator()

(assume proper handling of the final StopIteration exception)

There are some generators built in, but none that yields the next result of a find. find returns just the first match. With thousands of items, find_all uses a lot of memory. For 5,792 items, I'm seeing a spike of just over 1 GB of RAM.

I am well aware that there are more efficient parsers, such as lxml, that can accomplish this. Let's assume that there are other business constraints preventing me from using anything else.

How can I turn find_all into a generator so I can iterate through the results in a more memory-efficient way?

asked Dec 29 '16 by Jamie Counsell

2 Answers

The simplest method is to use find_next:

soup = BeautifulSoup(content, "html.parser")

def find_iter(tagname):
    tag = soup.find(tagname)
    while tag is not None:
        yield tag
        tag = tag.find_next(tagname)
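As a concrete usage sketch, here is the same generator run against a small sample document (the markup and the `items` variable are invented for illustration; only one tag is held in memory per step):

```python
from bs4 import BeautifulSoup

content = "<div><item>Item 1</item><item>Item 2</item><item>Item 3</item></div>"
soup = BeautifulSoup(content, "html.parser")

def find_iter(tagname):
    # Yield matching tags one at a time instead of building a full result list
    tag = soup.find(tagname)
    while tag is not None:
        yield tag
        tag = tag.find_next(tagname)

items = [tag.get_text() for tag in find_iter('item')]
print(items)
```

Because find_next walks forward through the document from the current tag, this visits every later match in document order without materializing them all at once.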
answered Sep 22 '22 by ekhumoro

There is no "find" generator in BeautifulSoup, as far as I know, but we can combine SoupStrainer with the .children generator.

Let's imagine we have this sample HTML:

<div>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>
</div>

from which we need to get the text of all item nodes.

We can use the SoupStrainer to parse only the item tags and then iterate over the .children generator and get the texts:

from bs4 import BeautifulSoup, SoupStrainer

data = """
<div>
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>
</div>"""

parse_only = SoupStrainer('item')
soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)
for item in soup.children:
    print(item.get_text())

Prints:

Item 1
Item 2
Item 3
Item 4
Item 5

In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, like .children. You can also use one of these generators directly and manually filter the tag by name or other criteria inside the generator body, e.g. something like:

def generate_items(soup):
    for tag in soup.descendants:
        if tag.name == "item":
            yield tag.get_text()

The .descendants generator yields all nested elements recursively, while .children only yields the direct children of a node.
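To make that concrete, here is a minimal self-contained sketch of the manual-filtering approach (the sample markup and the `result` variable are illustrative, not from the original answer):

```python
from bs4 import BeautifulSoup

data = """
<div>
    <item>Item 1</item>
    <item>Item 2</item>
</div>"""

def generate_items(soup):
    # Lazily walk every descendant and yield the text of <item> tags only.
    # NavigableString nodes have name == None, so the check skips them.
    for tag in soup.descendants:
        if tag.name == "item":
            yield tag.get_text()

soup = BeautifulSoup(data, "html.parser")
result = list(generate_items(soup))
print(result)
```

Nothing beyond the parsed tree itself is held in memory; each text string is produced on demand as the caller iterates.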

answered Sep 20 '22 by alecxe