 

Possible to use multiple strainers with one BeautifulSoup document?

I'm using Django and Python 3.7. I want to speed up my HTML parsing. Currently, I'm looking for three types of elements in my document, like so:

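import re
import urllib.request
from bs4 import BeautifulSoup

# urllib2 is Python 2 only; on Python 3.7 the same calls live in urllib.request
req = urllib.request.Request(fullurl, headers=settings.HDR)
html = urllib.request.urlopen(req).read()
comments_soup = BeautifulSoup(html, features="html.parser")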

score_elts = comments_soup.findAll("div", {"class": "score"})

comments_elts = comments_soup.findAll("a", attrs={'class': 'comments'})

bad_elts = comments_soup.findAll("span", text=re.compile("low score"))

I have read that SoupStrainer is one way to improve performance (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document). However, all the examples only show parsing an HTML document with a single strainer. In my case, I have three. How can I pass three strainers into my parsing, or would that actually perform worse than just doing it the way I'm doing it now?

Dave asked Dec 04 '25 03:12


1 Answer

I don't think you can pass multiple strainers into the BeautifulSoup constructor. What you can do instead is wrap all your conditions into one SoupStrainer and pass that to the constructor.

For simple cases, such as filtering on tag names only, you can pass a list into the SoupStrainer:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer

html = """
<a>yes</a>
<p>yes</p>
<span>no</span>
"""
# Keep only <a> and <p> tags while parsing
custom_strainer = SoupStrainer(["a", "p"])
soup = BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)

Output

<a>yes</a><p>yes</p>
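
Applied to the three searches from the question, the list form might look like the sketch below. The sample HTML is made up for illustration, and html.parser is used because that is what the question uses. The strainer here filters by tag name only; the class and text conditions are still applied by find_all on the smaller tree. string= is the current name for the older text= argument.

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample page; the real markup in the question is not shown.
html = """
<html><body>
<div class="score">42</div>
<a class="comments" href="/c/1">3 comments</a>
<span>low score</span>
<p>unrelated paragraph</p>
</body></html>
"""

# One strainer covering all three tag names the searches need.
only_needed = SoupStrainer(["div", "a", "span"])
soup = BeautifulSoup(html, "html.parser", parse_only=only_needed)

# The same three searches as in the question, run on the reduced tree.
score_elts = soup.find_all("div", {"class": "score"})
comments_elts = soup.find_all("a", attrs={"class": "comments"})
bad_elts = soup.find_all("span", string=re.compile("low score"))

print(score_elts, comments_elts, bad_elts)
```

Tags outside the list, such as the p above, never make it into the tree, so later searches have less to walk through.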

To specify more logic, you can also pass in a custom function (for your case, you may have to do this).

html="""
<html class="test">
<a class="wanted">yes</a>
<a class="not-wanted">no</a>
<p>yes</p>
<span>no</span>
</html>
"""
from bs4 import BeautifulSoup
from bs4 import SoupStrainer

def my_function(elem, attrs):
    # Called with the tag name and its attributes.
    # .get() avoids a KeyError for <a> tags that have no class attribute.
    if elem == 'a' and attrs.get('class') == "wanted":
        return True
    return elem == 'p'

custom_strainer = SoupStrainer(my_function)
soup = BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)

Output

<a class="wanted">yes</a><p>yes</p>

As the documentation puts it:

Parsing only part of a document won’t save you much time parsing the document, but it can save a lot of memory, and it’ll make searching the document much faster.

I think you should check out the Improving performance section of the documentation.
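
If you want to see whether a strainer pays off for your own pages, you can time it. The following is a rough sketch, not a proper benchmark: the synthetic HTML, the row count, and the helper names (parse_full, parse_strained, search) are all assumptions made for illustration, and the numbers will vary with the machine, the parser, and the document.

```python
import timeit
from bs4 import BeautifulSoup, SoupStrainer

# Synthetic page (an assumption, not real data): one target div plus filler per row.
row = '<div class="score">1</div><p>filler text</p><ul><li>x</li><li>y</li></ul>'
html = "<html><body>" + row * 300 + "</body></html>"

strainer = SoupStrainer(["div", "a", "span"])

def parse_full():
    # Build the complete tree.
    return BeautifulSoup(html, "html.parser")

def parse_strained():
    # Build a tree containing only div, a and span tags.
    return BeautifulSoup(html, "html.parser", parse_only=strainer)

def search(soup):
    return soup.find_all("div", {"class": "score"})

full_soup = parse_full()
small_soup = parse_strained()
full_hits = search(full_soup)
small_hits = search(small_soup)

# Time parsing and searching separately, since the documentation
# says the gain is mostly in memory and search time.
print("parse, full:      ", timeit.timeit(parse_full, number=3))
print("parse, strained:  ", timeit.timeit(parse_strained, number=3))
print("search, full:     ", timeit.timeit(lambda: search(full_soup), number=5))
print("search, strained: ", timeit.timeit(lambda: search(small_soup), number=5))
```

Both trees should yield the same matches; only the time and memory needed to get them should differ.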

Bitto Bennichan answered Dec 05 '25 17:12