I'm using Django and Python 3.7. I want to speed up my HTML parsing. Currently, I'm looking for three types of elements in my document, like so:
import re
import urllib.request
from bs4 import BeautifulSoup
# fullurl and settings.HDR (the request headers) are defined elsewhere in my project
req = urllib.request.Request(fullurl, headers=settings.HDR)
html = urllib.request.urlopen(req).read()
comments_soup = BeautifulSoup(html, features="html.parser")
score_elts = comments_soup.find_all("div", {"class": "score"})
comments_elts = comments_soup.find_all("a", attrs={"class": "comments"})
bad_elts = comments_soup.find_all("span", string=re.compile("low score"))
I have read that SoupStrainer is one way to improve performance (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#parsing-only-part-of-a-document). However, all the examples only show parsing an HTML document with a single strainer. In my case, I have three. How can I pass three strainers into my parsing, or would that actually perform worse than just doing it the way I'm doing it now?
As far as I know, the BeautifulSoup constructor accepts only one SoupStrainer, through its parse_only argument, so you cannot pass several. What you can do instead is wrap all your conditions into a single strainer and pass that to the constructor.
For simple cases, such as matching on tag names only, you can pass a list to SoupStrainer:
html = """
<a>yes</a>
<p>yes</p>
<span>no</span>
"""
from bs4 import BeautifulSoup, SoupStrainer
# Keep only <a> and <p> tags while parsing
custom_strainer = SoupStrainer(["a", "p"])
soup = BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)
Output
<a>yes</a><p>yes</p>
For more complex logic, you can pass a custom function instead (for your three conditions, you will probably need to do this):
html="""
<html class="test">
<a class="wanted">yes</a>
<a class="not-wanted">no</a>
<p>yes</p>
<span>no</span>
</html>
"""
from bs4 import BeautifulSoup, SoupStrainer

def my_function(elem, attrs):
    # elem is the tag name; attrs holds the tag's attributes.
    # attrs.get avoids a KeyError for <a> tags that have no class.
    if elem == 'a' and attrs.get('class') == "wanted":
        return True
    elif elem == 'p':
        return True
    return False

custom_strainer = SoupStrainer(my_function)
soup = BeautifulSoup(html, "lxml", parse_only=custom_strainer)
print(soup)
Output
<a class="wanted">yes</a><p>yes</p>
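Applied to the three searches in the question, one possible sketch is a single strainer that lists the three tag names, with the more specific class and text conditions left to the usual find_all calls afterwards. The sample HTML and the name only_needed_tags are made up for illustration, and html.parser is used because that is what the question uses.

```python
import re
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample page standing in for the downloaded HTML.
html = """
<html><body>
<div class="score">42</div>
<div class="other">not a score</div>
<a class="comments" href="/c/1">3 comments</a>
<span>low score</span>
<p>never added to the tree</p>
</body></html>
"""

# One strainer covering every tag name the three searches need.
only_needed_tags = SoupStrainer(["div", "a", "span"])
soup = BeautifulSoup(html, "html.parser", parse_only=only_needed_tags)

# The finer conditions are applied on the (smaller) resulting tree.
score_elts = soup.find_all("div", class_="score")
comments_elts = soup.find_all("a", class_="comments")
bad_elts = soup.find_all("span", string=re.compile("low score"))

print(score_elts, comments_elts, bad_elts)
```

This keeps the strainer simple: every div, a and span still ends up in the tree, so it trims less than a custom function would, but the three searches stay exactly as they were.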
As the documentation says:
"Parsing only part of a document won’t save you much time parsing the document, but it can save a lot of memory, and it’ll make searching the document much faster."
I think you should also check out the "Improving performance" section of the documentation.
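Whether a strainer actually helps is easiest to settle by measuring on your own pages. Below is a rough sketch using timeit; the synthetic HTML, the function names and the repeat count are all made up for illustration, and the numbers will vary with the document and the parser, so treat it as a starting point and not as a benchmark.

```python
import timeit
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical synthetic page: 200 repeated rows, only some tags of interest.
row = '<div class="score">1</div><p>filler text</p><a class="comments">c</a>'
html = "<html><body>" + row * 200 + "</body></html>"

strainer = SoupStrainer(["div", "a"])

def full_parse():
    # Build the whole tree, then search it.
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.find_all("div", class_="score"))

def strained_parse():
    # Build a tree containing only <div> and <a>, then search it.
    soup = BeautifulSoup(html, "html.parser", parse_only=strainer)
    return len(soup.find_all("div", class_="score"))

full_time = timeit.timeit(full_parse, number=20)
strained_time = timeit.timeit(strained_parse, number=20)
print(f"full parse:     {full_time:.3f}s")
print(f"strained parse: {strained_time:.3f}s")
```

Both functions should find the same elements; only the time and memory spent getting there differ.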