I'm using Django and Python 3.7. I want to have more efficient parsing so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...
    def my_custom_strainer(self, elem, attrs):
        for attr in attrs:
            print("attr:" + attr + "=" + attrs[attr])

        if elem == 'div' and 'class' in attr and attrs['class'] == "score":
            return True
        elif elem == "span" and elem.text == re.compile("my text"):
            return True

    article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
    soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the
elem == "span" and elem.text == re.compile("my text")
clause. However, this results in an
AttributeError: 'str' object has no attribute 'text'
error when I try and run the above. What's the proper way to write my strainer?
The SoupStrainer class in BeautifulSoup lets you parse only specific parts of an incoming document.
The BeautifulSoup object represents the parsed document as a whole, that is, the complete document being scraped. For most purposes, you can treat it as a Tag object.
TL;DR: No, this is currently not easily possible in BeautifulSoup (the BeautifulSoup and SoupStrainer objects would need to be modified).
Explanation:
The problem is that the function passed to the strainer is called from the handle_starttag() method. At that point only the contents of the opening tag are available (e.g. the element name and attrs), which is why elem is a plain str with no .text attribute.
https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/init.py#L524
    if (self.parse_only and len(self.tagStack) <= 1
        and (self.parse_only.text
             or not self.parse_only.search_tag(name, attrs))):
        return None
As you can see, if your strainer function returns False, the element is discarded immediately, without any chance to take its inner text into consideration (unfortunately).
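To illustrate the event order, here is a small sketch using only Python's standard html.parser.HTMLParser, which is what BeautifulSoup's "html.parser" builder sits on top of. The EventLogger class and the log_events helper are made up for this illustration and are not part of BeautifulSoup. It records each callback as it fires, showing that the start-tag callback sees only the name and attributes, while the text arrives in a later, separate event:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name (a str) and its attributes are known here;
        # the text inside the tag has not been read yet.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # The inner text shows up as its own event, after the start tag.
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def log_events(html):
    """Feed the markup to the parser and return the recorded events."""
    parser = EventLogger()
    parser.feed(html)
    parser.close()
    return parser.events


events = log_events('<span class="score">my text</span>')
for event in events:
    print(event)
```

A decision made at the "start" event therefore cannot depend on "my text", because that data has not been delivered yet.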
On the other hand, if you add "text" to the search:
    SoupStrainer(text="my text")
it will start searching inside tags for text, but that search has no context of the element or its attributes. You can see the irony :/
Combining the two will just find nothing. And you can't even access the parent, as is done with the find function here: https://gist.github.com/RichardBronosky/4060082
So currently strainers are only good for filtering on elements/attrs. You would need to change a lot of BeautifulSoup code to get that working.
If you really need this, I suggest subclassing the BeautifulSoup and SoupStrainer classes and modifying their behavior.
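A workaround that avoids patching the library is to do it in two stages: strain on tag names only, then filter on attributes and text once the reduced tree exists. A minimal sketch, assuming sample markup invented here and a hypothetical helper named extract_scores_and_spans; it uses SoupStrainer with a list of names plus find_all with class_ and string, not a modified strainer:

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, made up for this sketch.
SAMPLE_HTML = """
<div class="score">42</div>
<div class="other">ignored</div>
<p>paragraph <span>my text here</span></p>
<span>unrelated</span>
"""


def extract_scores_and_spans(html):
    # Stage 1: the strainer filters on tag names only, which is all
    # that is known when an opening tag is seen.
    only_div_and_span = SoupStrainer(["div", "span"])
    soup = BeautifulSoup(html, features="html.parser",
                         parse_only=only_div_and_span)

    # Stage 2: with the tree built, attributes and text can be checked.
    scores = [d.get_text() for d in soup.find_all("div", class_="score")]
    spans = soup.find_all("span", string=re.compile("my text"))
    span_texts = [s.get_text() for s in spans]
    return scores, span_texts


scores, span_texts = extract_scores_and_spans(SAMPLE_HTML)
print(scores)
print(span_texts)
```

The trade-off is that every div and span is still parsed; only content outside any div or span is skipped.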
It seems you are trying to loop over soup elements inside the my_custom_strainer method. To do so, parse the document first (with a strainer that filters on tag names and attributes only) and then pass the resulting soup to the function:
    soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
    result = my_custom_strainer(soup)
Then modify my_custom_strainer to something like:
    import re

    def my_custom_strainer(soup):
        for d in soup.find_all(['div', 'span']):
            # class is multi-valued, so d.get('class') is a list of class names
            if d.name == 'div' and 'score' in d.get('class', []):
                return d.text  # meet your needs here
            # search with the pattern; == against a compiled pattern is always False
            elif d.name == 'span' and re.search("my text", d.text):
                return d.text  # meet your needs here
This way you can access the soup objects iteratively.
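One more thing worth checking in the text condition: comparing a string to re.compile(...) with == never succeeds, because a str is never equal to a compiled pattern object. A short sketch of the difference, using a sample string made up for illustration:

```python
import re

pattern = re.compile("my text")
text = "this is my text, more or less"  # hypothetical sample string

# A str never equals a compiled pattern object, so this is always False.
equal = (text == pattern)

# search() looks for the pattern anywhere in the string.
found = pattern.search(text) is not None

# fullmatch() requires the whole string to match the pattern.
exact = pattern.fullmatch(text) is not None

print(equal, found, exact)
```

Use search() when the pattern may appear anywhere in the element's text, and fullmatch() when the whole text has to match.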