
How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

I'm using Django and Python 3.7. I want to have more efficient parsing so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...

def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attr and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)

One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the

elem == "span" and elem.text == re.compile("my text")

clause. However, this results in an

AttributeError: 'str' object has no attribute 'text'

error when I try to run the above. What's the proper way to write my strainer?

Dave asked Feb 23 '19 03:02



2 Answers

TL;DR: No, this is currently not easily possible in BeautifulSoup (you would need to modify the BeautifulSoup and SoupStrainer objects).

Explanation:

The problem is that the function passed to the strainer gets called from the handle_starttag() method. As you can guess, at that point you only have the values from the opening tag (i.e. the element name and attrs).

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/init.py#L524

if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None

As you can see, if your strainer function returns False, the element is discarded immediately, without any chance to take its inner text into consideration (unfortunately).
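To make the ordering concrete, here is a minimal sketch using the standard library's HTMLParser, which the "html.parser" builder from the question is built on. EventLogger is a hypothetical helper written only for this illustration, not part of BeautifulSoup. It records parser events in arrival order, showing that the start tag (name and attributes) is reported before any inner text has been read:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser events in the order they arrive."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name and attributes are known at this point;
        # the inner text has not been read yet.
        self.events.append(("start", tag, dict(attrs)))

    def handle_data(self, data):
        # The text arrives later, as a separate event.
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<span class="note">my text</span>')
logger.close()

for event in logger.events:
    print(event)
```

A decision taken at the "start" event therefore cannot depend on the "data" event that follows it, which is why a strainer deciding on the opening tag cannot look at the text.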

On the other hand, you can add "text" to the search:

SoupStrainer(text="my text")

This makes the strainer search the text inside tags, but that search has no context of the enclosing element or its attributes. You can see the irony :/

Combining the two will just find nothing. And you can't even access the parent, as is done in the find function shown here: https://gist.github.com/RichardBronosky/4060082

So currently strainers are only good for filtering on elements/attrs. You would need to change a lot of BeautifulSoup code to get text matching working.

If you really need this, I suggest subclassing BeautifulSoup and SoupStrainer and modifying their behavior.

darkless answered Sep 28 '22 09:09


It seems you are trying to loop over the soup elements inside the my_custom_strainer method.

To do that, parse the document first and then call the function on the resulting soup (here article_stat_page_strainer should filter on tag names or attributes only):

soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
my_custom_strainer(soup)

Then modify my_custom_strainer to something like this:

import re

def my_custom_strainer(soup):
    # Each d is a parsed Tag, so d.text is available here
    for d in soup.find_all(['div', 'span']):
        if d.name == 'div' and 'score' in d.get('class', []):
            return d.text  # meet your needs here
        elif d.name == 'span' and re.search("my text", d.text):
            return d.text  # meet your needs here

This way you can access the soup objects iteratively.
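As a rough end-to-end sketch of this two-step idea (strain on tag names while parsing, then filter on text afterwards), the following may help. The sample markup and the names only_div_span, pattern, wanted and matches are assumptions made for the illustration and are not from the question:

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, not taken from the question.
html = (
    '<div class="score">42</div>'
    '<span>my text here</span>'
    '<span>something else</span>'
    '<p>ignored</p>'
)

# Step 1: strain on tag names only, which are known when the opening tag is seen.
only_div_span = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, features="html.parser", parse_only=only_div_span)

pattern = re.compile("my text")


def wanted(tag):
    # Step 2: full Tag objects exist now, so attributes and text can both be checked.
    if tag.name == "div" and "score" in tag.get("class", []):
        return True
    return tag.name == "span" and pattern.search(tag.get_text()) is not None


matches = [tag.get_text() for tag in soup.find_all(wanted)]
print(matches)
```

The strainer still saves work by skipping everything that is not a div or span; the text condition is applied only to the much smaller tree that remains.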

Evhz answered Sep 28 '22 07:09