How do I write a BeautifulSoup strainer that only parses objects with certain text between the tags?

I'm using Django and Python 3.7. I want to have more efficient parsing so I was reading about SoupStrainer objects. I created a custom one to help me parse only the elements I need ...

def my_custom_strainer(self, elem, attrs):
    for attr in attrs:
        print("attr:" + attr + "=" + attrs[attr])
    if elem == 'div' and 'class' in attr and attrs['class'] == "score":
        return True
    elif elem == "span" and elem.text == re.compile("my text"):
        return True

article_stat_page_strainer = SoupStrainer(self.my_custom_strainer)
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)

One of the conditions is I only want to parse "span" elements whose text matches a certain pattern. Hence the

elem == "span" and elem.text == re.compile("my text")

clause. However, this results in an

AttributeError: 'str' object has no attribute 'text'

error when I try and run the above. What's the proper way to write my strainer?

asked Feb 23 '19 03:02

Dave

2 Answers

TL;DR: No, this is currently not easily possible in BeautifulSoup (it would require modifying the BeautifulSoup and SoupStrainer objects).

Explanation:

The problem is that the function passed to the strainer gets called in the handle_starttag() method. As you can guess, at that point you only have the values from the opening tag (e.g. the element name and attrs).

https://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/init.py#L524

if (self.parse_only and len(self.tagStack) <= 1
    and (self.parse_only.text
         or not self.parse_only.search_tag(name, attrs))):
    return None

And as you can see, if your strainer function returns False, the element gets discarded immediately, without any chance to take its inner text into consideration (unfortunately).
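
To illustrate the ordering problem described above, here is a minimal sketch that uses only the standard library's html.parser.HTMLParser, which the "html.parser" builder in BeautifulSoup is built on. The EventLogger class and its events list are hypothetical names made up for this illustration; they are not part of BeautifulSoup. The sketch only records the order in which the parser events fire:

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser events in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Only the tag name and its attributes are known at this point.
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        # The inner text arrives later, as a separate event.
        self.events.append(("data", data))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = EventLogger()
logger.feed('<span class="x">my text</span>')
logger.close()

for event in logger.events:
    print(event)
```

The "start" event is logged before the "data" event, so any decision made in the start-tag handler (which is where the strainer check happens) has to be made without the text.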

On the other hand, if you add "text" to the search:

SoupStrainer(text="my text")

it will start to search the text inside the tags, but this has no context of the element or its attributes - you can see the irony :/

And combining the two will just find nothing. You can't even access the parent, as shown here in a find function: https://gist.github.com/RichardBronosky/4060082

So currently strainers are only good for filtering on elements/attrs. You would need to change a lot of Beautiful Soup code to get that working.

If you really need this, I suggest subclassing the BeautifulSoup and SoupStrainer classes and modifying their behavior.
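
A lighter-weight alternative, not part of the suggestion above, is to strain on tag names only and apply the text condition after parsing. The following is a rough sketch under that assumption; the sample markup and the variable names (only_div_and_span, matching_spans, scores) are hypothetical and chosen for illustration:

```python
import re

from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical sample markup, for illustration only.
html = """
<div class="score">42</div>
<span>my text here</span>
<span>other text</span>
<p>ignored paragraph</p>
"""

# Strain on tag names only; other top-level tags are skipped by the parser.
only_div_and_span = SoupStrainer(["div", "span"])
soup = BeautifulSoup(html, features="html.parser", parse_only=only_div_and_span)

# Apply the text condition after parsing, where the inner text is available.
pattern = re.compile("my text")
scores = soup.find_all("div", class_="score")
matching_spans = [s for s in soup.find_all("span") if pattern.search(s.get_text())]

print([d.get_text() for d in scores])
print([s.get_text() for s in matching_spans])
```

This still parses every span, including the ones whose text does not match, so it saves less work than a true text-aware strainer would. It does avoid touching Beautiful Soup internals.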

answered Sep 28 '22 09:09

darkless

It seems you are trying to loop over the soup elements in the my_custom_strainer method.

To do so, strain on the tag names only and call the function on the parsed soup:

article_stat_page_strainer = SoupStrainer(['div', 'span'])  # tag names only
soup = BeautifulSoup(html, features="html.parser", parse_only=article_stat_page_strainer)
result = my_custom_strainer(soup)

Then slightly modify my_custom_strainer to something like:

import re

def my_custom_strainer(soup):
    for d in soup.find_all(['div', 'span']):
        # "class" is multi-valued, so bs4 stores it as a list of class names
        if d.name == 'div' and 'score' in d.get('class', []):
            return d.text  # meet your needs here
        # use re.search: comparing a string to a compiled pattern with == is always False
        elif d.name == 'span' and re.search("my text", d.text):
            return d.text  # meet your needs here

This way you can access the soup objects iteratively, with both the attributes and the inner text available on each element.

answered Sep 28 '22 07:09

Evhz

Tags: python, python-3.x, parsing, beautifulsoup, django
