
Disable special "class" attribute handling

The Story:

When you parse HTML with BeautifulSoup, the class attribute is treated as a multi-valued attribute and is handled in a special manner:

Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes.

Here is also a comment from the source of HTMLTreeBuilder, the built-in class that BeautifulSoup uses as a base for other tree builders such as HTMLParserTreeBuilder:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
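To make the behaviour described in that comment concrete, here is a small sketch. It assumes the bs4 package is installed and uses made-up markup; the point is only that class is split into a list while an attribute such as id stays a plain string.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<p class="foo bar" id="x y">text</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

class_value = p["class"]  # multi-valued: parsed into a list of values
id_value = p["id"]        # single-valued: kept as one string
round_trip = str(p)       # on output the list is joined back into a string

print(class_value)
print(id_value)
print(round_trip)
```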

The Question:

How can I configure BeautifulSoup to treat class as an ordinary single-valued attribute? In other words, I don't want it to handle class specially; it should be just another regular attribute.

FYI, here is one use case where this would be helpful:

  • BeautifulSoup returns empty list when searching by compound class names
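As a rough illustration of how class searches behave with the default settings (this is not a reproduction of the linked question; the markup is made up and bs4 is assumed to be installed): a single class name matches any tag carrying it, the exact attribute string matches too, but a reordered variant of that string does not.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<p class="foo bar">one</p><p class="bar">two</p>'
soup = BeautifulSoup(html, "html.parser")

# Matches any tag that has "bar" among its classes.
any_class = soup.find_all("p", class_="bar")

# Matches the exact string value of the class attribute.
exact = soup.find_all("p", class_="foo bar")

# Same classes in a different order: no match.
reordered = soup.find_all("p", class_="bar foo")

print(len(any_class), len(exact), len(reordered))
```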

What I've tried:

I've actually made it work by making a custom tree builder class and removing class from the list of specially-handled attributes:

from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()

        # BeautifulSoup, please don't treat "class" specially
        self.cdata_list_attributes["*"].remove("class")


soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())

What I don't like about this approach is that it is quite "unnatural" and "magical", since it involves importing the "private" internal _htmlparser module. I hope there is a simpler way.

NOTE: I want to keep all other HTML parsing features, meaning I don't want to parse the HTML with "xml"-only features (which could have been another workaround).

alecxe asked Dec 15 '15 17:12


1 Answer

What I don't like about this approach is that it is quite "unnatural" and "magical", since it involves importing the "private" internal _htmlparser module. I hope there is a simpler way.

Yes, you can import it from bs4.builder instead:

from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()
        # BeautifulSoup, please don't treat "class" as a list
        self.cdata_list_attributes["*"].remove("class")


soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())

And if it matters enough that you don't want to repeat yourself, put the builder in its own module and register it with register_treebuilders_from() so that it takes precedence.
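For what it's worth, newer Beautiful Soup releases document a multi_valued_attributes constructor argument that makes the subclass unnecessary. A minimal sketch, assuming a bs4 version that supports this argument (4.8.0 or later, as far as I know) and made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html = '<p class="foo bar">text</p>'

# Passing None turns off multi-valued handling for every attribute,
# so "class" comes back as a plain string.
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)
class_value = soup.p["class"]

print(class_value)
```

Note that this switches off list handling for all multi-valued attributes (rel, headers and so on), not just class.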

dnozay answered Oct 09 '22 11:10