The Story:
When you parse HTML with BeautifulSoup, the class attribute is considered a multi-valued attribute and is handled in a special manner:
Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes.
Also, a quote from the built-in HTMLTreeBuilder, which BeautifulSoup uses as a base for other tree builder classes such as HTMLParserTreeBuilder:
# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'. When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
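To make the behavior described above concrete, here is a minimal sketch, assuming bs4 is installed and the stock "html.parser" backend is used. The sample markup and the variable names are made up for illustration; id is included only as a contrast, since it is not in the multi-valued list.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the same string in a multi-valued
# attribute (class) and in a single-valued one (id).
html = '<p class="foo bar" id="foo bar">text</p>'
soup = BeautifulSoup(html, "html.parser")

class_value = soup.p["class"]  # split on whitespace into a list of values
id_value = soup.p["id"]        # left alone as one string

print(class_value)
print(id_value)
```

The class value comes back as a list with two items, while id stays a single string.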
The Question:
How can I configure BeautifulSoup to handle class as a usual single-valued attribute? In other words, I don't want it to handle class specially; it should be treated as a regular attribute.
FYI, here is one of the use-cases when it can be helpful:
What I've tried:
I've actually made it work by making a custom tree builder class and removing class from the list of specially-handled attributes:
from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()
        # BeautifulSoup, please don't treat "class" specially
        self.cdata_list_attributes["*"].remove("class")

soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
What I don't like in this approach is that it is quite "unnatural" and "magical" involving importing "private" internal _htmlparser. I hope there is a simpler way.
NOTE: I want to keep all other HTML parsing related features, meaning I don't want to parse HTML with "xml"-only features (which could've been another workaround).
What I don't like in this approach is that it is quite "unnatural" and "magical" involving importing "private" internal _htmlparser. I hope there is a simpler way.
Yes, you can import it from bs4.builder instead:
from bs4 import BeautifulSoup
from bs4.builder import HTMLParserTreeBuilder

class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()
        # BeautifulSoup, please don't treat "class" as a list
        self.cdata_list_attributes["*"].remove("class")

soup = BeautifulSoup(data, "html.parser", builder=MyBuilder())
And if it's important enough that you don't want to repeat yourself, put the builder in its own module, and register it with register_treebuilders_from() so that it takes precedence.
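For what it's worth, more recent bs4 releases document a multi_valued_attributes keyword on the BeautifulSoup constructor, so a custom builder may not be needed at all. The sketch below assumes such a release (4.8.0 or later, if I recall the changelog correctly); the sample markup is made up.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two space-separated class values.
html = '<p class="foo bar">text</p>'

# Assumption: the installed bs4 accepts multi_valued_attributes.
# Passing None asks the tree builder not to split any attribute into a list.
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

class_value = soup.p["class"]  # stays a plain string
print(class_value)
```

Note that None switches off list handling for every multi-valued attribute (rel, headers and so on), not only class, so the custom builder above is still the narrower tool if you want to change class alone.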