BeautifulSoup returns empty list when searching by compound class names

BeautifulSoup returns an empty list when I search for a compound class name (two classes in one attribute) using a regex.

Example:

import re
from bs4 import BeautifulSoup

bs = """
<a class="name-single name692" href="www.example.com">Example Text</a>
"""

bsObj = BeautifulSoup(bs, "html.parser")

# this returns the element
found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single.*)$"))

# this returns an empty list
found_elements = bsObj.find_all("a", class_=re.compile(r"^(name-single name\d*)$"))

I need the class selection to be very precise. Any ideas?

Ivan Bilan asked Dec 15 '15 12:12


2 Answers

Unfortunately, when you match a regular expression against a class attribute value that contains multiple classes, BeautifulSoup applies the regular expression to each class separately rather than to the whole attribute value. Here are the relevant topics about the problem:

  • Python regular expression for Beautiful Soup
  • Multiple CSS class search is unhandy
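
To make the difference concrete, here is a minimal sketch that uses only the standard re module and no BeautifulSoup at all. It imitates per-class matching by splitting the attribute value by hand; the variable names are mine, and the split is an assumption standing in for what the tree builder does.

```python
import re

# The class attribute from the question, split the way an HTML tree builder stores it.
class_value = "name-single name692"
classes = class_value.split()

pattern = re.compile(r"^name-single name\d+$")

# Per-class matching: the pattern is tried on "name-single" and "name692" in turn.
per_class_hit = any(pattern.search(c) for c in classes)

# Whole-value matching: the pattern is tried on the original string.
whole_value_hit = pattern.search(class_value) is not None

print(per_class_hit, whole_value_hit)  # False True
```

Neither individual class contains the space the pattern requires, so only the whole-value match succeeds.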

This happens because class is a special multi-valued attribute: every time you parse HTML, one of BeautifulSoup's tree builders (depending on the parser choice) internally splits the class string into a list of classes. Quoting the HTMLTreeBuilder docstring:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
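
You can see this behaviour directly on a parsed tag. The following is a small sketch, assuming bs4 is installed and the built-in html.parser is used; the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

html = '<a class="name-single name692" href="www.example.com">Example Text</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# "class" is multi-valued, so it is stored as a list of separate classes.
class_list = list(link["class"])

# "href" is an ordinary attribute and stays a single string.
href = link["href"]

# On output the list is joined back into one space-separated string.
rendered = str(link)

print(class_list)
print(href)
print(rendered)
```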

There are multiple workarounds. Here is a hack-ish one: ask BeautifulSoup not to handle class as a multi-valued attribute by writing a simple custom tree builder:

import re

from bs4 import BeautifulSoup
from bs4.builder._htmlparser import HTMLParserTreeBuilder


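class MyBuilder(HTMLParserTreeBuilder):
    def __init__(self):
        super(MyBuilder, self).__init__()

        # BeautifulSoup, please don't treat "class" specially.
        # Build a filtered copy instead of calling .remove() in place, so the
        # default mapping shared with other builders is left untouched.
        self.cdata_list_attributes = {
            tag: [attr for attr in attrs if attr != "class"]
            for tag, attrs in self.cdata_list_attributes.items()
        }


bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
bsObj = BeautifulSoup(bs, "html.parser", builder=MyBuilder())
found_elements = bsObj.find_all("a", class_=re.compile(r"^name-single name\d+$"))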

print(found_elements)

In this case the regular expression is applied to the class attribute value as a whole.
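
If your Beautiful Soup is recent enough, there may be a shorter route to the same effect. As far as I know, the constructor accepts a multi_valued_attributes=None keyword (I believe from 4.8.0 onwards, so treat the version as an assumption and check your install). A minimal sketch:

```python
import re

from bs4 import BeautifulSoup

html = '<a class="name-single name692" href="www.example.com">Example Text</a>'

# multi_valued_attributes=None asks the tree builder to keep every attribute,
# including class, as a single string instead of a list.
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

class_value = soup.a["class"]

pattern = re.compile(r"^name-single name\d+$")
found_elements = soup.find_all("a", class_=pattern)

print(repr(class_value))
print(found_elements)
```

Keep in mind that this changes how every multi-valued attribute is stored, not only class, so plain lookups like class_="name-single" will then compare against the full string.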


Alternatively, you can parse the HTML with the xml features enabled (if this is applicable; it requires lxml to be installed), because an XML parse keeps class as a single string:

soup = BeautifulSoup(data, "xml")

You can also use CSS selectors and match all a elements that have the name-single class or whose class attribute starts with "name":

soup.select("a.name-single,a[class^=name]")

You can then apply the regular expression manually if needed:

pattern = re.compile(r"^name-single name\d+$")
for elm in bsObj.select("a.name-single,a[class^=name]"):
    match = pattern.match(" ".join(elm["class"]))
    if match:
        print(elm)
alecxe answered Sep 28 '22 17:09


For this use case I would simply use a custom filter, like so:

import re

from bs4 import BeautifulSoup

# compile once; the pattern is matched against the whole class string
CLASS_PATTERN = re.compile(r"^name-single name\d+$")

def myclassfilter(tag):
    # tags without a class attribute yield an empty string and never match
    return CLASS_PATTERN.search(' '.join(tag.get('class', []))) is not None

bs = """<a class="name-single name692" href="www.example.com">Example Text</a>"""
bsObj = BeautifulSoup(bs, "html.parser")
found_elements = bsObj.find_all(myclassfilter)

print(found_elements)
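
find_all calls the filter function once per tag and keeps the tags for which it returns a true value. If you need this for several patterns, the same idea can be wrapped in a small factory. This is only a sketch: class_matches is a hypothetical helper name of mine, not part of bs4.

```python
import re

from bs4 import BeautifulSoup


def class_matches(pattern):
    """Hypothetical helper: build a find_all() filter for a whole-class-string regex."""
    compiled = re.compile(pattern)

    def _filter(tag):
        classes = tag.get("class") or []
        # class is normally a list, but can be a plain string (e.g. XML parsing)
        value = classes if isinstance(classes, str) else " ".join(classes)
        # fullmatch anchors the pattern at both ends of the class string
        return compiled.fullmatch(value) is not None

    return _filter


html = """
<a class="name-single name692" href="www.example.com">match</a>
<a class="name-single other" href="www.example.com">wrong second class</a>
<a class="name692 name-single" href="www.example.com">wrong order</a>
<a href="www.example.com">no class</a>
"""
soup = BeautifulSoup(html, "html.parser")
found = soup.find_all(class_matches(r"name-single name\d+"))

print([a.get_text() for a in found])  # ['match']
```

Note that matching the joined string makes the order of the classes significant, which is what you want if the selection has to be very precise.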
rll answered Sep 28 '22 17:09