Extending CSS selectors in BeautifulSoup

Tags:

python

css-selectors

html-parsing

beautifulsoup

lxml.html

The Question:

BeautifulSoup provides very limited support for CSS selectors. For instance, the only supported pseudo-class is nth-of-type, and it accepts only numerical values; arguments like even or odd are not allowed.

Is it possible to extend BeautifulSoup CSS selectors or let it use lxml.cssselect internally as an underlying CSS selection mechanism?

Let's take a look at an example problem/use case. Locate only even rows in the following HTML:

<table>
    <tr>
        <td>1</td>
    </tr>
    <tr>
        <td>2</td>
    </tr>
    <tr>
        <td>3</td>
    </tr>
    <tr>
        <td>4</td>
    </tr>
</table>

In lxml.html and lxml.cssselect, it is easy to do via :nth-of-type(even):

from lxml.html import fromstring
from lxml.cssselect import CSSSelector

tree = fromstring(data)

sel = CSSSelector('tr:nth-of-type(even)')

print([e.text_content().strip() for e in sel(tree)])

But, in BeautifulSoup:

print(soup.select("tr:nth-of-type(even)"))

would throw an error:

NotImplementedError: Only numeric values are currently supported for the nth-of-type pseudo-class.

Note that we can work around it with .find_all():

print([row.get_text(strip=True) for index, row in enumerate(soup.find_all("tr"), start=1) if index % 2 == 0])
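
The .find_all() workaround above can be generalized beyond "even". The following is only a minimal sketch and not part of BeautifulSoup: nth_matches and select_nth are hypothetical helper names, and the parser is assumed to handle just "even", "odd", a plain number, and the an+b form. It works on any sequence, so it is shown here with a plain list standing in for the rows.

```python
def nth_matches(position, spec):
    """Return True if the 1-based position satisfies a CSS nth expression."""
    spec = spec.replace(" ", "").lower()
    if spec == "even":
        a, b = 2, 0
    elif spec == "odd":
        a, b = 2, 1
    elif "n" in spec:
        # Split "an+b" into the step before "n" and the offset after it.
        step, _, offset = spec.partition("n")
        a = int(step + "1") if step in ("", "+", "-") else int(step)
        b = int(offset) if offset else 0
    else:
        # A plain number selects exactly one position.
        a, b = 0, int(spec)
    if a == 0:
        return position == b
    # position must equal a*k + b for some integer k >= 0.
    quotient, remainder = divmod(position - b, a)
    return remainder == 0 and quotient >= 0


def select_nth(items, spec):
    """Keep the items whose 1-based index matches the nth expression."""
    return [item for pos, item in enumerate(items, start=1)
            if nth_matches(pos, spec)]


rows = ["1", "2", "3", "4"]  # stand-in for the text of each <tr>
print(select_nth(rows, "even"))
```

With BeautifulSoup, the equivalent call would be select_nth(soup.find_all("tr"), "even"). Like the workaround, this counts over the whole result list, whereas a real nth-of-type selector counts separately within each parent element.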

asked Dec 21 '15 03:12

alecxe

2 Answers

After checking the source code, it seems that BeautifulSoup does not provide any convenient point in its interface to extend or monkey patch its existing functionality in this regard. Using functionality from lxml is not possible either since BeautifulSoup only uses lxml during parsing and uses the parsing results to create its own respective objects from them. The lxml objects are not preserved and cannot be accessed later.

That being said, with enough determination and with the flexibility and introspection capabilities of Python, anything is possible. You can modify the BeautifulSoup method internals even at run-time:

import inspect
import re
import textwrap

import bs4.element


def replace_code_lines(source, start_token, end_token,
                       replacement, escape_tokens=True):
    """Replace the source code between `start_token` and `end_token`
    in `source` with `replacement`. The `start_token` portion is included
    in the replaced code. If `escape_tokens` is True (default),
    escape the tokens to avoid them being treated as a regular expression."""

    if escape_tokens:
        start_token = re.escape(start_token)
        end_token = re.escape(end_token)

    def replace_with_indent(match):
        indent = match.group(1)
        return textwrap.indent(replacement, indent)

    return re.sub(r"^(\s+)({}[\s\S]+?)(?=^\1{})".format(start_token, end_token),
                  replace_with_indent, source, flags=re.MULTILINE)


# Get the source code of the Tag.select() method
src = textwrap.dedent(inspect.getsource(bs4.element.Tag.select))

# Replace the relevant part of the method
start_token = "if pseudo_type == 'nth-of-type':"
end_token = "else"
replacement = """\
if pseudo_type == 'nth-of-type':
    try:
        if pseudo_value in ("even", "odd"):
            pass
        else:
            pseudo_value = int(pseudo_value)
    except ValueError:
        raise NotImplementedError(
            'Only numeric values, "even" and "odd" are currently '
            'supported for the nth-of-type pseudo-class.')
    if isinstance(pseudo_value, int) and pseudo_value < 1:
        raise ValueError(
            'nth-of-type pseudo-class value must be at least 1.')
    class Counter(object):
        def __init__(self, destination):
            self.count = 0
            self.destination = destination

        def nth_child_of_type(self, tag):
            self.count += 1
            if pseudo_value == "even":
                return not bool(self.count % 2)
            elif pseudo_value == "odd":
                return bool(self.count % 2)
            elif self.count == self.destination:
                return True
            elif self.count > self.destination:
                # Stop the generator that's sending us
                # these things.
                raise StopIteration()
            return False
    checker = Counter(pseudo_value).nth_child_of_type
"""
new_src = replace_code_lines(src, start_token, end_token, replacement)

# Compile it and execute it in the target module's namespace
exec(new_src, bs4.element.__dict__)
# Monkey patch the target method
bs4.element.Tag.select = bs4.element.select

The portion of code being modified is the nth-of-type branch of Tag.select() in bs4.element.

Of course, this is anything but elegant or reliable. I don't envision this being seriously used anywhere, ever.
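
To see the mechanism in isolation, here is a toy sketch of the same idea: take a function's source text, swap in a new body, exec the result, and use the rebuilt function. Everything in it is hypothetical (ORIGINAL_SRC, PATCHED_BODY, patch_source and parse_nth are made-up names, and the source is a string literal rather than the output of inspect.getsource). It also replaces the whole body instead of locating a block with a regular expression, which is a deliberate simplification.

```python
import textwrap

# Hypothetical stand-in for library source code. In the answer above, this
# text comes from inspect.getsource(bs4.element.Tag.select).
ORIGINAL_SRC = textwrap.dedent('''\
    def parse_nth(value):
        try:
            return int(value)
        except ValueError:
            raise NotImplementedError("Only numeric values are supported.")
''')

# New body that additionally accepts "even" and "odd".
PATCHED_BODY = textwrap.dedent('''\
    if value in ("even", "odd"):
        return value
    try:
        return int(value)
    except ValueError:
        raise NotImplementedError(
            "Only numeric values, even and odd are supported.")
''')


def patch_source(source, replacement, indent="    "):
    """Keep the def line of `source` and use `replacement` as the new body."""
    header = source.splitlines()[0]
    return header + "\n" + textwrap.indent(replacement, indent)


# Compile the rewritten source in a fresh namespace and pull the function out.
namespace = {}
exec(patch_source(ORIGINAL_SRC, PATCHED_BODY), namespace)
parse_nth = namespace["parse_nth"]

print(parse_nth("even"), parse_nth("3"))
```

Executing into a separate dictionary keeps the experiment contained. The answer's code instead executes into bs4.element.__dict__ so that the rebuilt method can still see the module's own globals.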

answered Oct 13 '22 14:10

Martin Valgur

Officially, BeautifulSoup doesn't support all CSS selectors.

If Python is not the only choice, I strongly recommend jsoup (the Java equivalent of this). It supports all the CSS selectors.

It is open source (MIT license)
Syntax is easy
Supports all the CSS selectors
Can be used from multiple threads to scale up
Rich API support in Java for storing results in databases, so it is easy to integrate

Alternatively, if you still want to stick with Python, you can run your code under Jython and call jsoup from there.

http://jsoup.org/

https://github.com/jhy/jsoup/

answered Oct 13 '22 13:10

vivek_nk
