
Dealing with a colon in BeautifulSoup CSS selectors

Input HTML:

<div style="display: flex">
    <div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
    <div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
    <div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
</div>

The desired output: all div elements directly under <div style="display: flex">.

I'm trying to locate the parent div with a CSS selector:

div[style="display: flex"]

This throws an error:

>>> soup.select('div[style="display: flex"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/bs4/element.py", line 1400, in select
    'Only the following pseudo-classes are implemented: nth-of-type.')
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.

It looks like BeautifulSoup tries to interpret the colon as pseudo-class syntax.

I've tried following the advice given in "Handling a colon in an element ID in a CSS selector", but it still throws errors:

>>> soup.select('div[style="display\: flex"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/bs4/element.py", line 1400, in select
    'Only the following pseudo-classes are implemented: nth-of-type.')
NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type.
>>> soup.select('div[style="display\3A flex"]')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/.virtualenvs/so/lib/python2.7/site-packages/bs4/element.py", line 1426, in select
    'Unsupported or invalid CSS selector: "%s"' % token)
ValueError: Unsupported or invalid CSS selector: "div[style="displayA"

The Question:

What is the correct way to use/escape a colon in attribute values in BeautifulSoup CSS selectors?


Note that I can work around it with a partial attribute match:

soup.select("div[style$=flex]")

Or, with a find_all():

soup.find_all("div", style="display: flex")
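For reference, here is a minimal, self-contained sketch of both workarounds applied to the example HTML above. The standard-library "html.parser" backend and the variable names are assumptions made for illustration; any installed parser should behave the same way here.

```python
from bs4 import BeautifulSoup

# The example HTML from the question.
HTML = """
<div style="display: flex">
    <div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
    <div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
    <div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
</div>
"""

# "html.parser" is an assumption; it ships with Python, so no extra install is needed.
soup = BeautifulSoup(HTML, "html.parser")

# Workaround 1: suffix match on the attribute value, so the selector contains no colon.
parent_by_suffix = soup.select("div[style$=flex]")

# Workaround 2: find_all() takes the attribute value as a plain string,
# so no CSS selector parsing is involved.
parent_by_find = soup.find_all("div", style="display: flex")

# Collect only the div elements directly under the parent (no deeper descendants).
children = parent_by_find[0].find_all("div", recursive=False)
child_texts = [child.get_text(strip=True) for child in children]
print(child_texts)  # ['apple', 'peach', 'cucumber']
```

Both lookups return the same single parent element; recursive=False restricts the second find_all() to direct children.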

Also note that I understand that using style to locate elements is far from being a good location technique, but the question itself is meant to be generic and the provided HTML is just an example.

asked Jan 01 '16 04:01 by alecxe


1 Answer

Update: the issue is now fixed in BeautifulSoup 4.5.0. Upgrade if needed:

pip install --upgrade beautifulsoup4
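After upgrading, the selector from the question should work without any escaping. The following is a sketch under the assumption that the installed beautifulsoup4 is version 4.5.0 or later; the "html.parser" backend and the variable names are chosen for illustration only.

```python
import bs4
from bs4 import BeautifulSoup

# The example HTML from the question.
HTML = """
<div style="display: flex">
    <div class="half" style="font-size: 0.8em;width: 33%;"> apple </div>
    <div class="half" style="font-size: 0.8em;text-align: center;width: 28%;"> peach </div>
    <div class="half" style="font-size: 0.8em;text-align: right;width: 33%;" title="nofruit"> cucumber </div>
</div>
"""

# Check which version is installed; the fix is assumed to need 4.5.0 or later.
print(bs4.__version__)

soup = BeautifulSoup(HTML, "html.parser")

# Quoted attribute value containing a colon, plus the child combinator (>)
# to get only the div elements directly under the flex container.
flex_children = soup.select('div[style="display: flex"] > div')
flex_texts = [child.get_text(strip=True) for child in flex_children]
print(flex_texts)  # ['apple', 'peach', 'cucumber']
```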

Old answer:

I created an issue on the BeautifulSoup issue tracker:

  • Dealing with a colon in BeautifulSoup CSS selectors

I will update this answer if there is any progress on the Launchpad issue.

answered Oct 15 '22 22:10 by alecxe