Python library to do jQuery-like text extraction?

Tags:

python

jquery

css-selectors

beautifulsoup

I've got HTML that contains entries like this:

<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>

and I would like to extract the text "Blog Entry" (and a number of other attributes, so I'm looking for a generic answer).

In jQuery, I would do

$('.entry a[rel=bookmark]').text()

the closest I've been able to get in Python is:

from BeautifulSoup import BeautifulSoup
import soupselect as soup

rawsoup = BeautifulSoup(open('fname.html').read())

for entry in rawsoup.findAll('div', 'entry'):
    print soup.select(entry, 'a[rel=bookmark]')[0].string.strip()

(soupselect is from http://code.google.com/p/soupselect/.)

However, soupselect doesn't understand the full CSS3 selector syntax the way jQuery does. Is there such a beast in Python?

asked Dec 13 '10 07:12

thebjorn

2 Answers

You might want to take a look at lxml's CSSSelector class, which tries to implement CSS selectors as described in the W3C specification. As a side note, many people now recommend lxml over BeautifulSoup for parsing HTML/XML, for performance and other reasons.

As far as I know, lxml's CSSSelector translates the CSS expression into XPath and uses that for element selection, but you might want to check the documentation for yourself. Here's your example with lxml:

>>> from lxml.cssselect import CSSSelector
>>> from lxml.html import fromstring
>>> html = '<div class="entry"><h3 class="foo"><a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a></h3></div>'
>>> h = fromstring(html)
>>> sel = CSSSelector("a[rel=bookmark]")
>>> [e.text for e in sel(h)]
['Blog Entry']

answered Sep 26 '22 22:09

Haes

You might also want to have a look at pyquery, a jQuery-like library for Python. It is available on PyPI as `pyquery`.

answered Sep 25 '22 22:09

Aman Aggarwal
