I've got HTML that contains entries like this:
<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>
and I would like to extract the text "Blog Entry" (and a number of other attributes, so I'm looking for a generic answer).
In jQuery, I would do
$('.entry a[rel=bookmark]').text()
The closest I've been able to get in Python is:
from BeautifulSoup import BeautifulSoup
import soupselect as soup
rawsoup = BeautifulSoup(open('fname.html').read())
for entry in rawsoup.findAll('div', 'entry'):
    print soup.select(entry, 'a[rel=bookmark]')[0].string.strip()
soupselect is from http://code.google.com/p/soupselect/.
However, soupselect doesn't understand the full CSS3 selector syntax the way jQuery does. Is there such a beast in Python?
You could use $('.gettext').text(); in jQuery.
The :first selector selects the first element. Note: This selector can only select one single element. Use the :first-child selector to select more than one element (one for each parent).
You might want to take a look at lxml's CSSSelector class, which tries to implement CSS selectors as described in the W3C specification. As a side note, many folks now recommend lxml over BeautifulSoup for parsing HTML/XML, for performance and other reasons.
I think lxml's CSSSelector translates the CSS expression into XPath for element selection, but you might want to check the documentation for yourself. Here's your example with lxml:
>>> from lxml.cssselect import CSSSelector
>>> from lxml.html import fromstring
>>> html = '<div class="entry"><h3 class="foo"><a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a></h3></div>'
>>> h = fromstring(html)
>>> sel = CSSSelector("a[rel=bookmark]")
>>> [e.text for e in sel(h)]
['Blog Entry']
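To illustrate the XPath idea mentioned above, here is a minimal sketch that uses only the standard library's xml.etree.ElementTree instead of lxml. The XPath expression is hand-written as a rough counterpart of the CSS selector a[rel=bookmark]; it is an assumption for illustration, not the expression lxml actually generates. It also relies on the sample markup being well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Same markup as in the question; it parses here only because it is well-formed XML.
html = (
    '<div class="entry"><h3 class="foo">'
    '<a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a>'
    '</h3></div>'
)

root = ET.fromstring(html)

# Hand-written XPath counterpart of the CSS selector "a[rel=bookmark]".
# ElementTree supports only a subset of XPath; this attribute predicate is within it.
links = root.findall(".//a[@rel='bookmark']")

# Pull out the link text and another attribute from each match.
texts = [a.text for a in links]
hrefs = [a.get("href") for a in links]

print(texts)
print(hrefs)
```

The same pattern extends to other attributes via a.get(...). For class selectors such as .entry, or for messy HTML, a real CSS selector implementation like lxml's is the better fit.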
You might also want to have a look at pyquery, a jQuery-like library for Python.