
Python library to do jQuery-like text extraction?

I've got html that contains entries like this:

<div class="entry">
  <h3 class="foo">
    <a href="http://www.example.com/blog-entry-slug"
    rel="bookmark">Blog Entry</a>
  </h3>
  ...
</div>

and I would like to extract the text "Blog Entry" (and a number of other attributes, so I'm looking for a generic answer).

In jQuery, I would do:

$('.entry a[rel=bookmark]').text()

The closest I've been able to get in Python is:

from BeautifulSoup import BeautifulSoup
import soupselect as soup

rawsoup = BeautifulSoup(open('fname.html').read())

for entry in rawsoup.findAll('div', 'entry'):
    print soup.select(entry, 'a[rel=bookmark]')[0].string.strip()

soupselect is from http://code.google.com/p/soupselect/.

However, soupselect doesn't understand the full CSS3 selector syntax the way jQuery does. Is there a Python library that does?

thebjorn asked Dec 13 '10 07:12



2 Answers

You might want to take a look at lxml's CSSSelector class, which tries to implement CSS selectors as described in the W3C specification. As a side note, many people now recommend lxml over BeautifulSoup for parsing HTML/XML, for performance and other reasons.

lxml's CSSSelector translates the CSS expression into XPath and uses that for element selection; check the lxml documentation for the details. Here's your example with lxml:

>>> from lxml.cssselect import CSSSelector
>>> from lxml.html import fromstring
>>> html = '<div class="entry"><h3 class="foo"><a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a></h3></div>'
>>> h = fromstring(html)
>>> sel = CSSSelector("a[rel=bookmark]")
>>> [e.text for e in sel(h)]
['Blog Entry']
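
To make the XPath remark above concrete, here is a minimal sketch that selects the same element with a hand-written XPath expression, using only the standard library's xml.etree.ElementTree. The expression is an assumption for illustration, not the one lxml generates, and ElementTree supports only a limited XPath subset. It also requires well-formed XML, so it works on this snippet but not on arbitrary real-world HTML.

```python
import xml.etree.ElementTree as ET

html = ('<div class="entry"><h3 class="foo">'
        '<a href="http://www.example.com/blog-entry-slug" rel="bookmark">Blog Entry</a>'
        '</h3></div>')

# Wrap the snippet in a synthetic root so the div itself is a descendant
# that the search can find.
root = ET.fromstring("<root>" + html + "</root>")

# Hand-written XPath counterpart of the CSS selector '.entry a[rel=bookmark]'.
# Caveat: @class='entry' compares the whole attribute value, whereas the CSS
# class selector '.entry' matches any element whose class list contains 'entry'.
xpath = ".//div[@class='entry']//a[@rel='bookmark']"

titles = [a.text.strip() for a in root.findall(xpath)]
print(titles)
```

The class caveat in the comment is one reason a CSS-to-XPath translator is more convenient than writing such expressions by hand.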

Haes answered Sep 26 '22 22:09


You might also want to have a look at pyquery, a jQuery-like library for Python.

Aman Aggarwal answered Sep 25 '22 22:09