Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

can we use XPath with BeautifulSoup?

People also ask

Can I use XPath with BeautifulSoup?

Nope, BeautifulSoup, by itself, does not support XPath expressions.

Can we use XPath in Python?

XPath Standard Functions. XPath includes over 200 built-in functions. There are functions for string values, numeric values, booleans, date and time comparison, node manipulation, sequence manipulation, and much more. These expressions can be used in JavaScript, Java, XML Schema, Python, and lots of other languages.

Can BeautifulSoup parse HTML?

The HTML content of the webpages can be parsed and scraped with Beautiful Soup.


Nope, BeautifulSoup, by itself, does not support XPath expressions.

An alternative library, lxml, does support XPath 1.0. It has a BeautifulSoup compatible mode where it'll try and parse broken HTML the way Soup does. However, the default lxml HTML parser does just as good a job of parsing broken HTML, and I believe is faster.

Once you've parsed your document into an lxml tree, you can use the .xpath() method to search for elements.

try:
    # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import urlopen
from lxml import etree

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
tree.xpath(xpathselector)

There is also a dedicated lxml.html() module with additional functionality.

Note that in the above example I passed the response object directly to lxml, as having the parser read directly from the stream is more efficient than reading the response into a large string first. To do the same with the requests library, you want to set stream=True and pass in the response.raw object after enabling transparent transport decompression:

import lxml.html
import requests

url =  "http://www.example.com/servlet/av/ResultTemplate=AVResult.html"
response = requests.get(url, stream=True)
response.raw.decode_content = True
tree = lxml.html.parse(response.raw)

Of possible interest to you is the CSS Selector support; the CSSSelector class translates CSS statements into XPath expressions, making your search for td.empformbody that much easier:

from lxml.cssselect import CSSSelector

td_empformbody = CSSSelector('td.empformbody')
for elem in td_empformbody(tree):
    # Do something with these table cells.

Coming full circle: BeautifulSoup itself does have very complete CSS selector support:

for cell in soup.select('table#foobar td.empformbody'):
    # Do something with these table cells.

I can confirm that there is no XPath support within Beautiful Soup.


As others have said, BeautifulSoup doesn't have xpath support. There are probably a number of ways to get something from an xpath, including using Selenium. However, here's a solution that works in either Python 2 or 3:

from lxml import html
import requests

page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
tree = html.fromstring(page.content)
#This will create a list of buyers:
buyers = tree.xpath('//div[@title="buyer-name"]/text()')
#This will create a list of prices
prices = tree.xpath('//span[@class="item-price"]/text()')

print('Buyers: ', buyers)
print('Prices: ', prices)

I used this as a reference.


BeautifulSoup has a function named findNext from current element directed childern,so:

father.findNext('div',{'class':'class_value'}).findNext('div',{'id':'id_value'}).findAll('a') 

Above code can imitate the following xpath:

div[class=class_value]/div[id=id_value]