I'm working on a project where I have to parse 20 different HTML pages, fetched by URL, and extract some information from each of them. The pages have different structures, and the required information is in a different place on each site.
I thought I could give Python's lxml module a try. Since the information is in a different place on each site and I'm too lazy to put together 20 * X different regular expressions, I thought it would be a good idea to use absolute XPaths for these elements. That way I could simply use the Copy XPath feature of the Chrome browser to give my parser a clear path to each HTML element, without writing much code.
I couldn't find any example that shows how to refer to an HTML element by its absolute XPath in Python. Some comments say it's better to use a relative path than an absolute one, but they don't really explain why. And referring to an element by its relative XPath means more coding work again.
Just to make it more complicated, these 20 sites contain Unicode text.
Is there a way to refer to an HTML element with absolute XPath in Python and get back its text value like this?
/html/body/div[1]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[3]/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/font/b
...and it would give back the text value of the HTML element.
So far I have the following code, which works well with a relative XPath, but when I use an absolute one it gives me the error below.
import urllib2
from lxml import html
from bs4 import UnicodeDammit
response = urllib2.urlopen('http://oneofthesites.com')
content = response.read()
doc = UnicodeDammit(content, is_html=True)
parser = html.HTMLParser(encoding=doc.original_encoding)
root = html.document_fromstring(content, parser=parser)
data = root.find('/html/body/div[1]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[1]/td[2]/b').text_content()
print(data)
and the error is:
SyntaxError: cannot use absolute path on element
Maybe my basic concept is wrong, so any other ideas on how to process these pages are welcome!
Thanks for your help in advance, g0m3z
You are using html.document_fromstring(); this returns an Element, not an ElementTree object. find() only accepts absolute paths on the latter type.
You have two options:
1. Use html.parse(response) (note: pass the response itself, not the result of response.read()); this returns a proper tree object.
2. Use a relative XPath expression. Simply replace /html with . since the top-level element is the <html> tag itself, and the rest of the path is relative to that element:
data = root.find('./body/div[1]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[1]/td[2]/b').text_content()
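To see the difference without hitting a live site, here is a minimal sketch using the standard library's xml.etree.ElementTree, whose find() follows the same ElementPath rules that lxml's find() mirrors. The SAMPLE markup and the to_relative() helper are made up for illustration; they are not part of lxml or of the original code.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one of the 20 pages.
SAMPLE = (
    "<html><body><div><table><tr>"
    "<td><b>Price: 42</b></td>"
    "</tr></table></div></body></html>"
)


def to_relative(absolute_path):
    """Rewrite '/html/body/...' as './body/...' (illustrative helper)."""
    prefix = "/html"
    if absolute_path == prefix:
        return "."
    if absolute_path.startswith(prefix + "/"):
        return "." + absolute_path[len(prefix):]
    raise ValueError("expected a path starting with /html: %r" % absolute_path)


# fromstring() returns the root Element (<html>), not a tree object.
root = ET.fromstring(SAMPLE)

copied_path = "/html/body/div[1]/table/tr/td/b"

# find() on an Element rejects a path with a leading slash.
try:
    root.find(copied_path)
    absolute_error = None
except SyntaxError as exc:
    absolute_error = str(exc)

# The same path, made relative to the <html> element, is accepted.
relative_path = to_relative(copied_path)
relative_text = root.find(relative_path).text

print(absolute_error)
print(relative_path)
print(relative_text)
```

With lxml, the same relative string can be passed to root.find() as in the line above. If you would rather keep the copied path untouched, lxml elements also have an xpath() method, which evaluates full XPath rather than the ElementPath subset; as far as I know it accepts a leading /html and returns a list of matches instead of a single element.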