Python: Parsing HTML elements based on absolute XPath

Question

I'm working on a project where I have to parse 20 different HTML pages, each fetched from its own URL, and I would like to get some information from all of them. The pages have different structures, and the required information is in a different place on each site.

I thought I would give Python's lxml module a try. Since the information is in a different place on each site and I'm too lazy to put together 20 * X different regular expressions, I thought it would be a good idea to use absolute XPaths for these elements. That way I could simply use the Copy XPath feature of the Chrome browser, give my parser an exact path for each HTML element, and avoid writing much code.

I couldn't find any example that shows how to refer to an HTML element by its absolute XPath in Python. Some comments say it's better to use a relative path instead of an absolute one, but they don't really explain why. And again, referring to an element by its relative XPath means more coding work.

Just to make it more complicated, these 20 sites contain Unicode text.

Is there a way to refer to an HTML element by an absolute XPath in Python, such as the following?

/html/body/div[1]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[3]/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/font/b

...and it would give back the text value of the HTML element.

So far I have the following code, which works well with a relative XPath, but when I use an absolute one it gives me the error below.

```
import urllib2
from lxml import html
from bs4 import UnicodeDammit

response = urllib2.urlopen('http://oneofthesites.com')
content = response.read()
doc = UnicodeDammit(content, is_html=True)
parser = html.HTMLParser(encoding=doc.original_encoding)
root = html.document_fromstring(content, parser=parser)
data = root.find('/html/body/div[1]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[1]/td[2]/b').text_content()
print(data)
```

and the error is:

SyntaxError: cannot use absolute path on element

Maybe my basic concept is wrong, so any other ideas on how I can process these pages are welcome!

Thanks for your help in advance, g0m3z

Martijn Pieters · Accepted Answer

You are using html.document_fromstring(); this returns an Element, not an ElementTree object. With find(), absolute paths are only supported on the latter type.

You have two options:

Use html.parse(response) (note: pass the response object itself, not the result of response.read()); this returns a proper tree object.
Use a relative XPath expression. Simply replace /html with .; the top-level element is, after all, the <html> tag, and the rest is relative to that element:
```
data = root.find('./body/div[1]/table/tbody/tr[2]/td[2]/table/tbody/tr/td[2]/div/table/tbody/tr[1]/td[2]/b').text_content()
```
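To make the difference concrete, here is a minimal sketch (Python 3, lxml assumed to be installed) that runs against a small inline page instead of a real site. The `PAGE` markup and the shortened paths are made up for illustration and are not taken from the question's sites. It shows the error from `find()` on an Element, the relative-path fix, and, as an alternative not mentioned above, lxml's `xpath()` method, which evaluates full XPath and should accept the absolute path directly on either an Element or a tree.

```python
from io import BytesIO

from lxml import html

# Hypothetical stand-in for one of the downloaded pages (an assumption for this sketch).
PAGE = b"""<html><body>
<div><table>
<tr><td>first</td></tr>
<tr><td>label</td><td><b>wanted text</b></td></tr>
</table></div>
</body></html>"""

ABSOLUTE = '/html/body/div[1]/table/tr[2]/td[2]/b'
RELATIVE = './body/div[1]/table/tr[2]/td[2]/b'  # leading /html replaced with "."

# document_fromstring() returns the <html> Element, not an ElementTree.
root = html.document_fromstring(PAGE)

# find() on an Element rejects a path that starts with "/".
try:
    root.find(ABSOLUTE)
    find_error = None
except SyntaxError as exc:
    find_error = str(exc)

# Option 2: the relative path works with find() on the Element.
via_find = root.find(RELATIVE).text_content()

# Alternative: xpath() returns a list of matches and accepts the absolute path.
via_xpath = root.xpath(ABSOLUTE)[0].text_content()

# Option 1 variant: html.parse() takes a file-like object and returns a tree;
# xpath() is used here rather than find().
tree = html.parse(BytesIO(PAGE))
via_tree = tree.xpath(ABSOLUTE)[0].text_content()

print(find_error)
print(via_find, via_xpath, via_tree, sep=' | ')
```

One thing to check when pasting paths from Chrome's Copy XPath: the browser adds `tbody` elements to tables that lack them, while lxml parses the markup as delivered, so a copied path may need its `tbody` steps removed if the raw HTML does not contain them. The sample page above has none, which is why the paths here omit them.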

Tags: python, html, parsing, absolute, xpath
