I am trying to parse reviews from this page: http://www.amazon.co.uk/product-reviews/B00143ZBHY
Using the following approach:
Code
html  # a variable containing the exact HTML of the page above
from lxml import etree
tree = etree.HTML(html)
r = tree.xpath(".//*[@id='productReviews']/tbody/tr/td[1]/div[9]/text()[4]")
print len(r)
print r[0].tag
Output
0
Traceback (most recent call last):
File "c.py", line 37, in <module>
print r[0].tag
IndexError: list index out of range
P.S.: When I use the same XPath in the XPath Checker add-on for Firefox, it finds the node easily, but I get no result here. Please help!
Try removing /tbody from the XPath; there is no <tbody> inside #productReviews in the raw HTML.
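A quick way to see this: lxml's HTML parser (libxml2) keeps the table as written, whereas the browser DOM that Firefox's XPath Checker works against silently inserts a <tbody> wrapper. A minimal check on a toy table:

from lxml import etree

# libxml2 (used by lxml.etree.HTML) does not insert <tbody>,
# unlike the browser DOM the Firefox add-on queries
snippet = "<table id='productReviews'><tr><td>review</td></tr></table>"
tree = etree.HTML(snippet)

print tree.xpath("//*[@id='productReviews']/tbody")          # [] -- no <tbody> node
print tree.xpath("//*[@id='productReviews']/tr/td/text()")   # ['review']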
import urllib2
from lxml import etree

# fetch the raw HTML exactly as the server sends it (no browser-added <tbody>)
html = urllib2.urlopen("http://www.amazon.co.uk/product-reviews/B00143ZBHY").read()
tree = etree.HTML(html)

# same XPath as in the question, minus /tbody
r = tree.xpath(".//*[@id='productReviews']/tr/td[1]/div[9]/text()[4]")
print r[0]
Output:
bought this as replacement for the original cover which came with my greenhouse and which ripped in the wind. so far this seems a good replacement although for some reason it seems slightly too small for my greenhouse so that i cant zip both sides of the front at the same time. seems sturdier and thicker than the cover i had before so hoping it lasts a bit longer!
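If you want more than that single text node, the same idea extends to a loop over the review <div>s. The div index and text() position in the question are tied to Amazon's markup at the time, so treat the following as a rough sketch under that assumption rather than a documented structure:

import urllib2
from lxml import etree

html = urllib2.urlopen("http://www.amazon.co.uk/product-reviews/B00143ZBHY").read()
tree = etree.HTML(html)

# assumption: each review sits in its own <div> inside td[1] of #productReviews,
# with the review body among that div's direct text nodes
for div in tree.xpath(".//*[@id='productReviews']/tr/td[1]/div"):
    texts = [t.strip() for t in div.xpath("text()") if t.strip()]
    if texts:
        print texts[-1]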