I have an HTML document similar to the following:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th>
<th style="text-align:right;">High</th>
<th style="text-align:right;">Low</th>
</tr>
<tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
<td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
<td>A Inc.</td>
<td align="right">45.44</td>
<td align="right">44.26</td>
<tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
<td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
<td>B Inc.</td>
<td align="right">18.29</td>
<td align="right">17.92</td>
</div></html>
I need to extract the code/name/high/low information from the table.
I used the following code, adapted from a similar example on Stack Overflow:
#############################
import urllib2
from lxml import html, etree
webpg = urllib2.urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)
for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print
#############################
I get no output at all. It only works if I change the XPath in the first loop from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').
I don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.
You are probably looking at the HTML in Firebug, correct? The browser will insert the implicit tag <tbody>
when it is not present in the document. The lxml library will only process the tags present in the raw HTML string.
Omit the tbody level in your XPath. For example, this works:
import lxml.html
tree = lxml.html.fromstring(raw_html)  # raw_html: the page source as a string
tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
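To make the difference concrete, here is a minimal sketch run against a trimmed copy of the table from the question. The names RAW_HTML and extract_rows are made up for illustration, and the inline styles and onclick attributes are left out for brevity. It uses the descendant axis (//tr) under the table, which should match the rows whether or not a <tbody> element sits in between, so the same query also copes with pages whose source does contain one.

```python
import lxml.html

# Trimmed copy of the sample table from the question.
# Note that the source contains no <tbody> element.
RAW_HTML = """
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
<tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td>
<td align="right">45.44</td><td align="right">44.26</td></tr>
<tr class="re"><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td>
<td align="right">18.29</td><td align="right">17.92</td></tr>
</table>
</div>
"""


def extract_rows(raw_html):
    """Return one list of cell strings per table row (hypothetical helper)."""
    tree = lxml.html.fromstring(raw_html)
    rows = []
    # '//tr' below the table matches rows with or without an intermediate <tbody>.
    for tr in tree.xpath('//table[@class="quotes"]//tr'):
        # text_content() also picks up the text inside the <a> in the first cell.
        cells = [cell.text_content().strip() for cell in tr.xpath('./th | ./td')]
        rows.append(cells)
    return rows


rows = extract_rows(RAW_HTML)
for cells in rows:
    print(cells)
```

With this input, the original query '//table[@class="quotes"]/tbody/tr' returns an empty list, which is why the loop in the question never printed anything.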