Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting lxml xpath for html table

I have a html doc similar to following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
    <div id="Symbols" class="cb">
    <table class="quotes">
    <tr><th>Code</th><th>Name</th>
        <th style="text-align:right;">High</th>
        <th style="text-align:right;">Low</th>
    </tr>
    <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
        <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
        <td>A Inc.</td>
        <td align="right">45.44</td>
        <td align="right">44.26</td>
    <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
        <td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
        <td>B Inc.</td>
        <td align="right">18.29</td>
        <td align="right">17.92</td>
</div></html>

I need to extract code/name/high/low information from the table.

I used following code from one of the similar examples in Stack Over Flow:

#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen(http://www.eoddata.com/stocklist/NYSE/A.htm).read()
table = html.fromstring(webpg)

for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print

#############################

I am getting nothing output. I have to change the first loop xpath to table.xpath('//tr') from table.xpath('//table[@class="quotes"]/tbody/tr')

I just don't understand why the xpath('//table[@class="quotes"]/tbody/tr') not work.

like image 889
mkt2012 Avatar asked Apr 07 '11 19:04

mkt2012


People also ask

Can lxml parse HTML?

lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).

What is XPath in lxml?

The xpath() method For ElementTree, the xpath method performs a global XPath query against the document (if absolute) or against the root node (if relative): >>> f = StringIO('<foo><bar></bar></foo>') >>> tree = etree.

Is lxml standard Python library?

lxml is not written in plain Python, because it interfaces with two C libraries: libxml2 and libxslt.

What is lxml in Python?

lxml Module in Python. lxml module of Python is an XML toolkit that is basically a Pythonic binding of the following two C libraries: libxlst and libxml2. lxml module is a very unique and special module of Python as it offers a combination of XML features and speed.


1 Answers

You are probably looking at the HTML in Firebug, correct? The browser will insert the implicit tag <tbody> when it is not present in the document. The lxml library will only process the tags present in the raw HTML string.

Omit the tbody level in your XPath. For example, this works:

tree = lxml.html.fromstring(raw_html)
tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
like image 191
samplebias Avatar answered Oct 10 '22 02:10

samplebias