Extracting lxml xpath for html table

Tags: python, html, html-table, xpath, lxml

I have an HTML document similar to the following:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns="http://www.w3.org/1999/xhtml">
    <div id="Symbols" class="cb">
    <table class="quotes">
    <tr><th>Code</th><th>Name</th>
        <th style="text-align:right;">High</th>
        <th style="text-align:right;">Low</th>
    </tr>
    <tr class="ro" onclick="location.href='/xyz.com/A.htm';" style="color:red;">
        <td><a href="/xyz.com/A.htm" title="Display,A">A</a></td>
        <td>A Inc.</td>
        <td align="right">45.44</td>
        <td align="right">44.26</td>
    <tr class="re" onclick="location.href='/xyz.com/B.htm';" style="color:red;">
        <td><a href="/xyz.com/B.htm" title="Display,B">B</a></td>
        <td>B Inc.</td>
        <td align="right">18.29</td>
        <td align="right">17.92</td>
</div></html>

I need to extract the code, name, high, and low values from the table.

I used the following code, taken from a similar example on Stack Overflow:

#############################
import urllib2
from lxml import html, etree

webpg = urllib2.urlopen('http://www.eoddata.com/stocklist/NYSE/A.htm').read()
table = html.fromstring(webpg)

for row in table.xpath('//table[@class="quotes"]/tbody/tr'):
    for column in row.xpath('./th[position()>0]/text() | ./td[position()=1]/a/text() | ./td[position()>1]/text()'):
        print column.strip(),
    print

#############################

I get no output. It only works if I change the XPath in the first loop from table.xpath('//table[@class="quotes"]/tbody/tr') to table.xpath('//tr').

I don't understand why xpath('//table[@class="quotes"]/tbody/tr') does not work.

889

asked Apr 07 '11 19:04

mkt2012

1 Answer

You are probably looking at the HTML in Firebug, correct? The browser inserts the implicit <tbody> tag when it is not present in the document, whereas the lxml library only processes the tags present in the raw HTML string.

Omit the tbody level in your XPath. For example, this works:

import lxml.html

tree = lxml.html.fromstring(raw_html)  # raw_html holds the page source as a string
tree.xpath('//table[@class="quotes"]/tr')
[<Element tr at 1014206d0>, <Element tr at 101420738>, <Element tr at 1014207a0>]
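
To see the difference end to end, here is a minimal sketch in Python 3 that runs both XPath expressions against a trimmed copy of the sample markup from the question and then pulls out the cell text. The names RAW_HTML and extract_rows are made up for this illustration, the markup is shortened (attributes such as onclick and style are dropped), and using text_content() on each cell is one possible way to read the values, not the only one.

```python
import lxml.html

# Trimmed copy of the sample markup from the question.
# As in the original page source, there is no <tbody> element.
RAW_HTML = """
<div id="Symbols" class="cb">
<table class="quotes">
<tr><th>Code</th><th>Name</th><th>High</th><th>Low</th></tr>
<tr class="ro"><td><a href="/xyz.com/A.htm">A</a></td><td>A Inc.</td>
    <td align="right">45.44</td><td align="right">44.26</td>
<tr class="re"><td><a href="/xyz.com/B.htm">B</a></td><td>B Inc.</td>
    <td align="right">18.29</td><td align="right">17.92</td>
</table>
</div>
"""


def extract_rows(raw_html):
    """Return every row of the quotes table as a list of cell strings.

    Hypothetical helper for this example; it is not part of lxml.
    """
    tree = lxml.html.fromstring(raw_html)
    rows = []
    # No /tbody step: match the <tr> elements as they appear in the source.
    for tr in tree.xpath('//table[@class="quotes"]/tr'):
        # "./th | ./td" selects header and data cells in document order;
        # text_content() also picks up text nested inside the <a> element.
        cells = tr.xpath('./th | ./td')
        rows.append([cell.text_content().strip() for cell in cells])
    return rows


tree = lxml.html.fromstring(RAW_HTML)

# The XPath from the question, including the browser-only tbody step.
with_tbody = tree.xpath('//table[@class="quotes"]/tbody/tr')
print(len(with_tbody))

rows = extract_rows(RAW_HTML)
for row in rows:
    print(row)
```

The tbody query should come back empty, while extract_rows returns the header row followed by one list per stock. If a page does contain a literal tbody in its source, an expression such as '//table[@class="quotes"]//tr' would match rows in both cases.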

191

answered Oct 10 '22 02:10

samplebias
