python parse html table using lxml

Tags:

python

html

html-table

lxml

I have an HTML table like this:

<TABLE>
<TR>
    <TD><P>Name</P></TD>
    <TD><P>Fees</P></TD>
    <TD><P>Awards</P></TD>
    <TD><P>Total</P></TD>
</TR>
<TR>
    <TD><P>Tony</P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>
<TR>
    <TD><P>Paul</FONT></P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>
<TR>
    <TD><P>Richard</P></TD>
    <TD >7,800</TD>
    <TD >7</TD>
    <TD>15,400</TD>
</TR>

</TR>
</TABLE>

I want to extract the values of the table. I tried the following:

import lxml.html
html = lxml.html.parse(html_table)  # html_table: a filename, URL, or file object
text_value = html.xpath('//tr/td/text()')
packages = html.xpath('//tr/td/p')
p_content = [p.text_content() for p in packages]

Is there any way to extract both the <p> text and the <td> text into a single list?

asked Dec 06 '13 07:12

Kishore K

1 Answer

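You could do something like this, using //text() to select the text nodes at any depth beneath each <td>, whether or not they are wrapped in a <p>: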

>>> doc = """<TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>
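
To see why the double slash matters, here is a minimal sketch comparing the two expressions. The two-row table is a shortened stand-in for the one in the question (an assumption made for brevity), and the names direct and nested are only illustrative.

```python
import lxml.html

# Shortened stand-in for the table in the question (assumption for brevity)
doc = """<TABLE>
<TR>
    <TD><P>Name</P></TD>
    <TD><P>Fees</P></TD>
</TR>
<TR>
    <TD><P>Tony</P></TD>
    <TD >7,800</TD>
</TR>
</TABLE>"""

root = lxml.html.fromstring(doc)

# td/text() only matches text nodes that are direct children of <td>,
# so cells whose text sits inside a <p> contribute nothing
direct = [t.strip() for t in root.xpath('//tr/td/text()') if t.strip()]

# td//text() matches text nodes at any depth below <td>, including inside <p>
nested = [t.strip() for t in root.xpath('//tr/td//text()') if t.strip()]

print(direct)  # ['7,800']
print(nested)  # ['Name', 'Fees', 'Tony', '7,800']
```

The strip() filtering is not strictly needed for this input; it is there in case real-world cells contain surrounding whitespace.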

If you have two tables in the document, you can first loop over the tables and then use a relative XPath expression (with a leading .) to select the descendant text nodes of each table:

>>> doc = """<TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>
... <TABLE>
... <TR>
...     <TD><P>Name</P></TD>
...     <TD><P>Fees</P></TD>
...     <TD><P>Awards</P></TD>
...     <TD><P>Total</P></TD>
... </TR>
... <TR>
...     <TD><P>Tony</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Paul</FONT></P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... <TR>
...     <TD><P>Richard</P></TD>
...     <TD >7,800</TD>
...     <TD >7</TD>
...     <TD>15,400</TD>
... </TR>
... 
... </TR>
... </TABLE>"""
>>> import lxml.html
>>> root = lxml.html.fromstring(doc)
>>> root.xpath('//tr/td//text()')
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400', 'Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>> for tbl in root.xpath('//table'):
...     elements = tbl.xpath('.//tr/td//text()')
...     print(elements)
... 
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
['Name', 'Fees', 'Awards', 'Total', 'Tony', '7,800', '7', '15,400', 'Paul', '7,800', '7', '15,400', 'Richard', '7,800', '7', '15,400']
>>>
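
If you would rather keep the cells grouped by row instead of one flat list per table, the same relative-XPath idea can be taken one level further. This is only a sketch: the helper name table_rows and the two small tables are made up for illustration, and it uses text_content() to collect all the text inside each cell.

```python
import lxml.html

# Two small tables, shortened from the question's data (assumption for brevity)
doc = """<TABLE>
<TR><TD><P>Name</P></TD><TD><P>Fees</P></TD></TR>
<TR><TD><P>Tony</P></TD><TD >7,800</TD></TR>
</TABLE>
<TABLE>
<TR><TD><P>Name</P></TD><TD><P>Awards</P></TD></TR>
<TR><TD><P>Paul</P></TD><TD >7</TD></TR>
</TABLE>"""

root = lxml.html.fromstring(doc)


def table_rows(table):
    """Return one list of cell strings per <tr> found under the given table."""
    return [
        # ./td is relative to the current row; text_content() joins all
        # text inside the cell, including text nested in <p>
        [td.text_content().strip() for td in tr.xpath('./td')]
        for tr in table.xpath('.//tr')
    ]


tables = [table_rows(tbl) for tbl in root.xpath('//table')]
for rows in tables:
    print(rows)
# [['Name', 'Fees'], ['Tony', '7,800']]
# [['Name', 'Awards'], ['Paul', '7']]
```

Note that .//tr would also pick up rows of any table nested inside a cell, so this layout assumes the tables are not nested.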

answered Oct 02 '22 22:10

paul trmbrth
