How to get HTML elements with Python lxml

Tags: python, xml, lxml

I have this HTML code:

<table>
 <tr>
  <td class="test"><b><a href="">aaa</a></b></td>
  <td class="test">bbb</td>
  <td class="test">ccc</td>
  <td class="test"><small>ddd</small></td>
 </tr>
 <tr>
  <td class="test"><b><a href="">eee</a></b></td>
  <td class="test">fff</td>
  <td class="test">ggg</td>
  <td class="test"><small>hhh</small></td>
 </tr>
</table>

I use this Python code to extract the <td class="test"> cells with the lxml module:

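import urllib2  # Python 2; on Python 3 use urllib.request instead
import lxml.html

# Fetch the page and parse it into an lxml element tree
code   = urllib2.urlopen("http://www.example.com/page.html").read()
html   = lxml.html.fromstring(code)
# Keep only the 1st and 4th matching <td> of each row
result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')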

It works well! The result is:

<td class="test"><b><a href="">aaa</a></b></td>
<td class="test"><small>ddd</small></td>


<td class="test"><b><a href="">eee</a></b></td>
<td class="test"><small>hhh</small></td>

(that is, the first and the fourth column of each <tr>). Now I have to extract:

aaa (the title of the link)

ddd (text between <small> tag)

eee (the title of the link)

hhh (text between <small> tag)

How could I extract these values?

(The problem is that I have to remove the <b> tag and get the title of the anchor in the first column, and remove the <small> tag in the fourth column.)

Thank you!

Damiano asked Dec 17 '22 00:12


1 Answer

Calling el.text_content() on an element returns its text with all the tags stripped, i.e.:

result = [el.text_content() for el in result]
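
To try this without fetching a page, here is a minimal, self-contained sketch. It assumes Python 3 with lxml installed and inlines the table from the question as a string. The names sample_html, cells and texts are made up for illustration and are not part of the original code.

```python
import lxml.html

# The table from the question, inlined so no network access is needed
# (sample_html is an illustrative name, not from the original post).
sample_html = """
<table>
 <tr>
  <td class="test"><b><a href="">aaa</a></b></td>
  <td class="test">bbb</td>
  <td class="test">ccc</td>
  <td class="test"><small>ddd</small></td>
 </tr>
 <tr>
  <td class="test"><b><a href="">eee</a></b></td>
  <td class="test">fff</td>
  <td class="test">ggg</td>
  <td class="test"><small>hhh</small></td>
 </tr>
</table>
"""

doc = lxml.html.fromstring(sample_html)

# Same XPath as in the question: the 1st and 4th matching <td> of each row.
cells = doc.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# text_content() concatenates the text of the element and all its
# descendants, so the <b>, <a> and <small> wrappers disappear.
texts = [cell.text_content() for cell in cells]

print(texts)  # ['aaa', 'ddd', 'eee', 'hhh']
```

If the real page has whitespace or line breaks inside the cells, text_content() keeps them, so calling .strip() on each string may be needed.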

Ian Bicking answered Dec 28 '22 09:12