
Parsing HTML data with lxml

I'm a beginner at coding, and a friend of mine told me to use BeautifulSoup instead of htmlparser. After running into some problems, I got a tip to use lxml instead of BeautifulSoup because it is supposedly 10x better.

I'm hoping someone can give me a hint on how to scrape the text I'm looking for.

What I want is to find a table with the following rows and data:

<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>              
    <td><a href="spam2.com">spam2</a></td>
</tr>

How do I use lxml to extract the website URL together with info1 and info2 from each row, skipping the spam link, so that I get the following result?

[['url', 'info1', 'info2'], ['url', 'info1', 'info2']]
Retrace asked Feb 01 '26 23:02



2 Answers

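import lxml.html as lh

# your_html is the page source as a string
tree = lh.fromstring(your_html)

result = []
# "//tr" matches rows anywhere in the document, not only direct children of the root
for row in tree.xpath("//tr"):
    # Keep the first three cells; the fourth one holds the spam link
    url, info1, info2 = row.xpath("td")[:3]
    result.append([url.xpath("a")[0].attrib['href'],
                   info1.text_content(),
                   info2.text_content()])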

Result:

[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
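
For anyone who wants to run the loop above end to end, here is a minimal sketch. It assumes the rows sit inside a <table> wrapper (not shown in the question) and that the wanted values are always in the first three cells. The names SAMPLE_HTML and extract_rows are made up for illustration. Rows with fewer than three cells, or without a link in the first cell, are skipped instead of raising an error.

```python
import lxml.html as lh

# Sample rows from the question; the <table> wrapper is an assumption.
SAMPLE_HTML = """
<table>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam2.com">spam2</a></td>
</tr>
</table>
"""


def extract_rows(html):
    """Return [url, info1, info2] for every row that has a link in its first cell."""
    tree = lh.fromstring(html)
    rows = []
    for row in tree.xpath("//tr"):
        cells = row.xpath("td")
        if len(cells) < 3:
            continue  # header rows or malformed rows
        hrefs = cells[0].xpath("a/@href")
        if not hrefs:
            continue  # first cell has no link
        rows.append([
            str(hrefs[0]),
            cells[1].text_content().strip(),
            cells[2].text_content().strip(),
        ])
    return rows


if __name__ == "__main__":
    print(extract_rows(SAMPLE_HTML))
```

With the sample markup this should give the same list as the result shown above. The length and link checks are a design choice for messy real-world tables; drop them if you would rather fail loudly on unexpected rows.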
Acorn answered Feb 04 '26 13:02



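I use the XPath expression td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text(). It selects the href of every link whose text does not contain "spam", plus the text of every cell that has no link: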

$ python3
>>> import lxml.html
>>> doc = lxml.html.parse('data.xml')
>>> [[j for j in i.xpath('td/a[not(contains(.,"spam"))]/@href | td[not(a)]/text()')] for i in doc.xpath('//tr')]
[['website1.com', 'info1', 'info2'], ['website2.com', 'info1', 'info2']]
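
If the unwanted links cannot be recognized by the word "spam" in their text, one alternative is to pick the cells by position. The sketch below assumes the URL is always in the first cell and the info values are in the second and third. SAMPLE_HTML, CELL_XPATH and extract_rows_by_position are hypothetical names, and the markup is the question's sample wrapped in an assumed <table>.

```python
import lxml.html as lh

# Sample rows from the question; the <table> wrapper is an assumption.
SAMPLE_HTML = """<table>
<tr>
    <td><a href="website1.com">website1</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam1.com">spam1</a></td>
</tr>
<tr>
    <td><a href="website2.com">website2</a></td>
    <td>info1</td>
    <td>info2</td>
    <td><a href="spam2.com">spam2</a></td>
</tr>
</table>"""

# href of the link in cell 1, plus the text of cells 2 and 3.
# A union (|) returns its matches in document order.
CELL_XPATH = "td[1]/a/@href | td[2]/text() | td[3]/text()"


def extract_rows_by_position(html):
    """Return one list of values per <tr>, selected by cell position."""
    tree = lh.fromstring(html)
    return [
        [str(value).strip() for value in row.xpath(CELL_XPATH)]
        for row in tree.xpath("//tr")
    ]


if __name__ == "__main__":
    print(extract_rows_by_position(SAMPLE_HTML))
```

Selecting by position ignores the fourth cell entirely, so it does not matter what the spam link's text says. The trade-off is that it breaks if the column order changes.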
kev answered Feb 04 '26 13:02



