
Using BeautifulSoup To Extract Specific TD Table Elements Text?

I am trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.

The HTML is structured like so:

<html>
<body>
    <table class="mainTable">
    <thead>
        <tr>
            <th>IP</th>
            <th>Country</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="hello.html">127.0.0.1<a></td>
            <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">192.168.0.1<a></td>
            <td><img src="uk.gif" /><a href="us.com">us</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">255.255.255.0<a></td>
            <td><img src="uk.gif" /><a href="br.com">br</a></td>
        </tr>
    </tbody>
</table>

The short snippet below extracts the link text from both td columns, but I only need the IP data, not the IP and country data:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))

table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)

This outputs:

127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br

What I need is the text of the IP elements (table.tbody.tr.td.a), not of the country elements (table.tbody.tr.td.img.a).

Do any experienced BeautifulSoup users have an idea of how to do this selection and extraction?

Thanks.

Pike Man asked Mar 30 '14 15:03



4 Answers

This gives you the right list:

>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]

Just apply .text to the elements of this list.

There are multiple empty <a></a> tags in the list above because the <a> tags in the HTML are not closed properly. To get rid of them, you can use

pred = lambda tag: tag.parent.find('img') is None and tag.text

and ultimately:

>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']
behzad.nouri answered Oct 12 '22 22:10


You can use a small regular expression to extract the IP addresses. BeautifulSoup combined with regular expressions works well for scraping.

import re

# Four dot-separated groups of one to three digits
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
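The pattern above accepts any four groups of one to three digits, so a string such as 999.1.1.1 would also pass. If stricter validation matters, one option is the standard library's ipaddress module. The following is only a minimal sketch: the helper name is_ipv4 and the sample strings are made up for illustration and are not part of the original answer.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text parses as an IPv4 address (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(text.strip())
        return True
    except ValueError:
        # Raised for non-addresses and for octets above 255
        return False


# Sample link texts, standing in for the row.text values from the table
candidates = ["127.0.0.1", "uk", "192.168.0.1", "999.1.1.1", ""]
ips = [c for c in candidates if is_ipv4(c)]
print(ips)
```

In the loop from the answer, the call is_ipv4(row.text) could take the place of ip_pat.match(row.text).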
salmanwahed answered Oct 12 '22 20:10


Search just the first <td> of each row in tbody:

import bs4

# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]

or maybe more readable:

rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
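For anyone who wants to try the first-cell approach without a data file, here is a self-contained sketch. It assumes a trimmed two-row copy of the question's table (unclosed <a> tags included) and explicitly selects Python's built-in "html.parser" backend; both choices are for illustration only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table, including the unclosed <a> tags
html = """
<table class="mainTable">
<thead><tr><th>IP</th><th>Country</th></tr></thead>
<tbody>
<tr><td><a href="hello.html">127.0.0.1<a></td>
    <td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1<a></td>
    <td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Only rows inside <tbody>, so the header row is skipped
rows = soup.find("tbody").find_all("tr")

# find("td") returns the first cell of each row, which holds the IP
ips = [row.find("td").get_text(strip=True) for row in rows]
print(ips)
```

Because the lookup is by cell position rather than by link, the stray empty <a> tags do not need any special handling here.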
m.wasowski answered Oct 12 '22 21:10


First, note that the HTML is not well-formed: the <a> tags are never closed. Two <a> tags are opened here:

<a href="hello.html">127.0.0.1<a>

If you print table you'll see BeautifulSoup is parsing the HTML as:

<td>
<a href="hello.html">127.0.0.1</a><a></a>
</td>
...

Each a is followed by an empty a.


Given the presence of those extra <a> tags, if you want every third <a> tag, then

for row in table.findAll("a")[::3]:
    print(row.get_text())

suffices:

127.0.0.1
192.168.0.1
255.255.255.0

On the other hand, if the occurrence of <a> tags is not so regular and you only want the <a> tags with no previous sibling (such as, but not limited to, <img>), then

for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())

would work.
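The sibling check can be tried end to end with the sketch below. It is written under a few assumptions: a trimmed two-row copy of the question's table, the built-in "html.parser" backend, and an extra test for non-empty text so that the stray empty <a> tags are dropped however the parser arranges them. find_previous_sibling is the current spelling of findPreviousSibling.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's table, including the unclosed <a> tags
html = """
<table class="mainTable">
<tbody>
<tr><td><a href="hello.html">127.0.0.1<a></td>
    <td><img src="uk.gif" /><a href="uk.com">uk</a></td></tr>
<tr><td><a href="hello.html">192.168.0.1<a></td>
    <td><img src="uk.gif" /><a href="us.com">us</a></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="mainTable")

ips = []
for link in table.find_all("a"):
    text = link.get_text(strip=True)
    # Keep links that have text and no earlier sibling tag (no <img> before them)
    if text and link.find_previous_sibling() is None:
        ips.append(text)
print(ips)
```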


If you have lxml, the criteria can be expressed more succinctly using XPath:

import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)

The XPath used above has the following meaning:

//table                            select all <table> tags
    [@class="mainTable"]           that have a class="mainTable" attribute
//                                 from these tags select descendants
  td/a                             which are td tags with a child <a> tag
    [not(preceding-sibling::img)]  such that it does not have a preceding sibling <img> tag
    /text()                        return the text of the <a> tag 

It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.

unutbu answered Oct 12 '22 22:10