Using BeautifulSoup To Extract Specific TD Table Elements Text?

Tags:

python

html

beautifulsoup

I am trying to extract IP addresses from an autogenerated HTML table using the BeautifulSoup library, and I'm having a little trouble.

The HTML is structured like so:

<html>
<body>
    <table class="mainTable">
    <thead>
        <tr>
            <th>IP</th>
            <th>Country</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td><a href="hello.html">127.0.0.1<a></td>
            <td><img src="uk.gif" /><a href="uk.com">uk</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">192.168.0.1<a></td>
            <td><img src="uk.gif" /><a href="us.com">us</a></td>
        </tr>
        <tr>
            <td><a href="hello.html">255.255.255.0<a></td>
            <td><img src="uk.gif" /><a href="br.com">br</a></td>
        </tr>
    </tbody>
</table>

The short script below extracts the link text from both td cells in each row, but I only need the IP data, not the IP and country data:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))

table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
    print(row.text)

This outputs:

127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br

What I need is the text of the IP elements (table.tbody.tr.td.a), not the country elements (table.tbody.tr.td.img.a).

Are there any experienced BeautifulSoup users who have an idea how to do this selection and extraction?

Thanks.


asked Mar 30 '14 15:03

Pike Man


4 Answers

This gives you the right list:

>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]

Just apply .text to the elements of this list.

There are several empty <a></a> tags in the list above because the <a> tags in the HTML are not closed properly. To get rid of them, you can use

pred = lambda tag: tag.parent.find('img') is None and tag.text

and ultimately:

>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']
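
For reference, the predicate approach above can be put together as a self-contained script. This is only a minimal sketch: SAMPLE_HTML and ip_links_text are hypothetical names introduced here, the sample is trimmed to two rows of the question's table, and the built-in html.parser is assumed. Other parsers repair the unclosed <a> tags differently, so the intermediate list of tags may look different with them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample trimmed from the question's table.
# The unclosed <a> tags are kept on purpose.
SAMPLE_HTML = """
<table class="mainTable">
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" /><a href="uk.com">uk</a></td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" /><a href="us.com">us</a></td>
</tr>
</tbody>
</table>
"""


def ip_links_text(html):
    """Return the text of every non-empty <a> whose parent contains no <img>."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find("tbody").find_all("a")
    result = []
    for link in links:
        text = link.get_text(strip=True)
        # Skip country links (their parent <td> holds an <img>)
        # and the empty <a> tags created by the malformed markup.
        if link.parent.find("img") is None and text:
            result.append(text)
    return result


if __name__ == "__main__":
    print(ip_links_text(SAMPLE_HTML))
```

Passing the parser name explicitly keeps the result from depending on which parsers happen to be installed.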


answered Oct 12 '22 22:10

behzad.nouri

You can use a small regular expression to pick out the IP addresses. BeautifulSoup combined with regular expressions works well for scraping.

import re

# Matches four dot-separated groups of 1-3 digits, e.g. 127.0.0.1
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
    if ip_pat.match(row.text):
        print(row.text)
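
Note that \d{1,3} also accepts out-of-range octets such as 999.1.1.1. If stricter checking matters, one option (not part of the answer above) is the standard library's ipaddress module. A minimal sketch, where is_ipv4 and candidates are hypothetical names used only for illustration:

```python
import ipaddress


def is_ipv4(text):
    """Return True if text is a valid dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text.strip())
    except ValueError:
        # AddressValueError is a subclass of ValueError.
        return False
    return True


# Hypothetical link texts, as they might come from row.text in a loop over <a> tags.
candidates = ["127.0.0.1", "uk", "192.168.0.1", "999.1.1.1", ""]
valid = [c for c in candidates if is_ipv4(c)]
print(valid)
```

Here "999.1.1.1" would pass the regular expression but is rejected by ipaddress.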

answered Oct 12 '22 20:10

salmanwahed

Search only the first <td> of each row in tbody:

import bs4

# html should contain the page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]

or maybe more readable:

rows = bs4.BeautifulSoup(html).find('tbody').find_all('tr')
iplist = [row.find('td').getText() for row in rows]
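
A self-contained sketch of the same idea, assuming the markup is held in a string called html_text (a hypothetical name, with content trimmed from the question) and parsed with the built-in html.parser. The helper first_column is also hypothetical. It restricts the search to tbody so the header row is skipped, and strips surrounding whitespace from each cell:

```python
from bs4 import BeautifulSoup

# Hypothetical sample trimmed from the question's table.
html_text = """
<table class="mainTable">
<thead>
<tr><th>IP</th><th>Country</th></tr>
</thead>
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" /><a href="uk.com">uk</a></td>
</tr>
<tr>
<td><a href="hello.html">255.255.255.0<a></td>
<td><img src="uk.gif" /><a href="br.com">br</a></td>
</tr>
</tbody>
</table>
"""


def first_column(html):
    """Return the stripped text of the first <td> in every body row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="mainTable")
    rows = table.find("tbody").find_all("tr")
    # row.find("td") returns only the first cell of the row.
    return [row.find("td").get_text(strip=True) for row in rows]


if __name__ == "__main__":
    print(first_column(html_text))
```

Because this relies on column position rather than on the <a> tags, the unclosed <a> tags do not affect the result.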

answered Oct 12 '22 21:10

m.wasowski

First, note that the HTML is not well-formed: the <a> tags are never closed. Two <a> tags are opened here:

<a href="hello.html">127.0.0.1<a>

If you print table you'll see BeautifulSoup is parsing the HTML as:

<td>
<a href="hello.html">127.0.0.1</a><a></a>
</td>
...

Each a is followed by an empty a.

Given those extra <a> tags, each row contains three <a> tags (the IP link, the empty tag, and the country link). If you want every third <a> tag, starting with the first, then

for row in table.findAll("a")[::3]:
    print(row.get_text())

suffices:

127.0.0.1
192.168.0.1
255.255.255.0

On the other hand, if the <a> tags do not occur so regularly and you only want the <a> tags that have no previous sibling (such as, but not limited to, <img>), then

for row in table.findAll("a"):
    sibling = row.findPreviousSibling()
    if sibling is None:
        print(row.get_text())

would work.

If you have lxml, the criteria can be expressed more succinctly using XPath:

import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)

The XPath used above has the following meaning:

//table                            select all <table> tags
    [@class="mainTable"]           that have a class="mainTable" attribute
//                                 from these tags select descendants
  td/a                             which are td tags with a child <a> tag
    [not(preceding-sibling::img)]  such that it does not have a preceding sibling <img> tag
    /text()                        return the text of the <a> tag

It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.