I trying to extract IP Addresses from a autogenerated HTML table using the BeautifulSoup library and im having a little trouble.
The HTML is structured like so:
<html>
<body>
<table class="mainTable">
<thead>
<tr>
<th>IP</th>
<th>Country</th>
</tr>
</thead>
<tbody>
<tr>
<td><a href="hello.html">127.0.0.1<a></td>
<td><img src="uk.gif" /><a href="uk.com">uk</a></td>
</tr>
<tr>
<td><a href="hello.html">192.168.0.1<a></td>
<td><img src="uk.gif" /><a href="us.com">us</a></td>
</tr>
<tr>
<td><a href="hello.html">255.255.255.0<a></td>
<td><img src="uk.gif" /><a href="br.com">br</a></td>
</tr>
</tbody>
</table>
The small code below extracts the text from the two td rows but i only need the IP data, not the IP and Country data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("data.htm"))
table = soup.find('table', {'class': 'mainTable'})
for row in table.findAll("a"):
print(row.text)
this outputs:
127.0.0.1
uk
192.168.0.1
us
255.255.255.0
br
What i need is the IP table.tbody.tr.td.a
elements text and not the country table.tbody.tr.td.img.a
elements.
Are there any experienced users of BeautifulSoup who would have any inkling on how to to this selection and extraction.
Thanks.
BeautifulSoup has a limited support for CSS selectors, but covers most commonly used ones. Use select() method to find multiple elements and select_one() to find a single element.
Step-by-step Approach Step 1: The first step will be for scraping we need to import beautifulsoup module and get the request of the website we need to import the requests module. Step 2: The second step will be to request the URL call get method.
This gives you the right list:
>>> pred = lambda tag: tag.parent.find('img') is None
>>> list(filter(pred, soup.find('tbody').find_all('a')))
[<a href="hello.html">127.0.0.1<a></a></a>, <a></a>, <a href="hello.html">192.168.0.1<a></a></a>, <a></a>, <a href="hello.html">255.255.255.0<a></a></a>, <a></a>]
just apply .text
on the elements of this list.
There are multiple empty <a></a>
tags in above list because the <a>
tags in the html are not closed properly. To get rid of them, you may use
pred = lambda tag: tag.parent.find('img') is None and tag.text
and ultimately:
>>> [tag.text for tag in filter(pred, soup.find('tbody').find_all('a'))]
['127.0.0.1', '192.168.0.1', '255.255.255.0']
You can use a little regular expression for extracting the ip address. BeautifulSoup with regular expression is a nice combination for scraping.
ip_pat = re.compile(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$")
for row in table.findAll("a"):
if ip_pat.match(row.text):
print(row.text)
Search just first <td>
for each row in tbody
:
# html should contain page content:
[row.find('td').getText() for row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]
or maybe more readable:
rows = [row in bs4.BeautifulSoup(html).find('tbody').find_all('tr')]
iplist = [row.find('td').getText() for row in rows]
First note that the HTML is not well-formed. It is not closing the a
tags. There are two <a>
tags started here:
<a href="hello.html">127.0.0.1<a>
If you print table
you'll see BeautifulSoup is parsing the HTML as:
<td>
<a href="hello.html">127.0.0.1</a><a></a>
</td>
...
Each a
is followed by an empty a
.
Given the presence of those extra <a>
tags, if you want every third <a>
tag, then
for row in table.findAll("a")[::3]:
print(row.get_text())
suffices:
127.0.0.1
192.168.0.1
255.255.255.0
On the other hand, if the occurrence of <a>
tags is not so regular and you only want that <a>
tags with no previous sibling (such as, but not limited to <img>
), then
for row in table.findAll("a"):
sibling = row.findPreviousSibling()
if sibling is None:
print(row.get_text())
would work.
If you have lxml, the criteria can be expressed more succinctly using XPath:
import lxml.html as LH
doc = LH.parse("data.htm")
ips = doc.xpath('//table[@class="mainTable"]//td/a[not(preceding-sibling::img)]/text()')
print(ips)
The XPath used above has the following meaning:
//table select all <table> tags
[@class="mainTable"] that have a class="mainTable" attribute
// from these tags select descendants
td/a which are td tags with a child <a> tag
[not(preceding-sibling::img)] such that it does not have a preceding sibling <img> tag
/text() return the text of the <a> tag
It does take a little time to learn XPath, but once you learn it you may never want to use BeautifulSoup again.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With