I'm scraping a web page to get a table, using Beautiful Soup and pandas. One of the columns contains some URLs. When I pass the HTML to pandas, the href attributes are lost.
Is there any way of preserving the URL link, just for that column?
Example data (edited to better suit the real case):
<html>
<body>
<table>
<tr>
<td>customer</td>
<td>country</td>
<td>area</td>
<td>website link</td>
</tr>
<tr>
<td>IBM</td>
<td>USA</td>
<td>EMEA</td>
<td><a href="http://www.ibm.com">IBM site</a></td>
</tr>
<tr>
<td>CISCO</td>
<td>USA</td>
<td>EMEA</td>
<td><a href="http://www.cisco.com">cisco site</a></td>
</tr>
<tr>
<td>unknown company</td>
<td>USA</td>
<td>EMEA</td>
<td></td>
</tr>
</table>
</body>
</html>
My Python code:
import pandas as pd
from bs4 import BeautifulSoup

file = open(url, "r")
soup = BeautifulSoup(file, 'lxml')
parsed_table = soup.find_all('table')[1]
df = pd.read_html(str(parsed_table), encoding='utf-8')[0]
df
Output (exported to CSV):
customer;country;area;website
IBM;USA;EMEA;IBM site
CISCO;USA;EMEA;cisco site
unknown company;USA;EMEA;
The df output is OK, but the link is lost. I need to preserve the link, or at least the URL.
Any hint?
pd.read_html assumes the data you are interested in is in the text, not the tag attributes. However, it isn't hard to scrape the table yourself:
import bs4 as bs
import pandas as pd

with open(url, "r") as f:
    soup = bs.BeautifulSoup(f, 'lxml')

parsed_table = soup.find_all('table')[1]
data = [[td.a['href'] if td.find('a') else
         ''.join(td.stripped_strings)
         for td in row.find_all('td')]
        for row in parsed_table.find_all('tr')]
df = pd.DataFrame(data[1:], columns=data[0])
print(df)
yields
customer country area website link
0 IBM USA EMEA http://www.ibm.com
1 CISCO USA EMEA http://www.cisco.com
2 unknown company USA EMEA
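As a side note, pandas 1.5 added an extract_links parameter to read_html that keeps the href alongside the cell text, so on a recent pandas version you may not need to walk the table manually at all. A minimal sketch on the sample table (assuming pandas >= 1.5):

```python
import io
import pandas as pd

html = """
<table>
<tr><td>customer</td><td>website link</td></tr>
<tr><td>IBM</td><td><a href="http://www.ibm.com">IBM site</a></td></tr>
<tr><td>unknown company</td><td></td></tr>
</table>
"""

# extract_links="body" turns every body cell into a (text, href) tuple;
# href is None for cells without a link.
df = pd.read_html(io.StringIO(html), extract_links="body")[0]
print(df.iloc[1, 1])  # ('IBM site', 'http://www.ibm.com')
```

Because the sample table uses td cells only (no th), pandas does not infer a header row here, so the first table row appears as data row 0.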
Just check if the tag exists this way:
import bs4 as bs
import numpy as np
import pandas as pd

with open(url, "r") as f:
    sp = bs.BeautifulSoup(f, 'lxml')

tb = sp.find_all('table')[56]
df = pd.read_html(str(tb), encoding='utf-8', header=0)[0]
# Note: iterating over the <a> tags assumes every table row contains exactly one link.
df['href'] = [np.where(tag.has_attr('href'), tag.get('href'), "no link") for tag in tb.find_all('a')]
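One caveat: tb.find_all('a') only yields the rows that actually contain a link, so if some rows have no <a> tag the list ends up shorter than the DataFrame and the assignment fails. A per-row variant keeps the two aligned (sketched here on the sample table, with a "no link" placeholder as in the snippet above):

```python
import io
import bs4 as bs
import pandas as pd

html = """
<table>
<tr><td>customer</td><td>website link</td></tr>
<tr><td>IBM</td><td><a href="http://www.ibm.com">IBM site</a></td></tr>
<tr><td>unknown company</td><td></td></tr>
</table>
"""

tb = bs.BeautifulSoup(html, "lxml").find("table")
df = pd.read_html(io.StringIO(str(tb)), header=0)[0]

# Walk the body rows in table order, so every DataFrame row gets an entry.
hrefs = []
for tr in tb.find_all("tr")[1:]:  # skip the header row
    a = tr.find("a")
    hrefs.append(a["href"] if a is not None else "no link")
df["href"] = hrefs
print(df["href"].tolist())  # ['http://www.ibm.com', 'no link']
```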
Here is another way to do it if you have more than one link to grab from the HTML table. Instead of a list comprehension I would rather go with separate for loops, so the code is more readable to those who are new to Python, and it is easier to adjust the code or handle errors if they emerge. I hope it will help someone.
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(html, "lxml")
table = soup.find('table')
thead = table.find('thead')
column_names = [th.text.strip() for th in thead.find_all('th')]

data = []
for row in table.find_all('tr'):
    row_data = []
    for td in row.find_all('td'):
        td_check = td.find('a')
        if td_check is not None:
            link = td.a['href']
            row_data.append(link)
        else:
            not_link = ''.join(td.stripped_strings)
            if not_link == '':
                not_link = None
            row_data.append(not_link)
    data.append(row_data)

df = pd.DataFrame(data[1:], columns=column_names)
df_dict = df.to_dict('records')
for row in df_dict:
    print(row)
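If you need both the link text and the URL, the same loop idea can write the href into a separate column. A sketch on the sample table (the " url" column-name suffix is just an illustrative choice, and the sample table has no thead, so the first tr serves as the header):

```python
import bs4 as bs
import pandas as pd

html = """
<table>
<tr><td>customer</td><td>website link</td></tr>
<tr><td>IBM</td><td><a href="http://www.ibm.com">IBM site</a></td></tr>
<tr><td>unknown company</td><td></td></tr>
</table>
"""

rows = bs.BeautifulSoup(html, "lxml").find("table").find_all("tr")
column_names = [td.get_text(strip=True) for td in rows[0].find_all("td")]

data = []
for tr in rows[1:]:
    record = {}
    for name, td in zip(column_names, tr.find_all("td")):
        # Keep the visible text in the original column...
        record[name] = td.get_text(strip=True) or None
        # ...and the URL (if any) in a companion "<name> url" column.
        a = td.find("a")
        record[name + " url"] = a["href"] if a is not None else None
    data.append(record)

df = pd.DataFrame(data)
print(df.columns.tolist())
```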