I would like to load a .csv
file from the web and convert it into a pandas.DataFrame
.
Here's my target page where I want to find a .csv
file:
https://vincentarelbundock.github.io/Rdatasets/datasets.html
How can I load a .csv
file of corresponding items from the webpage and convert into a panda.DataFrame
?
In addition it would be great if I could get the addresses of the .csv
files from the web page also.
This would allow me to create a function to convert an item name from the target page, that would return the .csv
file address like:
def data(item):
file = 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/'+str(item)+'.csv'
However, the addresses of the csv files in the webpage are not all the same pattern.
For example,
https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv
Quite a lot of files are in different directory, so I need to search 'items' and get the address of the corresponding csv file.
Pandas can read the csv
directly from the http link:
Example;
df = pd.read_csv(
'https://vincentarelbundock.github.io/Rdatasets/'
'csv/datasets/OrchardSprays.csv')
print(df)
Results:
Unnamed: 0 decrease rowpos colpos treatment
0 1 57 1 1 D
1 2 95 2 1 E
.. ... ... ... ... ...
62 63 3 7 8 A
63 64 19 8 8 C
[64 rows x 5 columns]
Getting links via scraping:
To get the links themselves from the front page, we can also use pandas
to do web scraping for data. Something like:
base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
header=0, flavor='html5lib')[0]
Will return the data in the table on the page. Unfortunately for our uses here, this does not work because pandas
grabs the text on the page, not the links.
Monkey Patching the scraper to get Links:
To get the URLs, we can monkey patch the library like:
def _text_getter(self, obj):
text = obj.text
if text.strip() in ('CSV', 'DOC'):
try:
text = base_url + obj.find('a')['href']
except (TypeError, KeyError):
pass
return text
from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp
bsp._text_getter = _text_getter
Test Code:
base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'
import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
header=0, flavor='html5lib')[0]
for row in df.head().iterrows():
print('%-14s: %s' % (row[1].Item, row[1].csv))
Results:
AirPassengers: https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2 : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With