 

Loading web data file to pandas dataframe

Tags: python, pandas

I would like to load a .csv file from the web and convert it into a pandas.DataFrame.

Here's my target page where I want to find a .csv file:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

How can I load the .csv file for a given item from the web page and convert it into a pandas.DataFrame?

In addition, it would be great if I could also get the addresses of the .csv files from the web page.

That would let me write a function that takes an item name from the target page and returns the address of its .csv file, like:

def data(item):
    return 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/' + str(item) + '.csv'

However, the addresses of the .csv files on the web page do not all follow the same pattern.

For example,

https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv 
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv

Quite a lot of the files are in different directories, so I need to search for each item and get the address of its corresponding .csv file.

asked Dec 23 '22 by Wookeun Lee


1 Answer

Pandas can read the CSV directly from the HTTP link.

Example:

import pandas as pd

df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv')
print(df)

Results:

    Unnamed: 0  decrease  rowpos  colpos treatment
0            1        57       1       1         D
1            2        95       2       1         E
..         ...       ...     ...     ...       ...
62          63         3       7       8         A
63          64        19       8       8         C

[64 rows x 5 columns]
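
The Unnamed: 0 column is the R row names that Rdatasets stores in the first column of every file. If you would rather have it as the index than as a data column, read_csv's index_col parameter takes care of that:

import pandas as pd

# Same read as above, but use the first (unnamed) column -- the R row
# names -- as the index instead of keeping it as 'Unnamed: 0'.
df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv',
    index_col=0)
print(df.head())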

Getting links via scraping:

To get the links themselves from the front page, we can also use pandas to scrape the page. Something like:

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'

import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]

This will return the data from the table on the page. Unfortunately for our purposes here, it does not solve the problem, because pandas grabs the text on the page, not the links.
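
If you would rather not touch pandas internals at all, the links can also be pulled directly with requests and BeautifulSoup. A minimal sketch, assuming the index page still renders one link per 'CSV' cell (the same assumption the patch below makes):

import requests
from bs4 import BeautifulSoup

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
html = requests.get(base_url + 'datasets.html').text
soup = BeautifulSoup(html, 'html.parser')

# Map each file name (taken from the link target) to its absolute URL.
# Note: file names can repeat across packages, in which case later
# entries overwrite earlier ones in this simple dict.
csv_links = {}
for a in soup.find_all('a', string='CSV'):
    href = a.get('href', '')
    if href.endswith('.csv'):
        name = href.rsplit('/', 1)[-1][:-len('.csv')]
        csv_links[name] = base_url + href

print(csv_links.get('cars'))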

Monkey Patching the scraper to get Links:

To get the URLs, we can monkey patch the library like:

from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp

def _text_getter(self, obj):
    # For cells whose text is just 'CSV' or 'DOC', return the link
    # target (as an absolute URL) instead of the cell text.
    text = obj.text
    if text.strip() in ('CSV', 'DOC'):
        try:
            text = base_url + obj.find('a')['href']
        except (TypeError, KeyError):
            pass
    return text

# Caveat: _BeautifulSoupHtml5LibFrameParser is a private pandas class,
# so this patch can break between pandas versions.
bsp._text_getter = _text_getter

Test Code:

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'

import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]

for row in df.head().iterrows():
    print('%-14s: %s' % (row[1].Item, row[1].csv))

Results:

AirPassengers: https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales      : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD          : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2          : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
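
With the patched table in hand, the lookup function from the question reduces to a search over the scraped DataFrame. A minimal sketch, assuming the Item and csv columns shown above:

def data(item):
    # Find the row for this item and return its csv URL
    # (None if the item is not in the table).
    match = df.loc[df['Item'] == item, 'csv']
    return match.iloc[0] if not match.empty else None

print(data('cars'))
# https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv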
answered Jan 06 '23 by Stephen Rauch