 

Loading web data file to pandas dataframe

Tags: python, pandas

I would like to load a .csv file from the web and convert it into a pandas.DataFrame.

Here's my target page where I want to find a .csv file:

https://vincentarelbundock.github.io/Rdatasets/datasets.html

How can I load the .csv file for a given item from the web page and convert it into a pandas.DataFrame?

In addition, it would be great if I could also get the addresses of the .csv files from the web page.

That would let me write a function that takes an item name from the target page and returns the address of its .csv file, like:

def data(item):
    return 'https://vincentarelbundock.github.io/Rdatasets/csv/datasets/' + str(item) + '.csv'

However, the addresses of the .csv files on the web page do not all follow the same pattern.

For example,

https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Cuckoo.csv 
https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv

Quite a lot of the files are in different directories, so I need to search for each item and get the address of its corresponding .csv file.

asked Dec 23 '22 by Wookeun Lee


1 Answer

Pandas can read the CSV directly from the HTTP link.

Example:

import pandas as pd

df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv')
print(df)

Results:

    Unnamed: 0  decrease  rowpos  colpos treatment
0            1        57       1       1         D
1            2        95       2       1         E
..         ...       ...     ...     ...       ...
62          63         3       7       8         A
63          64        19       8       8         C

[64 rows x 5 columns]
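
The Unnamed: 0 column is the R row names that Rdatasets stores in the first column of every file. If you would rather have it as the index than as a data column, read_csv's index_col parameter takes care of that:

import pandas as pd

# Same read as above, but use the first (unnamed) column -- the R row
# names -- as the index instead of keeping it as 'Unnamed: 0'.
df = pd.read_csv(
    'https://vincentarelbundock.github.io/Rdatasets/'
    'csv/datasets/OrchardSprays.csv',
    index_col=0)
print(df.head())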

Getting links via scraping:

To get the links themselves from the front page, we can also use pandas to scrape the page. Something like:

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'

import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]

This will return the data from the table on the page. Unfortunately for our purposes here, it does not solve the problem, because pandas grabs the text on the page, not the links.
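
If you would rather not touch pandas internals at all, the links can also be pulled directly with requests and BeautifulSoup. A minimal sketch, assuming the index page still renders one link per 'CSV' cell (the same assumption the patch below makes):

import requests
from bs4 import BeautifulSoup

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
html = requests.get(base_url + 'datasets.html').text
soup = BeautifulSoup(html, 'html.parser')

# Map each file name (taken from the link target) to its absolute URL.
# Note: file names can repeat across packages, in which case later
# entries overwrite earlier ones in this simple dict.
csv_links = {}
for a in soup.find_all('a', string='CSV'):
    href = a.get('href', '')
    if href.endswith('.csv'):
        name = href.rsplit('/', 1)[-1][:-len('.csv')]
        csv_links[name] = base_url + href

print(csv_links.get('cars'))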

Monkey Patching the scraper to get Links:

To get the URLs, we can monkey patch the library like:

from pandas.io.html import _BeautifulSoupHtml5LibFrameParser as bsp

def _text_getter(self, obj):
    # For cells whose text is just 'CSV' or 'DOC', return the link
    # target (as an absolute URL) instead of the cell text.
    text = obj.text
    if text.strip() in ('CSV', 'DOC'):
        try:
            text = base_url + obj.find('a')['href']
        except (TypeError, KeyError):
            pass
    return text

# Caveat: _BeautifulSoupHtml5LibFrameParser is a private pandas class,
# so this patch can break between pandas versions.
bsp._text_getter = _text_getter

Test Code:

base_url = 'https://vincentarelbundock.github.io/Rdatasets/'
url = base_url + 'datasets.html'

import pandas as pd
df = pd.read_html(url, attrs={'class': 'dataframe'},
                  header=0, flavor='html5lib')[0]

for row in df.head().iterrows():
    print('%-14s: %s' % (row[1].Item, row[1].csv))

Results:

AirPassengers: https://vincentarelbundock.github.io/Rdatasets/csv/datasets/AirPassengers.csv
BJsales      : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BJsales.csv
BOD          : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/BOD.csv
CO2          : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/CO2.csv
Formaldehyde : https://vincentarelbundock.github.io/Rdatasets/csv/datasets/Formaldehyde.csv
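
With the patched table in hand, the lookup function from the question reduces to a search over the scraped DataFrame. A minimal sketch, assuming the Item and csv columns shown above:

def data(item):
    # Find the row for this item and return its csv URL
    # (None if the item is not in the table).
    match = df.loc[df['Item'] == item, 'csv']
    return match.iloc[0] if not match.empty else None

print(data('cars'))
# https://vincentarelbundock.github.io/Rdatasets/csv/datasets/cars.csv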
answered Jan 06 '23 by Stephen Rauch