
Sample datasets in Pandas

When using R it's handy to load "practice" datasets using

data(iris) 

or

data(mtcars) 

Is there something similar for Pandas? I know I can load using any other method, just curious if there's anything builtin.

canyon289 asked Feb 09 '15 19:02


1 Answer

Since I originally wrote this answer, I have updated it with the many ways that are now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt seems to be the only option.

Seaborn

The brilliant plotting package seaborn has several built-in sample data sets.

import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Pandas

If you do not want to import seaborn, but still want to access its sample data sets, you can use @andrewwowens's approach for the seaborn sample data:

import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

Note that sns.load_dataset() modifies the column types of sample data sets that contain categorical columns, so the result might differ from what you get by reading the URL directly. The iris and tips sample data sets are also available in the pandas GitHub repo.
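To see what that type difference looks like, here is a minimal sketch that converts a string column to a categorical by hand. The inline CSV is a made-up stand-in for a file fetched by URL (so the snippet runs offline), and the column name `day` with its category order is an assumption for illustration; which columns seaborn actually converts depends on the data set and the seaborn version.

```python
import io

import pandas as pd

# Tiny inline stand-in for a CSV fetched by URL (made-up rows, not real data).
csv_text = """total_bill,day
10.0,Sun
12.5,Thur
8.0,Sat
"""

# Reading the CSV directly leaves `day` as a plain string column.
raw = pd.read_csv(io.StringIO(csv_text))

# Convert the column to a categorical with an explicit category order
# (the order here is an assumption chosen for the example).
converted = raw.copy()
converted["day"] = pd.Categorical(
    converted["day"], categories=["Thur", "Fri", "Sat", "Sun"]
)

print(raw["day"].dtype.name)
print(converted["day"].dtype.name)
```

Comparing `df.dtypes` for a frame loaded both ways is a quick check for whether any such conversion happened.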

R sample datasets

Since pd.read_csv() can read a CSV file directly from a URL, it is possible to access all of R's sample data sets by copying the URLs from this R data set repository.

Additional ways of loading the R sample data sets include statsmodels

import statsmodels.api as sm

iris = sm.datasets.get_rdataset('iris').data

and PyDataset

from pydataset import data

iris = data('iris')

scikit-learn

By default, scikit-learn returns sample data as NumPy arrays rather than a pandas data frame.

from sklearn.datasets import load_iris

iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
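If you want a pandas data frame anyway, the pieces listed in the comments above can be assembled by hand. This is one possible sketch rather than the only way to do it; the variable names and the `species` column name are my own choices, not something scikit-learn provides.

```python
import pandas as pd
from sklearn.datasets import load_iris

bunch = load_iris()

# Numerical columns, labelled with the feature names shipped with the data.
iris_df = pd.DataFrame(bunch.data, columns=bunch.feature_names)

# Map the integer targets back to species names via `target_names`.
# The column name "species" is an arbitrary choice for this example.
iris_df["species"] = pd.Categorical.from_codes(bunch.target, bunch.target_names)

print(iris_df.head())
```

Recent scikit-learn versions should also accept load_iris(as_frame=True), which returns the data as pandas objects directly.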

Quilt

Quilt is a dataset manager. It includes many common sample datasets, such as several from the uciml sample repository. The quick start page shows how to install and import the iris data set:

# In your terminal
$ pip install quilt
$ quilt install uciml/iris

After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

import quilt.data.uciml.iris as ir

iris = ir.tables.iris()

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Quilt also supports dataset versioning and includes a short description of each dataset.

joelostblom answered Oct 12 '22 09:10