Sample datasets in Pandas

Tags: python, pandas, dataset, sample-data

When using R it's handy to load "practice" datasets using

data(iris)

or

data(mtcars)

Is there something similar for Pandas? I know I can load the data using any other method; I'm just curious whether anything is built in.

asked Feb 09 '15 19:02

canyon289

1 Answer

Since I originally wrote this answer, I have updated it with the many ways now available for accessing sample data sets in Python. Personally, I tend to stick with whatever package I am already using (usually seaborn or pandas). If you need offline access, installing the data set with Quilt is one option; the data sets bundled with scikit-learn, such as iris, also load without a network connection.

Seaborn

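The brilliant plotting package seaborn provides several sample data sets. Note that sns.load_dataset() fetches them from an online repository, so it needs an internet connection the first time a data set is loaded.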

import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Pandas

If you do not want to import seaborn, but still want to access its sample data sets (https://github.com/mwaskom/seaborn-data), you can use @andrewwowens's approach and read the CSV file directly:

import pandas as pd

iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')

Note that sns.load_dataset() modifies the column type of sample data sets that contain categorical columns, so reading the CSV from the URL directly might not give exactly the same result. The iris and tips sample data sets are also available in the pandas GitHub repo.
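To make the dtype difference concrete, here is a minimal sketch. It reads a few iris rows from an inline CSV string (copied from the output above and standing in for the downloaded file, so it runs offline) and converts one column to a categorical by hand. Which columns sns.load_dataset() converts depends on the data set and the seaborn version; the choice of `species` here is only an assumption for illustration.

```python
import io

import pandas as pd

# First rows of the iris data shown above, inlined so the sketch runs offline.
CSV_TEXT = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
"""

# read_csv infers text columns as plain strings, not categoricals.
raw = pd.read_csv(io.StringIO(CSV_TEXT))

# Convert by hand if downstream code expects a categorical column.
# `species` is an illustrative choice, not necessarily what seaborn converts.
converted = raw.assign(species=raw["species"].astype("category"))

print(raw.dtypes["species"])
print(converted.dtypes["species"])
```

The same `astype("category")` call works on a frame read from the seaborn-data URL.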

R sample datasets

Since pd.read_csv() can read a CSV file straight from a URL, it is possible to access all of R's sample data sets by copying the URLs from the Rdatasets repository (https://vincentarelbundock.github.io/Rdatasets/datasets.html).

Additional ways of loading the R sample data sets include statsmodels

import statsmodels.api as sm

iris = sm.datasets.get_rdataset('iris').data

and PyDataset

from pydataset import data

iris = data('iris')
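If you would rather stay with plain pandas, the URL-copying approach can be wrapped in a small helper. This is only a sketch: `rdataset_url` and `load_rdataset` are hypothetical names, and the URL layout (`.../csv/<package>/<name>.csv`) is an assumption about how the Rdatasets repository is organised, so check the repository index if a data set fails to load.

```python
import pandas as pd

# Assumed URL layout of the Rdatasets repository; verify against its index page.
RDATASETS_BASE = "https://vincentarelbundock.github.io/Rdatasets/csv"


def rdataset_url(name, package="datasets"):
    """Build the CSV URL for an R data set (hypothetical helper)."""
    return f"{RDATASETS_BASE}/{package}/{name}.csv"


def load_rdataset(name, package="datasets"):
    """Download an R data set as a DataFrame (needs network access)."""
    # index_col=0 assumes the first CSV column holds R's row names.
    return pd.read_csv(rdataset_url(name, package), index_col=0)


print(rdataset_url("mtcars"))
# Call load_rdataset("mtcars") when a network connection is available.
```

Nothing is downloaded until `load_rdataset` is called, so the helper can be imported offline.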

scikit-learn

By default, scikit-learn returns sample data as NumPy arrays rather than a pandas data frame.

from sklearn.datasets import load_iris

iris = load_iris()
# `iris.data` holds the numerical values
# `iris.feature_names` holds the numerical column names
# `iris.target` holds the categorical (species) values (as ints)
# `iris.target_names` holds the unique categorical names
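If you want a data frame anyway, the pieces listed in the comments above can be assembled by hand. A minimal sketch; the names `iris_df` and `species` are my own choices:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Build a data frame from the numeric array and its column names.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer targets to species names via NumPy array indexing.
iris_df["species"] = iris.target_names[iris.target]

print(iris_df.head())
```

scikit-learn 0.23 and later also accept `load_iris(as_frame=True)`, which returns pandas objects directly.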

Quilt

Quilt is a dataset manager. It includes many common sample datasets, such as several from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The quick start page (https://docs.quiltdata.com/get-started/quick-start) shows how to install and import the iris data set:

# In your terminal
$ pip install quilt
$ quilt install uciml/iris

After installing a dataset, it is accessible locally, so this is the best option if you want to work with the data offline.

import quilt.data.uciml.iris as ir

iris = ir.tables.iris()

   sepal_length  sepal_width  petal_length  petal_width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Quilt also supports dataset versioning and includes a short description of each dataset (for example, https://quiltdata.com/package/uciml/iris/).

answered Oct 12 '22 09:10

joelostblom
