Are there any example data sets for Python?

Tags: python, dataset

For quick testing, debugging, creating portable examples, and benchmarking, R ships with a large number of data sets (in the base R datasets package). The command library(help="datasets") at the R prompt describes nearly 100 historical datasets, each of which has an associated description and metadata.

Is there anything like this for Python?

asked May 16 '13 05:05

a different ben

2 Answers

You can use the rpy2 package to access all R datasets from Python.

Set up the interface:

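>>> from rpy2.robjects import r, pandas2ri
>>> def data(name):
...     # convert the named R dataset to a pandas DataFrame
...     # (ri2py is the rpy2 2.x name; rpy2 3.x renamed it to rpy2py)
...     return pandas2ri.ri2py(r[name])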

Then call data() with the name of any available dataset (just like in R):

>>> df = data('iris')
>>> df.describe()
       Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000

To see a list of the available datasets with a description for each:

>>> print(r.data())

Note: rpy2 requires an R installation with the R_HOME environment variable set, and pandas must be installed as well.

UPDATE

I just created PyDataset, a simple module that makes loading a dataset from Python as easy as it is in R (it does not require an R installation, only pandas).

To start using it, install the module:

$ pip install pydataset

Then just load up any dataset you wish (currently around 757 datasets available):

from pydataset import data
titanic = data('titanic')

answered Sep 22 '22 18:09

Aziz Alto

There are also datasets available from the Scikit-Learn library.

from sklearn import datasets

There are multiple datasets within this package. Some of the toy datasets are:

load_boston()           Load and return the Boston house-prices dataset (regression). Removed in scikit-learn 1.2.
load_iris()             Load and return the iris dataset (classification).
load_diabetes()         Load and return the diabetes dataset (regression).
load_digits([n_class])  Load and return the digits dataset (classification).
load_linnerud()         Load and return the linnerud dataset (multivariate regression).
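To get something close to the R workflow shown in the other answer, these loaders can hand back a pandas DataFrame. The following is a minimal sketch, assuming scikit-learn 0.23 or later (where the loaders accept as_frame=True) and that pandas is installed; the helper name load_toy_dataset is made up for illustration and is not part of scikit-learn.

```python
from sklearn import datasets


def load_toy_dataset(loader):
    """Return a scikit-learn toy dataset as a single pandas DataFrame.

    `loader` is one of the load_* functions, e.g. datasets.load_iris.
    This helper is hypothetical; it only wraps the as_frame=True option.
    """
    # as_frame=True makes the returned Bunch carry a `frame` attribute
    # holding the feature columns together with the target column(s).
    bunch = loader(as_frame=True)
    return bunch.frame


iris = load_toy_dataset(datasets.load_iris)

# Summary statistics, similar to df.describe() on the R iris dataset.
print(iris.describe())
```

The same helper should work with load_diabetes, load_digits and load_linnerud. Note that the column names differ from R's (for example 'sepal length (cm)' rather than Sepal.Length), and the class label arrives as an integer column named target.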

answered Sep 22 '22 18:09

tmthydvnprt
