How do I create my own dataset for use in scikit-learn?

Tags: python, csv, machine-learning, dataset, scikit-learn

I want to create my own dataset and use it in scikit-learn. Scikit-learn ships with some datasets, such as the Boston Housing dataset (.csv), which can be loaded like this:

from sklearn import datasets
boston = datasets.load_boston()  # available up to scikit-learn 1.1; removed in 1.2

The code below then gets the data and target of this dataset:

X = boston.data
y = boston.target

How can I create my own dataset so that it can be used in the same way? Any answer is appreciated, thanks!

asked Feb 24 '17 07:02

Yuedong HU

1 Answer

Here's a quick and dirty way to achieve what you intend:

`my_datasets.py`

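import csv

import numpy as np
from sklearn.utils import Bunch


def load_my_fancy_dataset():
    """Load my_fancy_dataset.csv into a Bunch, like scikit-learn's own loaders."""
    # The path is relative to the current working directory;
    # newline='' is what the csv module recommends for files passed to csv.reader
    with open('my_fancy_dataset.csv', newline='') as csv_file:
        data_reader = csv.reader(csv_file)
        # The header row holds the column names; the last column is the label
        feature_names = next(data_reader)[:-1]
        data = []
        target = []
        for row in data_reader:
            features = row[:-1]
            label = row[-1]
            data.append([float(num) for num in features])
            target.append(int(label))

    # Convert the accumulated lists to arrays, as estimators expect
    data = np.array(data)
    target = np.array(target)
    return Bunch(data=data, target=target, feature_names=feature_names)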

`my_fancy_dataset.csv`

feature_1,feature_2,feature_3,class_label
5.9,1203,0.69,2
7.2,902,0.52,0
6.3,143,0.44,1
-2.6,291,0.15,1
1.8,486,0.37,0

Demo

In [12]: import my_datasets

In [13]: mfd = my_datasets.load_my_fancy_dataset()

In [14]: X = mfd.data

In [15]: y = mfd.target

In [16]: X
Out[16]: 
array([[ 5.900e+00,  1.203e+03,  6.900e-01],
       [ 7.200e+00,  9.020e+02,  5.200e-01],
       [ 6.300e+00,  1.430e+02,  4.400e-01],
       [-2.600e+00,  2.910e+02,  1.500e-01],
       [ 1.800e+00,  4.860e+02,  3.700e-01]])

In [17]: y
Out[17]: array([2, 0, 1, 1, 0])

In [18]: mfd.feature_names
Out[18]: ['feature_1', 'feature_2', 'feature_3']
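
If the file is purely numeric apart from its header row, the same idea can be sketched more compactly with `np.loadtxt`. This is only an illustration under a few assumptions: `load_csv_dataset` is a hypothetical helper name, the last column is assumed to hold integer class labels, and the sample rows from above are written to a temporary file so that the snippet runs on its own.

```python
import csv
import os
import tempfile

import numpy as np
from sklearn.utils import Bunch


def load_csv_dataset(path):
    """Hypothetical helper: read a numeric CSV whose last column is the label."""
    # Read only the header row to recover the column names
    with open(path, newline='') as csv_file:
        header = next(csv.reader(csv_file))
    # skiprows=1 skips the header; ndmin=2 keeps a 2-D array even for a single row
    table = np.loadtxt(path, delimiter=',', skiprows=1, ndmin=2)
    return Bunch(
        data=table[:, :-1],               # every column except the last
        target=table[:, -1].astype(int),  # assumption: integer class labels
        feature_names=header[:-1],
    )


# The sample rows from my_fancy_dataset.csv, written to a temporary file
sample = (
    "feature_1,feature_2,feature_3,class_label\n"
    "5.9,1203,0.69,2\n"
    "7.2,902,0.52,0\n"
    "6.3,143,0.44,1\n"
    "-2.6,291,0.15,1\n"
    "1.8,486,0.37,0\n"
)
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "my_fancy_dataset.csv")
    with open(csv_path, "w") as f:
        f.write(sample)
    mfd = load_csv_dataset(csv_path)

print(mfd.data.shape)
print(mfd.target)
print(mfd.feature_names)
```

Taking the path as an argument lets one loader serve several files. For CSVs with text columns or missing values, the row-by-row `csv.reader` approach above is easier to adapt.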

answered Oct 16 '22 23:10

Tonechas
