
How do I create my own dataset for use in scikit-learn?

I want to create my own dataset and use it in scikit-learn. Scikit-learn ships with some datasets, such as the Boston Housing dataset (.csv), which can be loaded like this:

from sklearn import datasets 
boston = datasets.load_boston()

and the code below retrieves the data and target of this dataset:

X = boston.data
y = boston.target

My question is: how can I create my own dataset so that it can be used in the same way? Any answers are appreciated, thanks!

Yuedong HU, asked Feb 24 '17 07:02




1 Answer

Here's a quick and dirty way to achieve what you intend:

my_datasets.py

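import csv

import numpy as np
from sklearn.utils import Bunch

def load_my_fancy_dataset():
    # newline='' is what the csv module recommends for files passed to csv.reader
    with open('my_fancy_dataset.csv', newline='') as csv_file:
        data_reader = csv.reader(csv_file)
        # The header row names every column; the last column is the class label
        feature_names = next(data_reader)[:-1]
        data = []
        target = []
        for row in data_reader:
            if not row:
                # Skip blank lines, e.g. a trailing newline at the end of the file
                continue
            features = row[:-1]
            label = row[-1]
            data.append([float(num) for num in features])
            target.append(int(label))

    # Convert the collected rows to NumPy arrays once the file is closed
    data = np.array(data)
    target = np.array(target)
    return Bunch(data=data, target=target, feature_names=feature_names)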

my_fancy_dataset.csv

feature_1,feature_2,feature_3,class_label
5.9,1203,0.69,2
7.2,902,0.52,0
6.3,143,0.44,1
-2.6,291,0.15,1
1.8,486,0.37,0
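
As a variation, a file laid out like the one above (one header row, numeric features, integer label in the last column) could probably also be parsed with NumPy instead of a manual loop. The following is only a sketch under those assumptions; `load_csv_as_bunch` and its `path` and `delimiter` parameters are hypothetical names chosen for illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.utils import Bunch


def load_csv_as_bunch(path, delimiter=","):
    """Hypothetical helper: header row first, class label in the last column."""
    # Read only the header line to recover the column names.
    with open(path, newline="") as csv_file:
        header = csv_file.readline().strip().split(delimiter)

    # Parse every remaining row as floats; ndmin=2 keeps the result
    # two-dimensional even if the file holds a single data row.
    table = np.loadtxt(path, delimiter=delimiter, skiprows=1, ndmin=2)

    return Bunch(
        data=table[:, :-1],                # all columns except the last
        target=table[:, -1].astype(int),   # last column, cast to integer labels
        feature_names=header[:-1],
    )
```

Calling `load_csv_as_bunch('my_fancy_dataset.csv')` should give an object with the same `data`, `target` and `feature_names` attributes as the hand-written loader. Note that this sketch assumes every column is numeric; a file with text columns would need the `csv` approach or a different parser.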

Demo

In [12]: import my_datasets

In [13]: mfd = my_datasets.load_my_fancy_dataset()

In [14]: X = mfd.data

In [15]: y = mfd.target

In [16]: X
Out[16]: 
array([[ 5.900e+00,  1.203e+03,  6.900e-01],
       [ 7.200e+00,  9.020e+02,  5.200e-01],
       [ 6.300e+00,  1.430e+02,  4.400e-01],
       [-2.600e+00,  2.910e+02,  1.500e-01],
       [ 1.800e+00,  4.860e+02,  3.700e-01]])

In [17]: y
Out[17]: array([2, 0, 1, 1, 0])

In [18]: mfd.feature_names
Out[18]: ['feature_1', 'feature_2', 'feature_3']
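
Since the question asks about using the dataset in scikit-learn, here is a rough sketch of passing the `data` and `target` arrays to an estimator. The choice of `DecisionTreeClassifier` is an assumption made purely for illustration, and the `Bunch` is rebuilt inline with the values from my_fancy_dataset.csv so the snippet runs without the file.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import Bunch

# Same values as my_fancy_dataset.csv, built inline so the snippet is self-contained.
mfd = Bunch(
    data=np.array([
        [5.9, 1203, 0.69],
        [7.2, 902, 0.52],
        [6.3, 143, 0.44],
        [-2.6, 291, 0.15],
        [1.8, 486, 0.37],
    ]),
    target=np.array([2, 0, 1, 1, 0]),
    feature_names=["feature_1", "feature_2", "feature_3"],
)

# Any estimator that accepts (X, y) arrays could be used here;
# the decision tree is just an example.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(mfd.data, mfd.target)

# Predict on the training rows only to show the call shape.
predictions = clf.predict(mfd.data)
```

With only five rows this says nothing about model quality; it just shows that the `Bunch` attributes plug into `fit` and `predict` the same way as those of the built-in datasets.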
Tonechas, answered Oct 16 '22 23:10