I want to create my own dataset and use it in scikit-learn. Scikit-learn ships with some datasets, such as 'The Boston Housing Dataset' (.csv), which a user can load with:
from sklearn import datasets
boston = datasets.load_boston()
The code below then gets the data and target of this dataset:
X = boston.data
y = boston.target
My question is: how can I create my own dataset so that it can be used in the same way? Any answers are appreciated. Thanks!
Downloading datasets from the openml.org repository: openml.org is a public repository for machine learning data and experiments that allows everybody to upload open datasets. The sklearn.datasets package is able to download datasets from the repository using the function sklearn.
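The excerpt above is cut off before it names the function. One function in sklearn.datasets that downloads from openml.org is fetch_openml; whether it is the one the excerpt meant is an assumption. Below is a minimal sketch of that route. The wrapper name load_from_openml and the "iris" dataset in the comment are hypothetical, chosen only for illustration.

```python
from sklearn.datasets import fetch_openml


def load_from_openml(name, version):
    """Hypothetical wrapper: download a dataset from openml.org by name and version."""
    # as_frame=False asks for NumPy arrays rather than a pandas DataFrame.
    return fetch_openml(name=name, version=version, as_frame=False)


# Example call (needs network access; the dataset name is only an illustration):
# iris = load_from_openml("iris", 1)
# X, y = iris.data, iris.target
```

The returned object is a Bunch, so the data and target attributes are read the same way as with the built-in loaders. This only helps if your dataset is open and has been uploaded to openml.org first.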
Here's a quick and dirty way to achieve what you intend:
my_datasets.py
import numpy as np
import csv
from sklearn.utils import Bunch

def load_my_fancy_dataset():
    with open(r'my_fancy_dataset.csv') as csv_file:
        data_reader = csv.reader(csv_file)
        # The first row is the header; the last column holds the class label.
        feature_names = next(data_reader)[:-1]
        data = []
        target = []
        for row in data_reader:
            features = row[:-1]
            label = row[-1]
            data.append([float(num) for num in features])
            target.append(int(label))

        data = np.array(data)
        target = np.array(target)

    # Bunch is the dict-like container the built-in loaders return.
    return Bunch(data=data, target=target, feature_names=feature_names)
my_fancy_dataset.csv
feature_1,feature_2,feature_3,class_label
5.9,1203,0.69,2
7.2,902,0.52,0
6.3,143,0.44,1
-2.6,291,0.15,1
1.8,486,0.37,0
In [12]: import my_datasets
In [13]: mfd = my_datasets.load_my_fancy_dataset()
In [14]: X = mfd.data
In [15]: y = mfd.target
In [16]: X
Out[16]:
array([[ 5.900e+00,  1.203e+03,  6.900e-01],
       [ 7.200e+00,  9.020e+02,  5.200e-01],
       [ 6.300e+00,  1.430e+02,  4.400e-01],
       [-2.600e+00,  2.910e+02,  1.500e-01],
       [ 1.800e+00,  4.860e+02,  3.700e-01]])
In [17]: y
Out[17]: array([2, 0, 1, 1, 0])
In [18]: mfd.feature_names
Out[18]: ['feature_1', 'feature_2', 'feature_3']
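The built-in scikit-learn loaders also accept return_X_y=True to hand back a (data, target) tuple instead of a Bunch. Here is a minimal sketch of how the same loader could offer that option and read from any open file object. The name load_csv_dataset is hypothetical, and the CSV rows are inlined with io.StringIO only so the example runs on its own. It assumes, like the code above, that the last column is an integer class label.

```python
import csv
import io

import numpy as np
from sklearn.utils import Bunch

# Same rows as my_fancy_dataset.csv, inlined so the sketch is self-contained.
CSV_TEXT = """feature_1,feature_2,feature_3,class_label
5.9,1203,0.69,2
7.2,902,0.52,0
6.3,143,0.44,1
-2.6,291,0.15,1
1.8,486,0.37,0
"""


def load_csv_dataset(csv_file, return_X_y=False):
    """Hypothetical helper: read a CSV whose last column is an integer label."""
    reader = csv.reader(csv_file)
    header = next(reader)                  # first row holds the column names
    rows = [row for row in reader if row]  # skip blank lines
    data = np.array([[float(v) for v in row[:-1]] for row in rows])
    target = np.array([int(row[-1]) for row in rows])
    if return_X_y:
        return data, target
    return Bunch(data=data, target=target, feature_names=header[:-1])


# Tuple form, as with the built-in loaders' return_X_y=True.
X, y = load_csv_dataset(io.StringIO(CSV_TEXT), return_X_y=True)

# Bunch form, with attribute access as in the session above.
dataset = load_csv_dataset(io.StringIO(CSV_TEXT))
print(X.shape, y.tolist(), dataset.feature_names)
```

With a real file you would pass open('my_fancy_dataset.csv', newline='') instead of the StringIO object. Taking a file object rather than a hard-coded path keeps the loader reusable for other CSV files with the same layout.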