My accuracy is at 0.0 and I don't know why?

I am getting an accuracy of 0.0. I am using the boston housing dataset.

Here is my code:

import sklearn
from sklearn import datasets
from sklearn import svm, metrics
from sklearn import linear_model, preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
boston = datasets.load_boston()

x = boston.data
y = boston.target

train_data, test_data, train_label, test_label = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

model = KNeighborsClassifier()

lab_enc = preprocessing.LabelEncoder()
train_label_encoded = lab_enc.fit_transform(train_label)
test_label_encoded = lab_enc.fit_transform(test_label)

model.fit(train_data, train_label_encoded)
predicted = model.predict(test_data)
accuracy = model.score(test_data, test_label_encoded)
print(accuracy)

How can I increase the accuracy on this dataset?

Paul McBurney asked Nov 07 '22 12:11


1 Answer

The Boston dataset is meant for regression problems. From the docs:

Load and return the boston house-prices dataset (regression).

So it does not make sense to apply an ordinary label encoding, which treats the targets as if they were not samples of a continuous variable. For example, 12.3 and 12.4 are encoded as completely different labels even though they are very close to each other, and a prediction of 12.4 is scored as simply wrong when the real target is 12.3. This is not a right-or-wrong situation. In classification a prediction is either correct or not, but in regression the error is measured by how far off the prediction is, for example with mean squared error.
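To make the difference concrete, here is a minimal sketch in plain Python. The prices and predictions are made-up values (not taken from the dataset), and the two helper functions are hypothetical stand-ins for the usual metrics.

```python
# Made-up true prices and predictions that are close but never identical.
y_true = [12.3, 21.7, 33.0, 18.9]
y_pred = [12.4, 21.5, 33.2, 19.0]


def accuracy(true, pred):
    # Classification-style score: only exact matches count as correct.
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def mean_squared_error(true, pred):
    # Regression-style score: average squared distance to the target.
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)


acc = accuracy(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

print(acc)  # 0.0, even though every prediction is within 0.2 of its target
print(mse)  # small, because the predictions are close
```

Exact-match accuracy reports total failure here, while the squared error shows that the predictions are actually good.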

This part is not necessary, but here is an example using the same dataset and source code. The simple idea of rounding the labels toward zero (that is, truncating each label to its integer part) will give you some intuition.

5.0-5.9 -> 5
6.0-6.9 -> 6
...
50.0-50.9 -> 50

Let's change your code a little bit.

import numpy as np

def encode_func(labels):
    # int() truncates toward zero, so 12.3 and 12.9 both become class 12
    return np.array([int(l) for l in labels])

...

train_label_encoded = encode_func(train_label)
test_label_encoded = encode_func(test_label)

The accuracy will be around 10%.
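Since the targets are continuous, treating the task as regression is the more natural fit. The following is a minimal sketch, assuming scikit-learn is installed. It uses `make_regression` to generate synthetic data as a stand-in for the housing data (an assumption, because `load_boston` is no longer available in recent scikit-learn releases), so the numbers will not match the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the housing data (assumption): 13 features, continuous target.
x, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=0)

train_data, test_data, train_label, test_label = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# A regressor predicts continuous values, so no label encoding is needed.
model = KNeighborsRegressor()
model.fit(train_data, train_label)
predicted = model.predict(test_data)

# Evaluate with regression metrics instead of accuracy.
mse = mean_squared_error(test_label, predicted)
r2 = model.score(test_data, test_label)  # for regressors, score() returns R^2

print(mse, r2)
```

Here `score` reports the coefficient of determination rather than accuracy, so a prediction of 12.4 for a target of 12.3 is counted as nearly right instead of wrong.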

Alperen Cetin answered Nov 17 '22 12:11