Why do scikit-learn linear models return a negative score?

I'm new to machine learning, and I'm trying to use the linear model estimators that scikit-learn provides to predict the price of a used car. I tried different linear models, such as LinearRegression, Ridge, Lasso and ElasticNet, but in most cases all of them return a negative score (-0.6 <= score <= 0.1).

Someone told me that this is due to a multicollinearity problem, but I don't know how to solve it.

My sample code:

import numpy as np
import pandas as pd
from sklearn import linear_model
from sqlalchemy import create_engine
from sklearn.linear_model import Ridge

engine = create_engine('sqlite:///path-to-db')

query = "SELECT mileage, carcass, engine, transmission, state, drive, customs_cleared, price FROM cars WHERE mark='some mark' AND model='some model' AND year='some year'"
df = pd.read_sql_query(query, engine)
df = df.dropna()
df = df.reindex(np.random.permutation(df.index))

X_full = df[['mileage', 'carcass', 'engine', 'transmission', 'state', 'drive', 'customs_cleared']]
y_full = df['price']

n_test = len(X_full) // 5  # hold out 20% of the rows for testing (integer division)
X_train = X_full[:-n_test]
X_test = X_full[-n_test:]
y_train = y_full[:-n_test]
y_test = y_full[-n_test:]

predict = [200000, 0, 2.5, 0, 0, 2, 0] # parameters of the car to predict

model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_estimate = model.predict(X_test)

print("Mean squared error: %.2f" % np.mean((y_estimate - y_test) ** 2))
print("Variance score (R^2): %.2f" % model.score(X_test, y_test))
print("Predicted price: ", model.predict([predict]))  # predict() expects a 2-D array

carcass, state, drive and customs_cleared are numeric codes that represent categories.
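As a side note, the manual slicing above can be replaced with scikit-learn's train_test_split, which shuffles and splits in one call. A minimal sketch with a toy DataFrame standing in for the question's X_full/y_full:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the question's X_full / y_full; any DataFrame/Series works.
X_full = pd.DataFrame(np.arange(50).reshape(25, 2), columns=["a", "b"])
y_full = pd.Series(np.arange(25))

# 80/20 split with shuffling, replacing the manual negative-index slicing.
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_full, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 20 5
```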

What is the correct way to implement this prediction? Maybe some data preprocessing, or a different algorithm?

Thanks for any advice!

Asked Jun 07 '15 by Shyngys Kassymov


1 Answer

Given that you are using Ridge regression, you should scale your variables using StandardScaler or MinMaxScaler:

http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling

Perhaps using a Pipeline:

http://scikit-learn.org/stable/modules/pipeline.html#pipeline-chaining-estimators

If you were using vanilla linear regression, scaling wouldn't matter; but with Ridge regression, the regularization penalty (controlled by alpha) treats differently scaled variables differently. See this discussion on Cross Validated:

https://stats.stackexchange.com/questions/29781/when-should-you-center-your-data-when-should-you-standardize
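A minimal sketch of what that looks like, using toy data with wildly different feature scales in place of the question's car features (swap in the question's X_train/y_train):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Toy data: 7 columns with very different scales, like mileage vs. binary flags.
rng = np.random.RandomState(0)
X = rng.rand(100, 7) * [200000, 3, 4, 1, 2, 2, 1]
y = X @ rng.rand(7) + rng.randn(100)

# The scaler standardizes each column before Ridge sees it, so the alpha
# penalty shrinks all coefficients on an equal footing.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X[:80], y[:80])
print("R^2 on held-out data: %.2f" % model.score(X[80:], y[80:]))
```

The pipeline also guarantees the scaler is fit only on the training rows, so no information from the test set leaks into the preprocessing.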

Answered Oct 12 '22 by Andreus