I'm new to machine learning and am trying to use the linear model estimators that scikit-learn provides to predict the price of a used car. I tried different combinations of linear models, like LinearRegression, Ridge, Lasso and ElasticNet, but in most cases all of them return a negative score (-0.6 <= score <= 0.1).
Someone told me that this is because of a multicollinearity problem, but I don't know how to solve it.
My sample code:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sqlalchemy import create_engine

engine = create_engine('sqlite:///path-to-db')
query = "SELECT mileage, carcass, engine, transmission, state, drive, customs_cleared, price FROM cars WHERE mark='some mark' AND model='some model' AND year='some year'"
df = pd.read_sql_query(query, engine)
df = df.dropna()
# shuffle the rows before splitting into train/test sets
df = df.reindex(np.random.permutation(df.index))
X_full = df[['mileage', 'carcass', 'engine', 'transmission', 'state', 'drive', 'customs_cleared']]
y_full = df['price']
# hold out the last 20% of the rows as a test set;
# note the integer division: -len(X_full)/5 is a float in Python 3 and breaks slicing
n_train = -(len(X_full) // 5)
X_train = X_full[:n_train]
X_test = X_full[n_train:]
y_train = y_full[:n_train]
y_test = y_full[n_train:]
predict = [[200000, 0, 2.5, 0, 0, 2, 0]]  # parameters of the car to predict (2D: one sample)
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)
y_estimate = model.predict(X_test)
print("Mean squared error: %.2f" % np.mean((y_estimate - y_test) ** 2))
print("Variance score: %.2f" % model.score(X_test, y_test))
print("Predicted price: ", model.predict(predict))
carcass, state, drive and customs_cleared are numeric codes that represent categorical types.
What is the correct way to implement this prediction? Maybe some data preprocessing, or a different algorithm?
Thanks in advance for any advice!
Interpreting linear regression coefficients: a positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.
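For example (a minimal sketch, assuming model is the fitted Ridge estimator from the question), you can inspect the sign of each fitted coefficient like this:

# print each feature's coefficient; the sign shows its direction of effect on price
feature_names = ['mileage', 'carcass', 'engine', 'transmission', 'state', 'drive', 'customs_cleared']
for name, coef in zip(feature_names, model.coef_):
    print("%s: %+.4f" % (name, coef))
print("intercept: %+.4f" % model.intercept_)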
The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a score of 0.0.
In practice, R² will be negative whenever your model's predictions are worse than a constant function that always predicts the mean of the data.
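As a tiny illustration with made-up numbers (using sklearn.metrics.r2_score, which is what model.score computes for regressors):

from sklearn.metrics import r2_score

y_true = [100, 200, 300]         # mean is 200
y_pred = [300, 200, 100]         # worse than always predicting the mean
print(r2_score(y_true, y_pred))  # -3.0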
Given that you are using Ridge regression, you should scale your variables, e.g. with StandardScaler or MinMaxScaler:
http://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling
Perhaps using a Pipeline:
http://scikit-learn.org/stable/modules/pipeline.html#pipeline-chaining-estimators
If you were using vanilla linear regression, scaling wouldn't matter; but with Ridge regression, the regularization penalty term (weighted by alpha) treats differently scaled variables differently. See this discussion on Cross Validated:
https://stats.stackexchange.com/questions/29781/when-should-you-center-your-data-when-should-you-standardize
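A minimal sketch of that idea, assuming the same X_train/X_test split as in the question (alpha=1.0 is just a placeholder):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# scale each feature to zero mean and unit variance before the penalized fit,
# so the alpha penalty acts on comparably scaled coefficients
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)
print("Variance score: %.2f" % model.score(X_test, y_test))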