I am trying to calculate the root mean squared error (RMSE) from a pandas DataFrame. I have checked previous posts on Stack Overflow, such as "Root mean square error in Python", and the scikit-learn documentation at http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html, but I am still doing something wrong. I was hoping someone could shed some light on what that is. Here is the dataset. Here is my code.
import pandas as pd
import numpy as np
sales = pd.read_csv("home_data.csv")
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(sales, train_size=0.8)
from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y = train_data.price
# Build the linear regression object
lm = LinearRegression()
# Train the model using the training sets
lm.fit(X, y)
# Print the y intercept
print(lm.intercept_)
# Print the coefficients
print(lm.coef_)
# Predict for a single value; predict() expects a 2-D array (one row, one feature)
lm.predict([[300]])
from math import sqrt
from sklearn.metrics import mean_squared_error
y_true = train_data.price.loc[0:5]
test_data = test_data[['price']].reset_index()
y_pred = test_data.price.loc[0:5]
predicted = y_pred.to_numpy()
actual = y_true.to_numpy()
mean_squared_error(actual, predicted)
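For reference, printing the shapes of the two arrays just before the comparison shows how they line up (a quick diagnostic sketch, reusing the variables above):
# Diagnostic: inspect the two arrays being compared
print(actual.shape, predicted.shape)  # the two slices need not be the same length
print(actual[:3])     # first few *training-set* prices
print(predicted[:3])  # first few *test-set* prices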
So this is what worked for me. I had to reshape the test set's sqft_living values from a 1-D row into a 2-D column before predicting (a minimal reshape sketch follows, then the full code).
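In NumPy terms, "row to column" means turning an array of shape (n,) into shape (n, 1), which is the layout predict() expects. A minimal sketch of just the reshape, using a made-up three-element array:
import numpy as np

row = np.array([1180, 2570, 770])  # shape (3,): a 1-D "row" of values
col = row.reshape(-1, 1)           # shape (3, 1): a single column; -1 infers the row count
print(col.shape)                   # (3, 1)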
from sklearn.linear_model import LinearRegression
X = train_data[['sqft_living']]
y = train_data.price
# Build the linear regression object
lm = LinearRegression()
# Train the model using the training sets
lm.fit(X, y)
# Pull the test-set feature out as a 1-D array...
test_X = test_data.sqft_living.values
print(test_X)
print(np.shape(test_X))
print(len(test_X))
# ...and reshape it into a single column; -1 lets NumPy infer the row count
test_X = np.reshape(test_X, (-1, 1))
print(test_X)
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
MSE = mean_squared_error(y_true=test_data.price.values, y_pred=lm.predict(test_X))
print(MSE)
print(MSE**0.5)  # RMSE: the square root puts the error back in price units
You're comparing test-set labels to training-set labels. I believe that what you actually want to do is compare test-set labels to predicted test-set labels.
For example:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
sales = pd.read_csv("home_data.csv")
train_data, test_data = train_test_split(sales, train_size=0.8)
# Train the model
X = train_data[['sqft_living']]
y = train_data.price
lm = LinearRegression()
lm.fit(X, y)
# Predict on the test data
X_test = test_data[['sqft_living']]
y_test = test_data.price
y_pred = lm.predict(X_test)
# Compute the root-mean-square
rms = np.sqrt(mean_squared_error(y_test, y_pred))
print(rms)
# 260435.511036
Note that scikit-learn can in general handle pandas DataFrames and Series inputs without explicit conversion to NumPy arrays. The error in the code snippet in your question comes from the fact that the two arrays passed to mean_squared_error() are different sizes: y_true is sliced by label from the shuffled training set while y_pred is sliced from the reset-index test set, so the two slices are not guaranteed to contain the same number of rows (and in any case hold training prices on one side and test prices on the other).
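For instance, the RMSE above can be computed by passing the pandas objects straight through, with no .values or manual reshaping; a small sketch reusing the fitted lm and test_data from the example above:
# pandas DataFrame/Series inputs work directly with scikit-learn
y_pred = lm.predict(test_data[['sqft_living']])             # DataFrame in, ndarray out
rms = np.sqrt(mean_squared_error(test_data.price, y_pred))  # a Series is fine here
print(rms)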