Finding the mean squared error for a linear regression in python (with scikit learn)

I am trying to do a simple linear regression in python with the x-variable being the word count of a project description and the y-value being the funding speed in days.

I am a bit confused: the root mean square error (RMSE) is 13.77 for the test data and 13.88 for the training data. First, shouldn't the RMSE be between 0 and 1? And second, shouldn't the RMSE for the test data be higher than for the training data? So I guess I did something wrong, but I am not sure where the mistake is.

Also, I need to know the weight coefficient for the regression, but unfortunately I don't know how to print it, as it seems to be hidden inside the sklearn model object. Can anyone help out here?

This is what I have so far:

import numpy as np
import matplotlib.pyplot as plt
import sqlite3
from sklearn.model_selection import train_test_split
from sklearn import linear_model

con = sqlite3.connect('database.db')
cur = con.cursor()

# y-variable in regression is funding speed ("DAYS_NEEDED")    
cur.execute("SELECT DAYS_NEEDED FROM success")
y = cur.fetchall()                  # list of tuples
y = np.array([i[0] for i in y])     # 1-D int array; y.shape = (1324476,)

# x-variable in regression is the project description length ("WORD_COUNT")
cur.execute("SELECT WORD_COUNT FROM success")
x = cur.fetchall()
x = np.array([i[0] for i in x])     # 1-D int array; x.shape = (1324476,)

# Get the train and test data split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Fit a model
lm = linear_model.LinearRegression()
x_train = x_train.reshape(-1, 1)    # new shape: (1059580, 1)
y_train = y_train.reshape(-1, 1)    # new shape: (1059580, 1)
model = lm.fit(x_train, y_train)
x_test = x_test.reshape(-1, 1)      # new shape: (264896, 1)
predictions_test = lm.predict(x_test)
predictions_train = lm.predict(x_train)

print("y_test[5]: ", y_test[5])     # 14
print("predictions[5]: ", predictions_test[5]) # [ 12.6254537]

# Calculate the root mean square error (RMSE) for test and training data
N = len(y_test)
rmse_test = np.sqrt(np.sum((np.array(y_test).flatten() - np.array(predictions_test).flatten())**2)/N)
print("RMSE TEST: ", rmse_test)     # 13.770731326

N = len(y_train)
rmse_train = np.sqrt(np.sum((np.array(y_train).flatten() - np.array(predictions_train).flatten())**2)/N)
print("RMSE train: ", rmse_train)   # 13.8817814595

Any help is much appreciated! Thanks!

Christina asked Apr 12 '26 07:04
1 Answer

  1. RMSE has the same unit as the dependent variable, so it is not restricted to the range 0 to 1. That also means context decides what counts as good: if the variable you're trying to predict ranges from 0 to 100, an RMSE of 99 is terrible, while an RMSE of 5 on that same range is spectacular. But if the RMSE is 5 for data ranging from 1 to 10, then you have a problem! In short, judge RMSE against the spread of the target.

  2. Since the RMSE of your train and test sets is similar, pat yourself on the back: you've actually done a good job, and the model generalizes well. It's when the test RMSE is clearly higher than the train RMSE that you should suspect overfitting; a small difference in either direction is just noise from the random split.
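To illustrate point 1 above, here is a minimal sketch on synthetic data (the word counts and funding speeds here are made up, not your database): it computes the RMSE with sklearn.metrics.mean_squared_error, which saves you the manual formula, and prints the target's range so you can judge the error against it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = rng.integers(10, 500, size=1000).reshape(-1, 1)  # hypothetical word counts
y = 0.05 * x.ravel() + rng.normal(0, 5, size=1000)   # hypothetical funding speeds in days

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(x_train, y_train)

# RMSE via sklearn: take the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))

# Judge RMSE against the spread of the target, not against [0, 1]
print(f"RMSE: {rmse:.2f}, target spans {y.min():.1f} to {y.max():.1f}")
```

An RMSE of roughly 5 here is fine, because the synthetic noise has a standard deviation of 5 and the target spans several tens of days.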

As Umang said in the comments, use model.coef_ and model.intercept_ to print the weights your model has calculated to be optimal.
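A quick sketch of reading those attributes, using made-up noise-free data y = 2x + 1 so the fit recovers the weights exactly (with a 1-D y, coef_ has one entry per feature; your 2-D y would give a 2-D coef_ instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 2*x + 1
x = np.arange(10).reshape(-1, 1)
y = 2 * x.ravel() + 1

model = LinearRegression().fit(x, y)
print("slope (weight):", model.coef_)   # one coefficient per feature
print("intercept:", model.intercept_)
```

This prints a slope of 2 and an intercept of 1, matching the line the data was built from.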

imperialgendarme answered Apr 18 '26 05:04