
Mean squared error returning unreasonably high numbers

I'm trying to predict the profit each film made on IMDb.

My dataframe and features are as follows:

   Actor1  Actor2  Actor3  Actor4   Day  Director  Genre1  Genre2  Genre3  \
0       0       0       0       0  19.0         0       0       0       0   
1       1       1       1       1   6.0         1       1       1       1   
2       2       2       2       2  20.0         2       0       2       2   
3       3       3       3       3   9.0         3       2       0      -1   
4       4       4       4       4   9.0         4       3       3       3   

   Language  Month  Production  Rated  Runtime  Writer    Year    BoxOffice
0         1      0           0      0    118.0       0  2007.0   37500000.0
1         2      1           1      0    151.0       1  2006.0  132300000.0
2         1      1           2      1    130.0       2  2006.0   53100000.0
3         1      2           1      0    117.0       3  2007.0  210500000.0
4         4      3           3      2    117.0       4  2006.0  244052771.0

and the value I'm trying to predict (target) is the BoxOffice.

I'm following documentation for sklearn exactly as it is (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error)

from sklearn import preprocessing, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score

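# dataset is assumed to be a 2-D NumPy array holding the 17 columns shown above
X = dataset[:, 0:16]  # features: the first 16 columns
Y = dataset[:, 16]    # target: the last column, BoxOffice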

X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.33)

regr = linear_model.LinearRegression()
regr.fit(X_train,Y_train)
mean_squared_error(Y_test, regr.predict(X_test))

and the output is always something along the lines of: 11385650623660550 ($11,385,650,623,660,500.00)

While the mean of the BoxOffice is: 107989121

etc.

I've tried several approaches, including cross-validation and other models (Keras), and feel like I've tried everything.

The returned value is extremely high, which makes me suspect that the problem is not in the model or the data but in something else that I'm missing.

Anton asked Sep 13 '25 15:09



1 Answer

I think your problem is not related to mean squared error; it is the model itself.
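For a sense of scale, keep in mind that MSE is reported in squared units of the target (squared dollars here), so it is not directly comparable to the mean BoxOffice. Taking the square root (RMSE) puts the error back in dollars. A minimal sketch using only the two numbers quoted in the question; the variable names are mine:

```python
import math

mse = 11385650623660550       # MSE quoted in the question (squared dollars)
mean_box_office = 107989121   # mean BoxOffice quoted in the question (dollars)

rmse = math.sqrt(mse)         # back in dollars, comparable to the mean
ratio = rmse / mean_box_office

print(f"RMSE: {rmse:,.0f}")
print(f"RMSE / mean BoxOffice: {ratio:.2f}")
```

The RMSE comes out at about the same size as the mean BoxOffice, so the model is still predicting poorly, but the error is nowhere near the quadrillions the raw MSE suggests.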

For your categorical features, I recommend trying another encoding method such as OneHotEncoder. LabelEncoder is not a good option for linear regression, because the integer codes imply an order and spacing between categories that does not exist.

(For more information: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f)
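Here is a rough sketch of the difference, not your actual pipeline. The data is made up: column 0 stands in for a label-encoded feature such as Genre1, column 1 for a numeric one such as Runtime, and the target is built so that each category has its own effect unrelated to its integer code. It compares a linear regression on the raw codes with one where column 0 goes through OneHotEncoder inside a ColumnTransformer:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: column 0 is a label-encoded category (think Genre1),
# column 1 is a numeric feature (think Runtime).
rng = np.random.default_rng(0)
genre = rng.integers(0, 4, size=200)
runtime = rng.uniform(80, 180, size=200)
X = np.column_stack([genre, runtime])

# Made-up target: each category has its own offset, unrelated to its code.
genre_effect = np.array([50.0, -30.0, 80.0, 10.0])
y = genre_effect[genre] + 0.5 * runtime + rng.normal(0, 1, size=200)

# Model 1: the integer code is treated as an ordinary number.
as_number = LinearRegression().fit(X, y)

# Model 2: one-hot encode column 0, pass column 1 through unchanged.
encode = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
as_category = make_pipeline(encode, LinearRegression()).fit(X, y)

r2_number = as_number.score(X, y)
r2_category = as_category.score(X, y)
print(f"R^2 with integer codes: {r2_number:.3f}")
print(f"R^2 with one-hot codes: {r2_category:.3f}")
```

With columns like Actor1 or Director, one-hot encoding can create a very large number of columns, so you may need to group rare values first.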

Before training your model, look at the correlation of your numeric features with the target variable; some of them may be irrelevant. For categorical features, you can try different methods to analyze their relationship with the target (such as boxplots).
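As a minimal sketch of the correlation check, this uses only the five rows printed in the question, which is far too few to conclude anything; on the full dataset you would run the same thing over every numeric column. The dictionary layout is just my choice for illustration:

```python
import numpy as np

# The five rows printed in the question; illustrative only.
runtime = np.array([118.0, 151.0, 130.0, 117.0, 117.0])
year = np.array([2007.0, 2006.0, 2006.0, 2007.0, 2006.0])
box_office = np.array(
    [37500000.0, 132300000.0, 53100000.0, 210500000.0, 244052771.0]
)

# Pearson correlation of each numeric feature with the target.
correlations = {
    name: float(np.corrcoef(values, box_office)[0, 1])
    for name, values in {"Runtime": runtime, "Year": year}.items()
}
for name, r in correlations.items():
    print(f"{name}: {r:+.2f}")
```

A value close to zero suggests the feature has little linear relationship with BoxOffice, although it can still matter in a non-linear model.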

Linear regression needs continuous variables, so you may want to try other algorithms as well. Just make sure that you have enough background before applying them.

demirbilek answered Sep 16 '25 05:09
