 

Coefficients of Linear Model are way too large/low

While implementing a linear regression model on a bag of words, Python returned very large/small coefficient values. train_data_features contains all the words that occur in the training data. The training data consists of about 400 comments, each under 500 characters, with a rating between 0 and 5. I then created a bag of words for each document. When I perform a linear regression on the matrix of all bags of words,

from sklearn import linear_model

# Fit an ordinary least-squares model on the bag-of-words matrix,
# with the 0-5 rating as the target.
clf = linear_model.LinearRegression()
clf.fit(train_data_features, train['dim_hate'])

coef = clf.coef_
words = vectorizer.get_feature_names()

# Print every word together with its learned coefficient.
for i in range(len(words)):
    print(str(words[i]) + " " + str(coef[i]))

the results look very strange (here are just 3 examples out of roughly 4,000). These are the coefficients of the fitted regression function for each word:

btw -0.297473967075
land 54662731702.0
landesrekord -483965045.253

I'm very confused because the target variable is between 0 and 5, yet the coefficients are this extreme. Most of them are huge positive or negative numbers; I was expecting only values on the scale of the one for btw.

Do you have an idea why the results turn out like this?
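
For reference, a bag-of-words matrix like train_data_features is typically built with scikit-learn's CountVectorizer. The preprocessing code is not shown in the question, so the column name 'comment_text' and the max_features setting below are assumptions; this is only a minimal sketch of how such a matrix might have been produced:

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical reconstruction: 'train' is assumed to be a DataFrame holding
# the raw comment text in a column called 'comment_text'.
vectorizer = CountVectorizer(max_features=4000)
train_data_features = vectorizer.fit_transform(train['comment_text'])

# Roughly 400 documents by roughly 4,000 word counts.
print(train_data_features.shape)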

asked Jan 07 '23 by So S


1 Answer

It might be that your model is overfitting the data, since it is trying to match the outputs exactly. You are right to be worried and suspicious, because coefficients like these mean the model is probably overfitting your data and will not generalize well to new data. You can try one of two things:

  • Run LinearRegression(normalize=True) and see whether it helps with the coefficients. But this is only a temporary fix (and note that the normalize argument was removed in newer scikit-learn releases, so you may need to scale the features yourself instead).
  • Use Ridge regression instead. It is basically linear regression with an added penalty for coefficients that grow too large (a minimal sketch follows after this list).
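
Here is a minimal sketch of the Ridge approach, reusing the train_data_features, train['dim_hate'], and vectorizer names from the question (they are assumed to be defined exactly as there); the alpha value is just an illustrative default:

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# L2-penalized linear regression: alpha controls how strongly large
# coefficients are penalized (larger alpha = more shrinkage).
clf = Ridge(alpha=1.0)

# Optional sanity check: held-out R^2 scores show how well the model
# generalizes, which the training fit on ~400 documents x ~4,000 words cannot.
scores = cross_val_score(clf, train_data_features, train['dim_hate'], cv=5)
print(scores.mean())

clf.fit(train_data_features, train['dim_hate'])

coef = clf.coef_
# get_feature_names_out() in recent scikit-learn; older versions use get_feature_names().
words = vectorizer.get_feature_names_out()

# With the penalty in place, the coefficients should stay on a much
# smaller, comparable scale instead of exploding.
for word, c in zip(words, coef):
    print(word, c)

If picking alpha by hand feels arbitrary, RidgeCV can choose it by cross-validation over a list of candidate values.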
answered Jan 10 '23 by mprat