While implementing a linear regression model on a bag of words, Python returned coefficients with very large magnitudes. train_data_features
contains all words that appear in the training data. The training data consists of about 400 comments, each under 500 characters, with a rating between 0 and 5. I then created a bag of words for each document and tried to perform a linear regression on the matrix of all bags of words:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(train_data_features, train['dim_hate'])
coef = clf.coef_
words = vectorizer.get_feature_names()
for i in range(len(words)):
    print(str(words[i]) + " " + str(coef[i]))
The results seem very strange (here are just 3 examples out of 4,000). They show the coefficients the fitted regression assigned to each word:
btw -0.297473967075
land 54662731702.0
landesrekord -483965045.253
I'm very confused, because the target variable is between 0 and 5 but the coefficients differ so wildly. Most of them have very large positive or negative values; I was expecting values like the one for btw. Do you have an idea why the results look like this?
Your model is most likely overfitting: with roughly 4,000 features (words) and only 400 samples, ordinary least squares can match the training outputs almost exactly, and near-collinear word columns push the coefficients to extreme values that cancel each other out. You're right to be suspicious, because such a model will not generalize well to new data. As a first step, you can try normalizing the features:
LinearRegression(normalize=True)
and see if it helps with the coefficients. But it will only be a temporary fix; note also that the normalize parameter is deprecated in recent versions of scikit-learn, where you would scale the features yourself instead.
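The more robust fix for this many-features/few-samples setting is regularization, which penalizes large coefficients directly. Below is a minimal sketch using Ridge (L2-penalized) regression on synthetic data standing in for your 400×4000 bag-of-words matrix; the random count matrix and the alpha value are assumptions for illustration, not your actual data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical stand-in for the 400 x 4000 bag-of-words matrix:
# far more features (words) than samples (comments), which is the
# regime where plain least squares produces extreme coefficients.
rng = np.random.RandomState(0)
X = rng.poisson(0.05, size=(400, 4000)).astype(float)  # sparse word counts
y = rng.uniform(0, 5, size=400)                        # ratings in [0, 5]

# Ridge adds an L2 penalty on the coefficient vector, which keeps
# the learned weights bounded even when features are collinear.
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(np.abs(ridge.coef_).max())  # stays on a sensible scale
```

Lasso (L1-penalized) regression works the same way and additionally drives most word coefficients exactly to zero, which can make the remaining ones easier to interpret.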