While implementing a linear regression model on a bag of words, Python returned coefficients with very large magnitudes. train_data_features
contains all words that appear in the training data. The training data consists of about 400 comments, each under 500 characters, with a rating between 0 and 5. I then created a bag of words for each document and tried to perform a linear regression on the matrix of all bags of words:
from sklearn import linear_model
clf = linear_model.LinearRegression()
clf.fit(train_data_features, train['dim_hate'])
coef = clf.coef_
words = vectorizer.get_feature_names()
for i in range(len(words)):
    print(str(words[i]) + " " + str(coef[i]))
The results seem very strange (here are just 3 examples out of 4,000). They show the coefficients the fitted regression assigned to each word:
btw -0.297473967075
land 54662731702.0
landesrekord -483965045.253
I'm very confused, because the target variable is between 0 and 5 but the coefficients differ so wildly. Most of them have very large positive or negative values; I was expecting values like the one for btw. Do you have an idea why the results look like this?
Your model is most likely overfitting: with roughly 4,000 features (words) and only 400 samples, ordinary least squares can match the training outputs almost exactly, and near-collinear word columns push the coefficients to extreme values that cancel each other out. You're right to be suspicious, because such a model will not generalize well to new data. As a first step, you can try normalizing the features:
LinearRegression(normalize=True)
and see if it helps with the coefficients. But it will only be a temporary fix; note also that the normalize parameter is deprecated in recent versions of scikit-learn, where you would scale the features yourself instead.
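The more robust fix for this many-features/few-samples setting is regularization, which penalizes large coefficients directly. Below is a minimal sketch using Ridge (L2-penalized) regression on synthetic data standing in for your 400×4000 bag-of-words matrix; the random count matrix and the alpha value are assumptions for illustration, not your actual data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical stand-in for the 400 x 4000 bag-of-words matrix:
# far more features (words) than samples (comments), which is the
# regime where plain least squares produces extreme coefficients.
rng = np.random.RandomState(0)
X = rng.poisson(0.05, size=(400, 4000)).astype(float)  # sparse word counts
y = rng.uniform(0, 5, size=400)                        # ratings in [0, 5]

# Ridge adds an L2 penalty on the coefficient vector, which keeps
# the learned weights bounded even when features are collinear.
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)
print(np.abs(ridge.coef_).max())  # stays on a sensible scale
```

Lasso (L1-penalized) regression works the same way and additionally drives most word coefficients exactly to zero, which can make the remaining ones easier to interpret.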