
Linear regression gives worse results after normalization or standardization

I'm performing linear regression on this dataset: archive.ics.uci.edu/ml/datasets/online+news+popularity

It contains various types of features - rates, binary, numbers etc.

I've tried scikit-learn's Normalizer, StandardScaler and PowerTransformer, but they've all given worse results than using no transformer at all.

I'm using them like this:

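import pandas as pd
from sklearn.preprocessing import StandardScaler

# df is the Online News Popularity dataset loaded into a DataFrame
X = df.drop(columns=['url', 'shares'])
Y = df['shares']
transformer = StandardScaler().fit(X)
X_scaled = transformer.transform(X)
# transform() returns a NumPy array, so restore the column names
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
perform_linear_and_ridge_regression(X=X_scaled, Y=Y)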

The function on the last line, perform_linear_and_ridge_regression(), uses GridSearchCV to determine the best hyperparameters, and I'm confident it is correct.

Just to make sure, here is the function as well:

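from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split

def perform_linear_and_ridge_regression(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=10)
    # Grid-search over whether to fit an intercept, with 5-fold cross-validation
    lin_reg_parameters = { 'fit_intercept': [True, False] }
    lin_reg = GridSearchCV(LinearRegression(), lin_reg_parameters, cv=5)
    lin_reg.fit(X=X_train, y=Y_train)
    Y_pred = lin_reg.predict(X_test)
    # Note: "MAE" here is the median absolute error, not the mean absolute error
    print('Linear regression MAE =', median_absolute_error(Y_test, Y_pred))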

The results are surprising, because every transformer gives a worse score than the original data:

Linear reg. on original data: MAE = 1620.510555135375

Linear reg. after using Normalizer: MAE = 1979.8525218964242

Linear reg. after using StandardScaler: MAE = 2915.024521207241

Linear reg. after using PowerTransformer: MAE = 1663.7148884463259

Is this just a special case where standardization doesn't help, or am I doing something wrong?

EDIT: Even when I leave the binary features out, most of the transformers give worse results.

asked Feb 03 '26 20:02 by adamlowlife


1 Answer

Your dataset has many categorical and ordinal features. You should handle those separately first. It also looks like you are applying normalization to the categorical variables too, which is completely wrong.

Here is a nice link that explains how to handle categorical features in a regression problem.
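As a rough illustration of treating the two kinds of columns separately, here is a minimal sketch that scales only the continuous columns and passes the 0/1 columns through untouched, using scikit-learn's ColumnTransformer inside a Pipeline. The helper names (split_binary_and_continuous, build_model) and the toy columns (n_tokens, rate, is_weekend) are made up for this example and are not taken from the question's code. The rule "a column is binary if it only contains 0 and 1" is also an assumption that may not fit every feature in the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def split_binary_and_continuous(X):
    """Hypothetical helper: treat a column as binary if it only holds 0/1."""
    binary_cols = [c for c in X.columns if set(X[c].dropna().unique()) <= {0, 1}]
    continuous_cols = [c for c in X.columns if c not in binary_cols]
    return binary_cols, continuous_cols


def build_model(X):
    """Hypothetical helper: scale continuous columns only, then fit OLS."""
    _, continuous_cols = split_binary_and_continuous(X)
    preprocess = ColumnTransformer(
        transformers=[("scale", StandardScaler(), continuous_cols)],
        remainder="passthrough",  # binary columns are left as they are
    )
    return Pipeline([("preprocess", preprocess), ("regress", LinearRegression())])


# Toy data standing in for the real dataset (column names are invented)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "n_tokens": rng.integers(100, 2000, size=200).astype(float),
    "rate": rng.random(200),
    "is_weekend": rng.integers(0, 2, size=200),
})
target = 3.0 * toy["rate"] + 2.0 * toy["is_weekend"] + rng.normal(0, 0.1, 200)

binary_cols, continuous_cols = split_binary_and_continuous(toy)
model = build_model(toy).fit(toy, target)

# Scaled columns come first in the output, passthrough columns after them
transformed = model.named_steps["preprocess"].transform(toy)
```

Because the scaler sits inside the pipeline, it is fitted only on whatever data the pipeline itself is fitted on, so passing such a pipeline to GridSearchCV or fitting it on X_train alone keeps the test rows out of the scaling statistics.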

answered Feb 05 '26 11:02 by Ankish Bansal


