Diminishing the impact of one variable over output in a regression model

I am currently implementing a Gaussian process regression model and have run into problems when applying it to my use case. My model takes three input variables, one of which (theta) has a far more significant impact on the output than the other two, alpha1 and alpha2. The inputs and outputs have the following values (just a few rows, to make things easier to follow):

# X (theta, alpha1, alpha2)
array([[ 9.07660169,  0.61485493,  1.70396493],
       [ 9.51498486, -5.49212002, -0.68659511],
       [10.45737558, -2.2739529 , -2.03918961],
       [10.46857663, -0.4587848 ,  0.54434441],
       [ 9.10133699,  8.38066374,  0.66538822],
       [ 9.17279647,  0.36327109, -0.30558115],
       [10.36532505,  0.87099676, -7.73775872],
       [10.13681026, -1.64084098, -0.09169159],
       [10.38549264,  1.80633583,  1.3453195 ],
       [ 9.72533357,  0.55861224,  0.74180309])

# y
array([4.93483686, 5.66226844, 7.51133372, 7.54435854, 4.92758927,
       5.0955348 , 7.26606153, 6.86027353, 7.36488184, 6.06864003])

As can be seen, theta alters the value of y significantly, whereas changes in alpha1 and alpha2 have a much more subtle effect on y.

The situation I am facing is that I fit a model to my data and then run a Scipy minimization over that model, with one of the input variables held fixed during the minimization. The code below should illustrate this better:

import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

# model fitting
kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
model = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9, optimizer='fmin_l_bfgs_b')
model.fit(X, y)

# minimization: theta is pinned by giving it identical lower and upper bounds
bnds = np.array([(theta, theta),
                 (alpha1.min(), alpha1.max()),
                 (alpha2.min(), alpha2.max())])

x0 = [theta, alpha1.min(), alpha2.min()]

residual_plant = minimize(lambda x: -model.predict(np.array([x])),
                          x0, method='SLSQP', bounds=bnds,
                          options={'eps': np.radians(5)})

My goal is to set the first variable to a fixed value and study the impact that the other two variables, alpha1 and alpha2, have on the output y for that specific theta value. The reasoning behind the minimization is that I want to find the combination of alpha1 and alpha2 that returns the optimal y for this fixed theta. My concern is that theta influences the model so strongly that it hides the influence of alpha1 and alpha2, which could hurt the model on the task at hand. However, I cannot simply ignore theta or leave it out of the model, because I want to find the optimal y for a specific fixed theta, so I still need it as an input.

My question is: how do I deal with this issue? Is there a statistical trick to eliminate, or at least diminish, this influence without having to remove theta from my model? Is there a better way to approach my problem altogether?

asked Mar 01 '20 by fnaos

2 Answers

First, did you normalize the data before training?
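If not, here is a minimal sketch of what that could look like, assuming X is the input array from the question (StandardScaler is just one reasonable choice of scaler):

from sklearn.preprocessing import StandardScaler

# put theta, alpha1 and alpha2 on a comparable scale (zero mean, unit variance per column)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# fit the GP on X_scaled instead of X, and remember to transform any new
# points with the same scaler before calling model.predict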

Second, it sounds like you want to see the relationship between alpha1/alpha2 and y with theta held constant.

If you sort your dataset by theta, you can look for a group of records where theta is the same or very similar, i.e. where its variance is low. Take that group of data, form a new dataframe from it, and drop the theta column (since theta barely changes within this subset, it carries little information there). Then you can train your model, or do some data visualization, on just the alpha1 and alpha2 data, as sketched below.
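A minimal sketch of that filtering, assuming the data lives in a pandas DataFrame df with columns theta, alpha1, alpha2 and y (the target value 10.4 and the 0.2 tolerance are purely illustrative):

import pandas as pd

# keep only the rows whose theta falls in a narrow band around the value of interest
theta_target = 10.4
mask = (df['theta'] - theta_target).abs() < 0.2

# within this slice theta is close to constant, so drop it and study alpha1/alpha2 vs y
sub = df.loc[mask].drop(columns=['theta'])
print(sub[['alpha1', 'alpha2', 'y']].corr())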

answered Oct 19 '22 by Tdoggo

My overall understanding of your question is that you want to achieve two things:

  1. To study the effect of alpha1 and alpha2 after holding theta constant (i.e. eliminating the influence of theta on the model).

  2. To find the best combination of alpha1 and alpha2 that returns the optimal y for this fixed theta.

Both goals can be addressed by studying the correlation between the input variables and the target variable.

Since correlation measures how one variable changes in relation to another, taken pairwise, it gives you good insight into the influence of alpha1, alpha2 and theta on y.

Two interesting correlations exist to help you:

  1. Pearson's correlation: numerically reflects the strength of a linear relationship.
  2. Spearman's correlation: numerically reflects the strength of a monotonic relationship (it works on ranks, so it also captures relations that are not linear).

Let's give it a try:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame(columns=['theta', 'alpha1', 'alpha2', 'y'],
                  data=[[ 9.07660169,  0.61485493,  1.70396493, 4.93483686],
                       [ 9.51498486, -5.49212002, -0.68659511, 5.66226844],
                       [10.45737558, -2.2739529 , -2.03918961,  7.51133372],
                       [10.46857663, -0.4587848 ,  0.54434441, 7.54435854],
                       [ 9.10133699,  8.38066374,  0.66538822, 4.92758927],
                       [ 9.17279647,  0.36327109, -0.30558115, 5.0955348],
                       [10.36532505,  0.87099676, -7.73775872, 7.26606153],
                       [10.13681026, -1.64084098, -0.09169159, 6.86027353],
                       [10.38549264,  1.80633583,  1.3453195, 7.36488184],
                       [ 9.72533357,  0.55861224,  0.74180309, 6.06864003]])


plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap')
plt.show()

plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap')
plt.show()

[Pearson correlation heatmap]

[Spearman correlation heatmap]

As you can see, we got very good insights about the relation between theta, alpha1 and alpha2 with each other and with y.

According to Cohen's standard (roughly: |r| around 0.1 is weak, around 0.3 is medium, and 0.5 or above is strong), we can conclude that:

  • alpha1 and alpha2 have a medium correlation with y.
  • theta has a very strong correlation with y.
  • alpha1 has a weak linear correlation with alpha2, but a medium monotonic one.
  • alpha1 and alpha2 have a medium correlation with theta.

But wait a minute: since alpha1 and alpha2 each have a medium correlation with y, but only a weak-to-medium correlation with each other, we can exploit this to engineer a new feature L as a linear combination of alpha1 and alpha2, as follows:

Let m and n be two weights chosen so that the feature

L = m * alpha1 + n * alpha2

has the maximum possible correlation with y.

The optimal coefficients m and n depend on the variances and covariances of alpha1, alpha2 and y.

From that, we can derive the following closed-form solution:

m = [ (Cov(b,c) * Cov(a,b) - Cov(a,c) * Var(b)) / (Cov(a,c) * Cov(a,b) - Cov(b,c) * Var(a)) ] * n

where a, b and c correspond to alpha1, alpha2 and y respectively.

Since correlation is unaffected by rescaling L, we can fix n (or m) to 1 or -1, compute the other weight from the formula above, and use the result to engineer the new feature.
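For completeness, here is a brief sketch of where that expression comes from (my own derivation, in the same notation, with n fixed to 1):

Corr(L, c) = Cov(m*a + b, c) / sqrt(Var(m*a + b) * Var(c))
           = (m*Cov(a,c) + Cov(b,c)) / sqrt((m^2*Var(a) + 2*m*Cov(a,b) + Var(b)) * Var(c))

Differentiating with respect to m and setting the derivative to zero gives

Cov(a,c) * (m^2*Var(a) + 2*m*Cov(a,b) + Var(b)) = (m*Cov(a,c) + Cov(b,c)) * (m*Var(a) + Cov(a,b))

The quadratic terms cancel, and solving the remaining linear equation for m yields exactly the expression quoted above.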

cov = df[['alpha1', 'alpha2', 'y']].cov()

# applying the optimization function: a = alpha1 , b = alpha2 and c = y
# note that cov of a feature with itself = variance
coef = (cov['alpha2']['y'] * cov['alpha1']['alpha2'] - cov['alpha1']['y'] * cov['alpha2']['alpha2']) / \
       (cov['alpha1']['y'] * cov['alpha1']['alpha2'] - cov['alpha2']['y'] * cov['alpha1']['alpha1'])
# let n = 1 --> m = coef --> L = coef * alpha1 + alpha2 :  which is the new feature to add
df['alpha12'] = coef * df['alpha1'] + df['alpha2']

[Covariance / correlation heatmap after adding alpha12]

As you can see, the newly introduced alpha12 shows a noticeably stronger correlation with y than alpha1 or alpha2 on their own.
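If you prefer to verify this numerically rather than from the heatmap, a quick check using the df defined above could look like this:

# correlation of each candidate feature with the target
print(df[['alpha1', 'alpha2', 'alpha12']].corrwith(df['y'], method='pearson'))
print(df[['alpha1', 'alpha2', 'alpha12']].corrwith(df['y'], method='spearman'))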

Furthermore, regarding point 1, if you want to decrease the correlation between theta and y, recall that the correlation is given by:

Corr(theta, y) = Cov(theta, y) / [sqrt(Var(theta)) * sqrt(Var(y))]

So you can inflate the variance of theta: sample n points from some distribution that is independent of y and add them to theta as noise. The denominator grows while the covariance stays roughly the same, so the correlation shrinks. Save this noise array for future use in case you need to get back to the original theta, something like this:

cov = df[['y', 'theta']].cov()
print("Theta Variance :: Before = {}".format(cov['theta']['theta']))

np.random.seed(2020)  # add seed to make it reproducible for future undo
# create noise drawn from uniform distribution
noise = np.random.uniform(low=1.0, high=10., size=df.shape[0])
df['theta'] += noise  # add noise to increase variance
cov = df[['y', 'theta']].cov()
print("Theta Variance :: After = {}".format(cov['theta']['theta']))

# df['theta'] -= noise  # subtract the noise again to recover the original theta

plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap After Adding Noise to Theta\n')
plt.show()

plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap After Adding Noise to Theta\n')
plt.show()

Theta Variance :: Before = 0.3478030891329485

Theta Variance :: After = 7.552229545792681

[Pearson correlation heatmap after adding noise to theta]

[Spearman correlation heatmap after adding noise to theta]

Now alpha12 takes the lead and shows the strongest correlation with the target variable y.

answered Oct 19 '22 by Yahya