Currently I am implementing a Gaussian process regression model and I have been having some problems when trying to apply it to my problem. My model takes three input variables, one of which (theta) has a much more significant impact on the output than the other two, alpha1 and alpha2. The inputs and outputs have the following values (just a few values to better understand):
# X (theta, alpha1, alpha2)
array([[ 9.07660169, 0.61485493, 1.70396493],
[ 9.51498486, -5.49212002, -0.68659511],
[10.45737558, -2.2739529 , -2.03918961],
[10.46857663, -0.4587848 , 0.54434441],
[ 9.10133699, 8.38066374, 0.66538822],
[ 9.17279647, 0.36327109, -0.30558115],
[10.36532505, 0.87099676, -7.73775872],
[10.13681026, -1.64084098, -0.09169159],
[10.38549264, 1.80633583, 1.3453195 ],
[ 9.72533357, 0.55861224, 0.74180309]])
# y
array([4.93483686, 5.66226844, 7.51133372, 7.54435854, 4.92758927,
5.0955348 , 7.26606153, 6.86027353, 7.36488184, 6.06864003])
As can be seen, theta significantly alters the value of y, whereas changes in alpha1 and alpha2 have a much subtler effect on y.
The situation I am facing is that I fit a model to my data and then run a Scipy minimization over the model's prediction, with one of the input variables held fixed during the minimization. The code below might illustrate this better:
# model fitting
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
model = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9,
                                 optimizer='fmin_l_bfgs_b')
model.fit(X, y)
# minimization
import numpy as np
from scipy.optimize import minimize

bnds = np.array([(theta, theta),
                 (alpha1.min(), alpha1.max()),
                 (alpha2.min(), alpha2.max())])
x0 = [theta, alpha1.min(), alpha2.min()]
residual_plant = minimize(lambda x: -model.predict(np.array([x])),
                          x0, method='SLSQP', bounds=bnds,
                          options={'eps': np.radians(5)})
My goal is to fix the first variable (theta) at a specific value and study the impact that the other two variables, alpha1 and alpha2, have on the output y for that specific theta. The reasoning behind the minimization is that I want to find the combination of alpha1 and alpha2 that returns the optimal y for this fixed theta. I believe that theta, having a much heavier weight, is drastically dominating the influence that my other two variables have on the output, and this might be negatively affecting my model for the task at hand by hiding the influence of alpha1 and alpha2. However, I cannot simply ignore theta or leave it out of my model, because I want to find the optimal y for this fixed theta, and therefore I still need it as an input.
My question is: how do I deal with this issue? Is there any statistical trick to eliminate, or at least diminish, this influence without having to remove theta from my model? Is there a better way to deal with my problem?
First, did you normalize the data before training?
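If not, here is a minimal sketch of what that could look like, reusing the X, y and kernel from the question (StandardScaler, make_pipeline and normalize_y are standard sklearn features, shown here only as one possible option):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
# scale theta, alpha1 and alpha2 to comparable ranges before the GP sees them;
# normalize_y does the analogous thing for the target internally
model = make_pipeline(StandardScaler(),
                      GaussianProcessRegressor(kernel=kernel,
                                               n_restarts_optimizer=9,
                                               normalize_y=True))
model.fit(X, y)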
Second, it sounds like you want to see the relationship between x and y with a constant theta.
If you take your dataset and sort it by theta, you can try to find a group of records where theta is the same or very similar, where its variance is low and it doesn't change much. You can take that group of data and form a new dataframe, and drop the theta column (because we picked a portion of the dataset where theta has a low variance and so it isn't very useful). Then, you can train your model or do some data visualization on just the alpha1 and alpha2 data.
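A rough sketch of that idea, assuming X and y are the arrays from the question (the theta value 10.4 and the 0.1 tolerance are just illustrative choices):

import numpy as np
import pandas as pd

df = pd.DataFrame(X, columns=['theta', 'alpha1', 'alpha2'])
df['y'] = y

# keep only the records whose theta is close to one value, e.g. around 10.4
target_theta = 10.4                                    # illustrative value
subset = df[(df['theta'] - target_theta).abs() < 0.1]

# theta barely varies inside this subset, so drop it and study alpha1/alpha2 alone
subset = subset.drop(columns='theta')
print(subset)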
My overall understanding of your question is that you want to achieve two things:
1. To study the effect of alpha1 and alpha2 after turning theta into a constant (i.e. eliminating the influence of theta on the model).
2. To find the best combination of alpha1 and alpha2 that returns the optimal y for this fixed theta.
Both can be summarized under the study of the correlation between the input variables and the target variable. Since correlation measures how one variable changes in relation to another, independently of the rest, you can get good insight into the influence of alpha1, alpha2 and theta on y.
Two interesting correlation measures can help you here: the Pearson correlation (which captures linear relationships) and the Spearman rank correlation (which captures monotonic relationships).
Let's give it a try:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.DataFrame(columns=['theta', 'alpha1', 'alpha2', 'y'],
data=[[ 9.07660169, 0.61485493, 1.70396493, 4.93483686],
[ 9.51498486, -5.49212002, -0.68659511, 5.66226844],
[10.45737558, -2.2739529 , -2.03918961, 7.51133372],
[10.46857663, -0.4587848 , 0.54434441, 7.54435854],
[ 9.10133699, 8.38066374, 0.66538822, 4.92758927],
[ 9.17279647, 0.36327109, -0.30558115, 5.0955348],
[10.36532505, 0.87099676, -7.73775872, 7.26606153],
[10.13681026, -1.64084098, -0.09169159, 6.86027353],
[10.38549264, 1.80633583, 1.3453195, 7.36488184],
[ 9.72533357, 0.55861224, 0.74180309, 6.06864003]])
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap')
plt.show()
plt.figure(figsize=(10, 8))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap')
plt.show()
As you can see, we get very good insight into how theta, alpha1 and alpha2 relate to each other and to y.
According to Cohen's standard, we can conclude that theta is strongly correlated with y, while alpha1 and alpha2 are only moderately correlated with y and weakly (to moderately) correlated with each other.
But wait a minute: since alpha1 and alpha2 have a medium correlation with y but only a weak (to medium) correlation with each other, we can exploit the variance to produce an optimization function L that is a linear combination of alpha1 and alpha2, as follows:
Let m, n be two weights that maximize the correlation of the alpha1 and alpha2 features with y through the optimization function L:

L = m * alpha1 + n * alpha2
The optimal coefficients m and n that achieve the maximum correlation between L and y depend on the variances of alpha1, alpha2 and y.
From that, we can derive the following optimization solution:
m = [ Cov(b,c) * Cov(a,b) - Cov(a,c) * Var(b) ] / [ Cov(a,c) * Cov(a,b) - Cov(b,c) * Var(a) ] * n
where a, b and c correspond to alpha1, alpha2 and y respectively.
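For completeness, here is a short sketch of where that expression comes from (this derivation step is implicit in the answer). With L = m * a + n * b:

Cov(L, c) = m * Cov(a, c) + n * Cov(b, c)
Var(L) = m^2 * Var(a) + 2 * m * n * Cov(a, b) + n^2 * Var(b)

Setting d/dm [ Cov(L, c) / sqrt(Var(L)) ] = 0 (with n held fixed) and simplifying gives

m * [ Cov(a,c) * Cov(a,b) - Cov(b,c) * Var(a) ] = n * [ Cov(b,c) * Cov(a,b) - Cov(a,c) * Var(b) ]

which rearranges to the formula above.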
By choosing m or n to be either 1 or -1, we can find the optimal solution and engineer the new feature.
cov = df[['alpha1', 'alpha2', 'y']].cov()
# applying the optimization function: a = alpha1 , b = alpha2 and c = y
# note that cov of a feature with itself = variance
coef = (cov['alpha2']['y'] * cov['alpha1']['alpha2'] - cov['alpha1']['y'] * cov['alpha2']['alpha2']) / \
(cov['alpha1']['y'] * cov['alpha1']['alpha2'] - cov['alpha2']['y'] * cov['alpha1']['alpha1'])
# let n = 1 --> m = coef --> L = coef * alpha1 + alpha2 : which is the new feature to add
df['alpha12'] = coef * df['alpha1'] + df['alpha2']
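To check the improvement numerically rather than only on a heatmap, here is a quick comparison on the same df (corrwith is a standard pandas method):

# compare how the original and the engineered features correlate with y
print(df[['alpha1', 'alpha2', 'alpha12']].corrwith(df['y'], method='pearson'))
print(df[['alpha1', 'alpha2', 'alpha12']].corrwith(df['y'], method='spearman'))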
As you can see, there is a noticeable improvement in the correlation of the introduced alpha12.
Furthermore, regarding question 1: to decrease the correlation of theta with y, note that the correlation is given by:

Corr(theta, y) = Cov(theta, y) / [ sqrt(Var(theta)) * sqrt(Var(y)) ]
You can therefore decrease this correlation by increasing the variance of theta. To do so, simply sample n points from some distribution and add them to the corresponding rows as noise. Save this noise array for future use in case you need to get back to the original theta, something like this:
cov = df[['y', 'theta']].cov()
print("Theta Variance :: Before = {}".format(cov['theta']['theta']))
np.random.seed(2020) # add seed to make it reproducible for future undo
# create noise drawn from uniform distribution
noise = np.random.uniform(low=1.0, high=10., size=df.shape[0])
df['theta'] += noise # add noise to increase variance
cov = df[['y', 'theta']].cov()
print("Theta Variance :: After = {}".format(cov['theta']['theta']))
# to undo later and restore the original theta: df['theta'] -= noise
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="spearman"), annot=True)
plt.xticks(rotation = 90)
plt.title('Spearman Correlation Heatmap After Increasing the Variance of Theta\n')
plt.show()
plt.figure(figsize=(15, 15))
ax = sns.heatmap(df.corr(method="pearson"), annot=True)
plt.xticks(rotation = 90)
plt.title('Pearson Correlation Heatmap After Increasing the Variance of Theta\n')
plt.show()
Theta Variance :: Before = 0.3478030891329485
Theta Variance :: After = 7.552229545792681
Now alpha12 takes the lead and has the highest influence on the target variable y.
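To tie this back to the original goal in the question, one possible way to use the engineered feature is to refit the GP on [theta, alpha12] and rerun the fixed-theta minimization. A rough sketch under those assumptions (the value theta_fixed and the choice to restore the original theta first are illustrative, not from the original code):

# restore the original theta before refitting (see the noise comment above)
df['theta'] -= noise

from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C

X2 = df[['theta', 'alpha12']].values
y2 = df['y'].values

kernel = C(1.0, (1e-3, 1e3)) * RBF(10, (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=9,
                              normalize_y=True)
gp.fit(X2, y2)

theta_fixed = 10.4  # illustrative theta value to hold constant
bnds = [(theta_fixed, theta_fixed),
        (df['alpha12'].min(), df['alpha12'].max())]
x0 = [theta_fixed, df['alpha12'].min()]

res = minimize(lambda x: -gp.predict(np.array([x]))[0],
               x0, method='SLSQP', bounds=bnds)
print(res.x)  # [theta_fixed, best alpha12 for this theta]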