I have a classification problem, i.e. I want to predict a binary target from a collection of numerical features, using logistic regression after running a Principal Component Analysis (PCA).
I have 2 datasets, df_train
and df_valid
(training set and validation set respectively), as pandas DataFrames containing the features and the target. As a first step, I used the pandas get_dummies
function to convert all the categorical variables to booleans.
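To make that step concrete, here is a minimal sketch of the encoding (the df_raw and color names are made up for illustration, not part of my real data):
import pandas as pd

df_raw = pd.DataFrame({"color": ["red", "blue", "red"]})
encoded = pd.get_dummies(df_raw, columns=["color"], dtype=bool)  # one boolean column per category
For example, after encoding, my data would look like this: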
import numpy as np
import pandas as pd

n_train = 10
np.random.seed(0)
df_train = pd.DataFrame({"f1": np.random.random(n_train),
                         "f2": np.random.random(n_train),
                         "f3": np.random.randint(0, 2, n_train).astype(bool),
                         "target": np.random.randint(0, 2, n_train).astype(bool)})
In [36]: df_train
Out[36]:
f1 f2 f3 target
0 0.548814 0.791725 False False
1 0.715189 0.528895 True True
2 0.602763 0.568045 False True
3 0.544883 0.925597 True True
4 0.423655 0.071036 True True
5 0.645894 0.087129 True False
6 0.437587 0.020218 True True
7 0.891773 0.832620 True False
8 0.963663 0.778157 False False
9 0.383442 0.870012 True True
n_valid = 3
np.random.seed(1)
df_valid = pd.DataFrame({"f1": np.random.random(n_valid),
                         "f2": np.random.random(n_valid),
                         "f3": np.random.randint(0, 2, n_valid).astype(bool),
                         "target": np.random.randint(0, 2, n_valid).astype(bool)})
In [44]: df_valid
Out[44]:
f1 f2 f3 target
0 0.417022 0.302333 False False
1 0.720324 0.146756 True False
2 0.000114 0.092339 True True
I would now like to apply PCA to reduce the dimensionality of my problem, then use LogisticRegression
from sklearn to train and get predictions on my validation set, but I'm not sure the procedure I follow is correct. Here is what I do:
The idea is that I need to transform both my training and validation set the same way with PCA. In other words, I cannot fit a separate PCA on each set; otherwise, they would be projected onto different eigenvectors.
from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # keep 2 components for illustration; the exact number doesn't matter here
# fit the PCA on the training features only, then project them
newdf_train = pca.fit_transform(df_train.drop("target", axis=1))
# reuse the components learned on the training set to project the validation set
newdf_valid = pca.transform(df_valid.drop("target", axis=1))  # not sure if this is right
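As a sanity check, I can also look at how much variance the two retained components capture (explained_variance_ratio_ is an attribute of a fitted PCA object):
print(pca.explained_variance_ratio_)        # fraction of variance captured by each component
print(pca.explained_variance_ratio_.sum())  # total fraction kept with 2 components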
It's not necessary, but I prefer to keep things as DataFrames:
features_train = pd.DataFrame(newdf_train)
features_valid = pd.DataFrame(newdf_valid)
And now I perform the logistic regression:
from sklearn.linear_model import LogisticRegression
cls = LogisticRegression()
cls.fit(features_train, df_train["target"])
predictions = cls.predict(features_valid)
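Since df_valid also contains the true target, I can then score the predictions, e.g. with accuracy_score from sklearn (just one possible metric):
from sklearn.metrics import accuracy_score

print(accuracy_score(df_valid["target"], predictions))  # fraction of correct predictions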
I think step 2 is correct, but I have more doubts about step 1: is this the way I'm supposed to chain PCA and a classifier?
Yes, that is the correct procedure: fit the PCA on the training set only, then apply that same fitted transformation to the validation set. A side benefit is that the principal components are orthogonal, so PCA effectively eliminates multicollinearity between features before the classifier is fit.
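You can verify this on the toy data above (a quick sketch: the correlation matrix of the PCA-transformed training features is essentially the identity):
import numpy as np

# off-diagonal entries are ~0 up to floating point: the components are uncorrelated
print(np.corrcoef(newdf_train.T))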
As an aside, there is also "supervised principal component analysis (supervised PCA)", a generalization of PCA designed for regression and classification problems with high-dimensional input data. It works by estimating a sequence of principal components that have maximal dependence on the response variable.
Plain PCA will NOT consider the response variable, only the variance of the independent variables. Logistic regression then considers how each independent variable (here, each retained component) impacts the response variable.
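In the fitted model from step 2 above, you can inspect exactly that: coef_ is a standard attribute of a fitted LogisticRegression, holding one log-odds weight per retained component:
print(cls.coef_)  # shape (1, 2): log-odds weight of each principal component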
There's a Pipeline in sklearn for exactly this purpose: it chains the steps so that the PCA fitted during fit is automatically reused during predict.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pca = PCA(n_components=2)
clf = LogisticRegression()
pipe = Pipeline([('pca', pca), ('logistic', clf)])

# the pipeline takes the raw features: it fits the PCA on the training set,
# transforms, and fits the classifier, all in one call
pipe.fit(df_train.drop("target", axis=1), df_train["target"])
# at prediction time the already-fitted PCA is applied to the validation set
predictions = pipe.predict(df_valid.drop("target", axis=1))
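A practical bonus of the pipeline is that you can cross-validate the number of components, e.g. with GridSearchCV (a sketch; the grid values here are arbitrary):
from sklearn.model_selection import GridSearchCV

# 'pca__n_components' addresses the n_components parameter of the 'pca' step
param_grid = {'pca__n_components': [1, 2, 3]}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(df_train.drop("target", axis=1), df_train["target"])
print(search.best_params_)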