
python spark: narrowing down most relevant features using PCA

I am using Spark 2.2 with Python, with PCA from the ml.feature module and a VectorAssembler feeding my features into it. To clarify, say I have a table with three columns col1, col2, and col3; then I do:

from pyspark.ml.feature import PCA, VectorAssembler

# Assemble all table columns into a single "features" vector column
assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
df = assembler.transform(table).select("features")

# Fit a 2-component PCA on the assembled features
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

At this point I have run PCA with 2 components and can look at its loadings as:

# toArray() yields the correctly oriented matrix; pc stores its values
# column-major, so a plain values.reshape(3, 2) would scramble the layout
m = model.pc.toArray()

which corresponds to 3 (= the number of columns in my original table) rows and 2 (= the number of components in my PCA) columns. My question is: are the three rows here in the same order in which I specified my input columns to the VectorAssembler above? To clarify further, does the above matrix correspond to:

          | PC1 | PC2 |
 ---------|-----|-----|
    col1  |     |     |
    col2  |     |     |
    col3  |     |     |

Note that the example here is only for clarity. In my real problem I am dealing with ~1600 columns and a bunch of selections. I could not find any definitive answer to this in the Spark documentation. I want to use this to pick the best columns/features from my original table for training my model, based on the top principal components. Or is there anything else/better in Spark ML PCA that I should be looking at to deduce such a result?

Or can I not use PCA for this at all, and do I have to use other techniques like Spearman ranking instead?

Sameer Mahajan asked Jan 30 '18 16:01



1 Answer

are the (...) rows here in the same order in which I had specified my input columns

Yes, they are. Let's trace what is going on:

from pyspark.ml.feature import PCA, VectorAssembler

data = [
    (0.0, 1.0, 0.0, 7.0, 0.0), (2.0, 0.0, 3.0, 4.0, 5.0), 
    (4.0, 0.0, 0.0, 6.0, 7.0)
]

df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])

VectorAssembler follows the order of the input columns:

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vectors = assembler.transform(df).select("features")

vectors.schema[0].metadata
# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
#     {'idx': 1, 'name': 'v'},
#     {'idx': 2, 'name': 'x'},
#     {'idx': 3, 'name': 'y'},
#     {'idx': 4, 'name': 'z'}]},
#   'num_attrs': 5}}
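
For a wide table (such as the ~1600 columns in the question), this metadata lets you recover the ordered feature names programmatically rather than by eye. A minimal sketch, assuming all assembled columns are numeric as above:

attrs = vectors.schema[0].metadata["ml_attr"]["attrs"]["numeric"]
# Sort by the stored index to get names in assembler order
feature_names = [a["name"] for a in sorted(attrs, key=lambda a: a["idx"])]
# ['u', 'v', 'x', 'y', 'z']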

The principal components follow the same order:

model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)

?model.pc
# Type:        property
# String form: <property object at 0x7feb5bdc1d68>
# Docstring:  
# Returns a principal components Matrix.
# Each column is one principal component.
# 
# .. versionadded:: 2.0.0

Finally, a sanity check:

import numpy as np

x = np.array(data)
# pc values are stored column-major, so reshape + transpose recovers the
# 5 x 3 (features x components) loadings matrix, same as model.pc.toArray()
y = model.pc.values.reshape(3, 5).transpose()
z = np.array(model.transform(vectors).rdd.map(lambda row: row.pc_features).collect())

# Projecting the raw data onto the loadings reproduces Spark's pc_features
np.linalg.norm(x.dot(y) - z)
# 8.881784197001252e-16
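
And to decide how many components are worth keeping, PCAModel also exposes explainedVariance (since Spark 2.0), a vector with the proportion of variance captured by each component:

model.explainedVariance
# DenseVector with one entry per component, largest first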
Alper t. Turker answered Sep 22 '22 15:09