I am using Spark 2.2 with Python, and the PCA class from the ml.feature module. I use VectorAssembler to feed my features to PCA. To clarify, say I have a table with three columns col1, col2 and col3; then I do:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
df = assembler.transform(table).select("features")
from pyspark.ml.feature import PCA
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
At this point I have run PCA with 2 components and I can look at its values as:
m = model.pc.toArray()  # shape (3, 2); pc.values is column-major, so values.reshape(3, 2) would misplace entries
which corresponds to 3 (= number of columns in my original table) rows and 2 (= number of components in my PCA) columns. My question is: are the three rows here in the same order in which I specified my input columns to the VectorAssembler above? To clarify further, does the above matrix correspond to:
         | PC1 | PC2 |
---------|-----|-----|
col1     |     |     |
---------|-----|-----|
col2     |     |     |
---------|-----|-----|
col3     |     |     |
---------+-----+-----+
Note that the example here is only for clarity. In my real problem I am dealing with ~1600 columns and a bunch of selections. I could not find a definitive answer to this in the Spark documentation. I want to do this to pick the best columns / features from my original table to train my model, based on the top principal components. Or is there anything else / better in Spark ML PCA that I should be looking at to deduce such a result?
Or can I not use PCA for this, and do I have to use other techniques like Spearman ranking etc.?
Principal Component Analysis (PCA), generally called a data reduction technique, is a useful feature selection technique because it uses linear algebra to transform the dataset into a compressed form. We can implement PCA feature selection with the help of the PCA class of the scikit-learn Python library.
PCA is a dimensionality reduction technique that has four main parts: feature covariance, eigendecomposition, principal component transformation, and choosing components in terms of explained variance.
We can conclude that features 1, 3 and 4 are the most important for PC1. Similarly, feature 2 and then feature 1 are the most important for PC2. To sum up, we look at the absolute values of the eigenvectors' components corresponding to the k largest eigenvalues.
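As a rough illustration of the four parts named above and of ranking features by absolute loading, here is a minimal NumPy sketch. The toy matrix X and all variable names are invented for this example; it is plain linear algebra, not Spark's or scikit-learn's implementation.

```python
import numpy as np

# Hypothetical toy data: 6 samples, 3 features (values invented for illustration).
X = np.array([
    [2.0, 0.1, 1.0],
    [4.0, 0.2, 0.9],
    [6.0, 0.1, 1.1],
    [8.0, 0.3, 1.0],
    [10.0, 0.2, 0.8],
    [12.0, 0.1, 1.2],
])

# 1. Feature covariance (rowvar=False: each column is a feature).
cov = np.cov(X, rowvar=False)

# 2. Eigendecomposition; eigh handles symmetric matrices and returns
#    eigenvalues in ascending order, so re-sort them in descending order.
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3. Principal component transformation onto the top k components.
k = 2
components = eigvecs[:, :k]                      # shape (n_features, k)
projected = (X - X.mean(axis=0)) @ components    # shape (n_samples, k)

# 4. Share of total variance explained by each component.
explained = eigvals / eigvals.sum()

# Rank features per component by absolute loading:
# ranking[:, j] lists feature indices for component j, most important first.
ranking = np.argsort(-np.abs(components), axis=0)
print(explained[:k])
print(ranking)
```

In this toy data the first feature carries almost all of the variance, so it dominates the first component; that is a property of the invented numbers, not evidence that the feature is useful for a model.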
PCA is a valid method of feature selection only if the most important variables happen to be the ones with the most variation in them.
are the (...) rows here in the same order in which I had specified my input columns
Yes, they are. Let's trace what is going on:
from pyspark.ml.feature import PCA, VectorAssembler
data = [
(0.0, 1.0, 0.0, 7.0, 0.0), (2.0, 0.0, 3.0, 4.0, 5.0),
(4.0, 0.0, 0.0, 6.0, 7.0)
]
df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])
VectorAssembler follows the order of columns:
assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vectors = assembler.transform(df).select("features")
vectors.schema[0].metadata
# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
# {'idx': 1, 'name': 'v'},
# {'idx': 2, 'name': 'x'},
# {'idx': 3, 'name': 'y'},
# {'idx': 4, 'name': 'z'}]},
# 'num_attrs': 5}}
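With ~1600 columns it may be handy to pull the names out of that metadata in index order. The sketch below works on a dict copied from the output above; feature_names is a hypothetical helper, and the assumption that other attribute groups (for example "binary" or "nominal") can sit next to "numeric" should be checked against your own schema.

```python
# Metadata dict copied from the output above; with a live DataFrame it would
# come from vectors.schema[0].metadata.
metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [
                {"idx": 0, "name": "u"},
                {"idx": 1, "name": "v"},
                {"idx": 2, "name": "x"},
                {"idx": 3, "name": "y"},
                {"idx": 4, "name": "z"},
            ]
        },
        "num_attrs": 5,
    }
}


def feature_names(metadata):
    """Return feature names ordered by their index in the assembled vector."""
    attrs = metadata["ml_attr"]["attrs"]
    # Assumption: attrs may hold several groups; merge them before sorting.
    merged = [attr for group in attrs.values() for attr in group]
    return [attr["name"] for attr in sorted(merged, key=lambda attr: attr["idx"])]


names = feature_names(metadata)
print(names)
```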
So do the principal components:
model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)
?model.pc
# Type: property
# String form: <property object at 0x7feb5bdc1d68>
# Docstring:
# Returns a principal components Matrix.
# Each column is one principal component.
#
# .. versionadded:: 2.0.0
Finally, a sanity check:
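import numpy as np
x = np.array(data)
# pc.values is column-major: reshape to (k, n_features), then transpose to get the 5 x 3 matrix
y = model.pc.values.reshape(3, 5).transpose()
z = np.array(model.transform(vectors).rdd.map(lambda row: row.pc_features).collect())
# projecting the raw rows by hand reproduces Spark's output, so rows of y follow the input column order
np.linalg.norm(x.dot(y) - z)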
# 8.881784197001252e-16
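Since the rows follow the assembler's column order, loadings can be mapped back to column names. The sketch below uses an invented 5 x 3 loading matrix as a stand-in for real output (the numbers are not from the model above) and flattens it column-major to mimic the layout of model.pc.values. With a fitted model, model.pc.toArray() should give the same n x k array directly.

```python
import numpy as np

names = ["u", "v", "x", "y", "z"]   # order passed to VectorAssembler
n, k = len(names), 3

# Invented loadings (rows = features, columns = components), not real PCA output.
loadings_true = np.array([
    [ 0.1, -0.7,  0.2],
    [-0.2,  0.1,  0.9],
    [ 0.3,  0.0, -0.1],
    [-0.9,  0.2,  0.3],
    [ 0.2,  0.6, -0.2],
])

# Stand-in for model.pc.values: a flat array in column-major order.
values = loadings_true.flatten(order="F")

# Recover the n x k matrix; equivalent to values.reshape((n, k), order="F").
loadings = values.reshape(k, n).T

# For each component, list the two columns with the largest absolute loading.
top_features = {
    f"PC{j + 1}": [names[i] for i in np.argsort(-np.abs(loadings[:, j]))[:2]]
    for j in range(k)
}
print(top_features)
```

Ranking by absolute loading only says which columns contribute most to each direction of variance; whether those columns are the best ones for training a model is a separate question.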