python spark: narrowing down most relevant features using PCA

I am using Spark 2.2 with Python. I use PCA from the ml.feature module and VectorAssembler to feed my features to PCA. To clarify, say I have a table with three columns, col1, col2 and col3; then I do:

from pyspark.ml.feature import PCA, VectorAssembler

assembler = VectorAssembler(inputCols=table.columns, outputCol="features")
df = assembler.transform(table).select("features")

pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)

At this point I have run PCA with 2 components, and I can look at its values as:

m = model.pc.values.reshape(3, 2)

which corresponds to 3 rows (the number of columns in my original table) and 2 columns (the number of components in my PCA). My question: are the three rows here in the same order in which I specified my input columns to the VectorAssembler above? In other words, does the matrix above correspond to:

          | PC1 | PC2 |
 ---------|-----|-----|
    col1  |     |     |
 ---------|-----|-----|
    col2  |     |     |
 ---------|-----|-----|
    col3  |     |     |
 ---------+-----+-----+

Note that this example is only for clarity. In my real problem I am dealing with ~1600 columns and a bunch of selections. I could not find a definitive answer to this in the Spark documentation. I want to use this to pick the best columns / features from my original table, based on the top principal components, to train my model. Or is there something else / better in Spark ML PCA that I should be looking at to deduce such a result?

Or can I not use PCA for this, and do I have to use other techniques such as Spearman ranking?

asked Jan 30 '18 16:01

Sameer Mahajan

1 Answer

are the (...) rows here in the same order in which I had specified my input columns

Yes, they are. Let's trace what is going on:

from pyspark.ml.feature import PCA, VectorAssembler

data = [
    (0.0, 1.0, 0.0, 7.0, 0.0), (2.0, 0.0, 3.0, 4.0, 5.0), 
    (4.0, 0.0, 0.0, 6.0, 7.0)
]

df = spark.createDataFrame(data, ["u", "v", "x", "y", "z"])

VectorAssembler follows the order of the columns, as the metadata of the assembled column shows:

assembler = VectorAssembler(inputCols=df.columns, outputCol="features")
vectors = assembler.transform(df).select("features")

vectors.schema[0].metadata
# {'ml_attr': {'attrs': {'numeric': [{'idx': 0, 'name': 'u'},
#     {'idx': 1, 'name': 'v'},
#     {'idx': 2, 'name': 'x'},
#     {'idx': 3, 'name': 'y'},
#     {'idx': 4, 'name': 'z'}]},
#   'num_attrs': 5}}
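
To turn that metadata into an ordered list of column names, something like the following could work. It is a minimal sketch: the helper name `feature_names_from_metadata` is hypothetical, and it assumes the metadata has the shape shown above (attributes grouped by type under `ml_attr.attrs`, each with an `idx` and a `name`). It runs on a plain dict, so no Spark session is needed to try it.

```python
def feature_names_from_metadata(metadata):
    """Return feature names ordered by their index in the assembled vector.

    Hypothetical helper; assumes the 'ml_attr' layout printed above.
    """
    attrs = metadata["ml_attr"]["attrs"]
    # Attributes are grouped by type (here only "numeric"); flatten every group.
    flat = [attr for group in attrs.values() for attr in group]
    # Sort by vector index so position i in the result is feature i.
    return [attr["name"] for attr in sorted(flat, key=lambda a: a["idx"])]


# Metadata copied from the output above.
metadata = {"ml_attr": {"attrs": {"numeric": [
    {"idx": 0, "name": "u"}, {"idx": 1, "name": "v"}, {"idx": 2, "name": "x"},
    {"idx": 3, "name": "y"}, {"idx": 4, "name": "z"}]}, "num_attrs": 5}}

names = feature_names_from_metadata(metadata)
print(names)
```

With a real DataFrame the dict would come from `vectors.schema[0].metadata`, as in the snippet above.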

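The principal components follow the same order: each row of model.pc corresponds to one input feature, and each column is one principal component: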

model = PCA(inputCol="features", outputCol="pc_features", k=3).fit(vectors)

?model.pc
# Type:        property
# String form: <property object at 0x7feb5bdc1d68>
# Docstring:  
# Returns a principal components Matrix.
# Each column is one principal component.
# 
# .. versionadded:: 2.0.0
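
One detail matters when reading this matrix through `values`: a Spark DenseMatrix keeps its entries in a flat, column-major buffer, so a plain `reshape(n_features, k)` reads them row by row and scrambles the layout. The NumPy-only sketch below illustrates this with made-up numbers (the `loadings` array is hypothetical, not Spark output). In practice `model.pc.toArray()` should give the correctly shaped array directly.

```python
import numpy as np

n_features, k = 3, 2

# Hypothetical loadings, laid out the way we want to read them:
# one row per input feature, one column per principal component.
loadings = np.array([[0.1, 0.2],
                     [0.3, 0.4],
                     [0.5, 0.6]])

# Flat column-major buffer, mimicking how a DenseMatrix stores its values.
values = loadings.flatten(order="F")

# Row-major read of a column-major buffer: entries end up in the wrong cells.
wrong = values.reshape(n_features, k)

# Two equivalent ways to recover the intended layout.
right = values.reshape(k, n_features).transpose()
also_right = values.reshape((n_features, k), order="F")

print(np.array_equal(wrong, loadings))
print(np.array_equal(right, loadings))
print(np.array_equal(also_right, loadings))
```

Only the last two comparisons hold, which is why the sanity check uses `reshape(k, n_features).transpose()`.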

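Finally, a sanity check: multiplying the raw data by the components matrix should reproduce the output of model.transform: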

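import numpy as np

x = np.array(data)
# model.pc.values is a flat column-major buffer, hence reshape(k, n).transpose()
y = model.pc.values.reshape(3, 5).transpose()
z = np.array(model.transform(vectors).rdd.map(lambda row: row.pc_features).collect())

# The difference between the manual projection and Spark's output is ~0
np.linalg.norm(x.dot(y) - z)
# 8.881784197001252e-16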
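
Since the rows line up with the input columns, the loadings can be used to rank the original features, which is what the question is ultimately after. The sketch below shows one common heuristic, not something built into Spark: score each feature by its absolute loadings, weighted by each component's share of explained variance. The function name `rank_features` and all the numbers are hypothetical. With a fitted model, the inputs would presumably come from `model.pc.toArray()` and `model.explainedVariance.toArray()`.

```python
import numpy as np


def rank_features(loadings, explained_variance, names):
    """Rank features by absolute loading, weighted per component.

    Hypothetical helper illustrating one heuristic; loadings has shape
    (n_features, k) and explained_variance has length k.
    """
    weights = np.asarray(explained_variance, dtype=float)
    # Weighted sum of absolute loadings across components, one score per feature.
    scores = np.abs(np.asarray(loadings, dtype=float)).dot(weights)
    # Highest score first.
    order = np.argsort(scores)[::-1]
    return [(names[i], float(scores[i])) for i in order]


# Made-up numbers purely for illustration (not output from Spark).
loadings = np.array([[0.9, 0.1],
                     [0.1, 0.8],
                     [0.2, 0.3]])
explained_variance = [0.7, 0.2]

ranking = rank_features(loadings, explained_variance, ["col1", "col2", "col3"])
for name, score in ranking:
    print(name, round(score, 2))
```

Treat the result as a rough guide only: PCA is sensitive to feature scale, so unscaled inputs can dominate the components, and a large loading does not guarantee that a feature is useful for the downstream model.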

answered Sep 22 '22 15:09

Alper t. Turker

Tags: machine-learning, apache-spark, pyspark, feature-selection, pca
