Feature Selection in PySpark

I am working on a machine learning model with a data set of shape 1,456,354 x 53. I want to do feature selection on this data set. I know how to do feature selection in Python with scikit-learn using the following code.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import numpy as np

# df holds the feature columns; arrythmia holds the labels
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)
features_bool = np.array(rfe.support_)  # boolean mask of selected features
features = np.array(df.columns)
result = features[features_bool]        # names of the selected features
print(result)

However, I could not find any article showing how to perform recursive feature elimination in PySpark.

I tried to import the sklearn libraries in PySpark, but it gave me an error: sklearn module not found. I am running PySpark on a Google Dataproc cluster.

Could someone please help me achieve this in PySpark?

Tushar Mehta asked Nov 28 '18 21:11

People also ask

Why do we use VectorAssembler in PySpark?

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.

What is ChiSqSelector?

ChiSqSelector refers to Chi-Squared feature selection. It operates on labeled data with categorical features, using the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, and fwe.

What is Stringindexer PySpark?

A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0.

What is withColumn in PySpark?

withColumn(colName, col) returns a new DataFrame by adding a column or replacing an existing column that has the same name. The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error. New in version 1.3.


2 Answers

You have a few options for doing this.

  • If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.

  • If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.

  • If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).

  • Alternatively, you can package and distribute the sklearn library with the PySpark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
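A rough sketch of that packaging step (the paths and script name are assumptions, not from the original post):

```shell
# Hedged sketch: bundle scikit-learn next to the job script and ship it
# to the workers with --py-files. Paths and names are illustrative.
pip install scikit-learn -t ./deps          # install into a local directory
cd deps && zip -r ../deps.zip . && cd ..    # zip the installed packages
spark-submit --py-files deps.zip my_job.py  # distribute along with the script
```

One caveat: libraries with compiled C extensions (as sklearn has, via numpy/scipy) may not import cleanly from a zip archive, so installing directly on each worker node is often the more reliable route.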

Jerry Ding answered Sep 18 '22 02:09


We can try the following feature selection methods in PySpark:

  • Chi-Squared selector
  • Random forest selector (ranking features by importance)

References:

  • https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
  • https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
Hari Baskar answered Sep 20 '22 02:09