Feature Selection in PySpark

I am working on a machine learning model with a data set of shape 1,456,354 x 53. I want to do feature selection on this data set. I know how to do feature selection in Python with scikit-learn using the following code.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import numpy as np

# df holds the feature columns; arrythmia holds the labels
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)
features_bool = np.array(rfe.support_)  # boolean mask of selected features
features = np.array(df.columns)
result = features[features_bool]        # names of the selected features
print(result)

However, I could not find any article showing how to perform recursive feature elimination in PySpark.

I tried to import the sklearn libraries in PySpark, but it gave me an error: sklearn module not found. I am running PySpark on a Google Dataproc cluster.

Could someone please help me achieve this in PySpark?

Tushar Mehta asked Nov 28 '18 21:11

People also ask

Why do we use VectorAssembler in PySpark?

VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.

What is ChiSqSelector?

ChiSqSelector refers to Chi-Squared feature selection. It operates on labeled data with categorical features, using the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, and fwe.

What is Stringindexer PySpark?

A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies so the most frequent label gets index 0.

What is withColumn in PySpark?

withColumn(colName, col) returns a new DataFrame by adding a column or replacing an existing column that has the same name. The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error. New in version 1.3.


2 Answers

You have a few options for doing this.

  • If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.

  • If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.

  • If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).

  • Alternatively, you can package and distribute the sklearn library with the PySpark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
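A rough sketch of that packaging step (the paths and script name are assumptions, not from the original post):

```shell
# Hedged sketch: bundle scikit-learn next to the job script and ship it
# to the workers with --py-files. Paths and names are illustrative.
pip install scikit-learn -t ./deps          # install into a local directory
cd deps && zip -r ../deps.zip . && cd ..    # zip the installed packages
spark-submit --py-files deps.zip my_job.py  # distribute along with the script
```

One caveat: libraries with compiled C extensions (as sklearn has, via numpy/scipy) may not import cleanly from a zip archive, so installing directly on each worker node is often the more reliable route.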

Jerry Ding answered Sep 18 '22 02:09


We can try the following feature selection methods in PySpark:

  • Chi-Squared selector
  • Random forest selector (ranking features by importance)

References:

  • https://spark.apache.org/docs/2.2.0/ml-features.html#feature-selectors
  • https://databricks.com/session/building-custom-ml-pipelinestages-for-feature-selection
Hari Baskar answered Sep 20 '22 02:09