I am working on a machine learning model with a dataset of shape 1,456,354 x 53. I want to do feature selection on this dataset. I know how to do feature selection in Python using the following code.
from sklearn.feature_selection import RFECV, RFE
from sklearn.linear_model import LogisticRegression
import numpy as np

# df holds the feature columns, arrythmia holds the target labels
logreg = LogisticRegression()
rfe = RFE(logreg, step=1, n_features_to_select=28)
rfe = rfe.fit(df.values, arrythmia.values)

features_bool = np.array(rfe.support_)
features = np.array(df.columns)
result = features[features_bool]
print(result)
However, I could not find any article showing how to perform recursive feature elimination in PySpark.

I tried to import the sklearn libraries in PySpark, but it gave me the error "sklearn module not found". I am running PySpark on a Google Dataproc cluster.

Could someone please help me achieve this in PySpark?
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
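A minimal sketch of VectorAssembler usage, assuming a Spark DataFrame df with hypothetical numeric columns "age" and "weight":

from pyspark.ml.feature import VectorAssembler

# combine the raw numeric columns into a single vector column called "features"
assembler = VectorAssembler(inputCols=["age", "weight"], outputCol="features")
assembled = assembler.transform(df)
assembled.select("features").show(truncate=False)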
ChiSqSelector refers to Chi-Squared feature selection. It operates on labeled data with categorical features and uses the Chi-Squared test of independence to decide which features to choose. It supports five selection methods: numTopFeatures, percentile, fpr, fdr, and fwe.
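A minimal sketch of ChiSqSelector, assuming the assembled DataFrame from above already has a "features" vector column and a numeric "label" column (hypothetical names):

from pyspark.ml.feature import ChiSqSelector

# keep the 28 features with the strongest Chi-Squared association with the label
selector = ChiSqSelector(numTopFeatures=28, featuresCol="features",
                         outputCol="selectedFeatures", labelCol="label")
selected = selector.fit(assembled).transform(assembled)
selected.select("selectedFeatures").show(truncate=False)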
StringIndexer is a label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, it is cast to string and the string values are indexed. The indices are in [0, numLabels). By default, this is ordered by label frequencies, so the most frequent label gets index 0.
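A minimal sketch of StringIndexer, assuming a hypothetical string label column "category":

from pyspark.ml.feature import StringIndexer

# map each distinct string label to a numeric index, most frequent label first
indexer = StringIndexer(inputCol="category", outputCol="label")
indexed = indexer.fit(df).transform(df)
indexed.select("category", "label").show()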
withColumn(colName, col): Returns a new DataFrame by adding a column or replacing the existing column that has the same name. The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error. New in version 1.3.
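A short sketch of withColumn, assuming a hypothetical numeric column "age":

from pyspark.sql import functions as F

df2 = df.withColumn("age_plus_one", F.col("age") + 1)  # add a new column
df3 = df2.withColumn("age", F.col("age") * 2)          # replace an existing column
df3.show()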
You have a few options for doing this.
If the model you need is implemented in either Spark's MLlib or spark-sklearn, you can adapt your code to use the corresponding library.
If you can train your model locally and just want to deploy it to make predictions, you can use User Defined Functions (UDFs) or vectorized UDFs to run the trained model on Spark. Here's a good post discussing how to do this.
If you need to run an sklearn model on Spark that is not supported by spark-sklearn, you'll need to make sklearn available to Spark on each worker node in your cluster. You can do this by manually installing sklearn on each node in your Spark cluster (make sure you are installing into the Python environment that Spark is using).
Alternatively, you can package and distribute the sklearn library with the PySpark job. In short, you can pip install sklearn into a local directory near your script, then zip the sklearn installation directory and use the --py-files flag of spark-submit to send the zipped sklearn to all workers along with your script. This article has a complete overview of how to accomplish this.
Putting the pieces described above together, we can perform feature selection natively in PySpark with StringIndexer, VectorAssembler, and ChiSqSelector, as sketched below.
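A minimal end-to-end sketch, assuming hypothetical numeric feature columns "f1", "f2", "f3" and a string label column "arrythmia" (mirroring the target from the question) in a Spark DataFrame df:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, ChiSqSelector

feature_cols = ["f1", "f2", "f3"]  # hypothetical feature column names

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="arrythmia", outputCol="label"),
    VectorAssembler(inputCols=feature_cols, outputCol="features"),
    ChiSqSelector(numTopFeatures=2, featuresCol="features",  # 28 in the original question
                  outputCol="selectedFeatures", labelCol="label"),
])

model = pipeline.fit(df)
model.transform(df).select("selectedFeatures", "label").show(truncate=False)

# map the selected feature indices back to column names
chosen = [feature_cols[i] for i in model.stages[-1].selectedFeatures]
print(chosen)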