I'm wondering if there is a concise way to run ML (e.g. KMeans) on a DataFrame in PySpark when I have the features in multiple numeric columns, i.e. as in the Iris dataset:
(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)
I'd like to use KMeans without recreating the dataset by manually adding the feature vector as a new column, with the original columns hardcoded repeatedly in the code.
The solution I'd like to improve:
```python
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import Row
from pyspark.ml.clustering import KMeans, KMeansModel

iris = sqlContext.read.parquet("/opt/data/iris.parquet")
iris.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, id=u'id_1', label=u'Iris-setosa', binomial_label=1)

df = iris.map(lambda r: Row(
    id=r.id,
    a1=r.a1,
    a2=r.a2,
    a3=r.a3,
    a4=r.a4,
    label=r.label,
    binomial_label=r.binomial_label,
    features=Vectors.dense(r.a1, r.a2, r.a3, r.a4))
).toDF()

kmeans_estimator = KMeans()\
    .setFeaturesCol("features")\
    .setPredictionCol("prediction")

kmeans_transformer = kmeans_estimator.fit(df)
predicted_df = kmeans_transformer.transform(df).drop("features")
predicted_df.first()
# Row(a1=5.1, a2=3.5, a3=1.4, a4=0.2, binomial_label=1, id=u'id_1', label=u'Iris-setosa', prediction=1)
```
I'm looking for a solution, which is something like:
```python
feature_cols = ["a1", "a2", "a3", "a4"]
prediction_col_name = "prediction"
<DataFrame-independent code for KMeans>
<A new DataFrame is created, extended with the `prediction` column.>
```
You can use VectorAssembler:
```python
from pyspark.ml.feature import VectorAssembler

ignore = ['id', 'label', 'binomial_label']
assembler = VectorAssembler(
    inputCols=[x for x in df.columns if x not in ignore],
    outputCol='features')

assembler.transform(df)
It can be combined with k-means using an ML Pipeline:

```python
from pyspark.ml import Pipeline

pipeline = Pipeline(stages=[assembler, kmeans_estimator])
model = pipeline.fit(df)
```