How to merge multiple feature vectors in DataFrame?

Using Spark ML transformers I arrived at a DataFrame where each row looks like this:

Row(object_id, text_features, color_features, type_features)

where text_features is a sparse vector of term weights, color_features is a small 20-element dense vector of one-hot encoded colors, and type_features is also a dense vector of one-hot encoded types.

What would be a good approach (using Spark's facilities) to merge these features into a single large vector, so that I can measure things like the cosine distance between any two objects?

asked Oct 22 '15 05:10

Felipe

1 Answer

You can use VectorAssembler:

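import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Placeholder: the DataFrame that holds the three feature columns.
val df: DataFrame = ???

// Concatenate the input vector columns, in the order listed, into one vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("text_features", "color_features", "type_features"))
  .setOutputCol("features")

// Adds the "features" column; the original columns are kept.
val transformed = assembler.transform(df)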
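To try this end to end, here is a minimal sketch with a toy DataFrame. It assumes Spark 2.0 or later (where ML vectors live in org.apache.spark.ml.linalg) and a local session; the object name AssembleDemo, the helper toyData, and all the values are made up for illustration, and the vectors are much shorter than the ones in the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical demo object; names and data are illustrative only.
object AssembleDemo {
  // Same assembler configuration as in the answer above.
  val assembler: VectorAssembler = new VectorAssembler()
    .setInputCols(Array("text_features", "color_features", "type_features"))
    .setOutputCol("features")

  // Two toy rows: a sparse text vector (size 5) plus two small dense one-hot vectors.
  def toyData(spark: SparkSession): DataFrame =
    spark.createDataFrame(Seq(
      (1L, Vectors.sparse(5, Array(0, 3), Array(0.5, 1.2)),
        Vectors.dense(1.0, 0.0, 0.0), Vectors.dense(0.0, 1.0)),
      (2L, Vectors.sparse(5, Array(1, 3), Array(0.7, 0.4)),
        Vectors.dense(0.0, 1.0, 0.0), Vectors.dense(0.0, 1.0))
    )).toDF("object_id", "text_features", "color_features", "type_features")

  def main(args: Array[String]): Unit = {
    // Local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("assemble-demo")
      .getOrCreate()

    // Each assembled vector has 5 + 3 + 2 = 10 entries.
    val assembled = assembler.transform(toyData(spark))
    assembled.select("object_id", "features").show(truncate = false)

    spark.stop()
  }
}
```

The input columns are concatenated in the order given to setInputCols, so the text weights come first, followed by the color and type indicators.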

For a PySpark example, see: Encode and assemble multiple features in PySpark
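Since the question mentions cosine distance, here is one possible helper that works on two values taken from the assembled "features" column. It is only a sketch: it assumes Spark 2.0 or later, the names CosineDemo and cosineSimilarity are hypothetical, and it converts both vectors to dense arrays for simplicity, which may be too costly for very wide sparse vectors.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical helper object; not part of Spark.
object CosineDemo {
  // Cosine similarity of two equal-length vectors.
  // Returns 0.0 when either vector has zero norm (a convention chosen here).
  def cosineSimilarity(a: Vector, b: Vector): Double = {
    require(a.size == b.size, "vectors must have the same length")
    // Dense dot product; simple, but it materializes both vectors as arrays.
    val dot = a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum
    val denom = Vectors.norm(a, 2.0) * Vectors.norm(b, 2.0)
    if (denom == 0.0) 0.0 else dot / denom
  }

  // Cosine distance is defined here as 1 - similarity.
  def cosineDistance(a: Vector, b: Vector): Double =
    1.0 - cosineSimilarity(a, b)

  def main(args: Array[String]): Unit = {
    val u = Vectors.dense(1.0, 0.0, 2.0)
    val v = Vectors.sparse(3, Array(0, 2), Array(1.0, 2.0))
    println(cosineDistance(u, v))
  }
}
```

To compare rows of a DataFrame, a function like this could be wrapped in a UDF and applied to pairs of "features" values, for example after a self-join on the assembled DataFrame.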

answered Sep 28 '22 05:09

zero323

Tags: machine-learning, apache-spark, apache-spark-sql, apache-spark-ml
