Using Spark ML transformers, I arrived at a DataFrame where each row looks like this:
Row(object_id, text_features_vector, color_features, type_features)
where text_features is a sparse vector of term weights, color_features is a small 20-element one-hot-encoded dense vector of colors, and type_features is also a one-hot-encoded dense vector of types.
What would be a good approach (using Spark's facilities) to merge these features into one single, large array, so that I can measure things like the cosine distance between any two objects?
A simple way is to flatten the output image matrix into a vector, and then you can combine this vector and the other extracted features.
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
You can use VectorAssembler:
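import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Placeholder: the DataFrame produced by the earlier feature transformers.
val df: DataFrame = ???

// Concatenate the listed columns, in order, into one vector column.
// The input names must match the column names actually present in df.
val assembler = new VectorAssembler()
  .setInputCols(Array("text_features", "color_features", "type_features"))
  .setOutputCol("features")

val transformed = assembler.transform(df)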
For a PySpark example, see: Encode and assemble multiple features in PySpark
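Once the features are assembled into one vector column, the cosine distance from the question can be computed on pairs of those vectors. The following is only a minimal sketch, not part of the answers above: cosineDistance is a hypothetical helper name, and returning 1.0 when either vector is all zeros is an assumption (the cosine is undefined in that case). It uses only the public org.apache.spark.ml.linalg API and walks the non-zero entries of the first vector, so a sparse text vector does not have to be densified.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical helper: cosine distance = 1 - cosine similarity.
def cosineDistance(a: Vector, b: Vector): Double = {
  require(a.size == b.size, "vectors must have the same length")
  val normA = Vectors.norm(a, 2.0)
  val normB = Vectors.norm(b, 2.0)
  if (normA == 0.0 || normB == 0.0) {
    // Assumption: treat an all-zero vector as maximally distant.
    1.0
  } else {
    var dot = 0.0
    // Visit only the stored (active) entries of a; b(i) looks up the match.
    a.foreachActive { (i, v) => dot += v * b(i) }
    1.0 - dot / (normA * normB)
  }
}

// Sparse and dense vectors can be mixed freely.
val a = Vectors.sparse(5, Array(0, 3), Array(1.0, 2.0))
val b = Vectors.dense(1.0, 0.0, 0.0, 2.0, 0.0)
println(cosineDistance(a, b)) // same direction, so approximately 0
```

To apply this to the assembled DataFrame, one option is to read the "features" column of two rows with row.getAs[Vector]("features") and pass both vectors to the helper. For all-pairs comparisons on a large dataset, a distributed approach would likely be needed instead of collecting rows to the driver.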