Context: I have a data frame where all categorical values have been indexed using StringIndexer.
val categoricalColumns = df.schema.collect {
  case StructField(name, StringType, nullable, meta) => name
}
val categoryIndexers = categoricalColumns.map { col =>
  new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed")
}
Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).
val assembler = new VectorAssembler()
.setInputCols(dfIndexed.columns.diff(List("label") ++ categoricalColumns))
.setOutputCol("features")
After applying the classifier and a few additional steps I end up with a data frame that has label, features, and prediction. I would like to expand my features vector into separate columns in order to convert the indexed values back to their original String form.
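To make the goal concrete, here is a plain-Scala sketch of what "expanding" means (the feature names and values are illustrative, not from the actual pipeline): each slot of the vector should become its own named value.

```scala
// Illustrative only: hypothetical slot names and a sample feature vector.
val featureNames = Array("statusIndexed", "artistIndexed")
val featureValues = Array(0.0, 3.0)

// "Expanding" the vector means pairing each slot with its feature name.
val expanded = featureNames.zip(featureValues).toMap
// expanded("artistIndexed") == 3.0
```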
val categoryConverters = categoricalColumns.zip(categoryIndexers).map { case (col, indexer) =>
  new IndexToString()
    .setInputCol(s"${col}Indexed")
    .setOutputCol(col)
    .setLabels(indexer.fit(df).labels)
}
Question: Is there a simple way of doing this, or is the best approach to somehow attach the prediction column to the test data frame?
What I have tried:
val featureSlicers = categoricalColumns.map { col =>
  new VectorSlicer()
    .setInputCol("features")
    .setOutputCol(s"${col}Indexed")
    .setNames(Array(s"${col}Indexed"))
}
Applying this gives me the columns that I want, but they are of type Vector (as VectorSlicer is meant to produce) and not of type Double.
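One workaround (my sketch, not part of the original post): each sliced column holds a one-element vector, so extracting index 0 yields the underlying Double. The core logic is plain Scala:

```scala
// Each sliced column carries a single-element vector; slot 0 is the value.
def firstElement(values: Array[Double]): Double = values(0)

// In Spark this would be wrapped as a UDF over the Vector column, e.g.
//   val firstElementUdf = udf((v: Vector) => v(0))
// and applied with withColumn to each "${col}Indexed" column.
```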
Edit: The desired output is the original data frame (i.e. categorical features as String not index) with an additional column indicating the predicted label (which in my case is 0 or 1).
For example, say the output of my classifier looked something like this:
+-----+---------+----------+
|label| features|prediction|
+-----+---------+----------+
|  1.0|[0.0,3.0]|       1.0|
+-----+---------+----------+
By applying VectorSlicer on each feature I would get:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        [0.0]|        [3.0]|
+-----+---------+----------+-------------+-------------+
Which is great, but I need:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|          0.0|          3.0|
+-----+---------+----------+-------------+-------------+
To then be able to use IndexToString and convert it to:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|       status|       artist|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|         good|   Pink Floyd|
+-----+---------+----------+-------------+-------------+
or even:
+-----+----------+-------------+-------------+
|label|prediction|       status|       artist|
+-----+----------+-------------+-------------+
|  1.0|       1.0|         good|   Pink Floyd|
+-----+----------+-------------+-------------+
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
VectorIndexer helps process a dataset of Vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based on a maxCategories parameter: set maxCategories to the maximum number of categories any categorical feature should have.
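As a minimal configuration sketch of the class described above (the column names and the threshold of 4 are illustrative, not from the original pipeline):

```scala
import org.apache.spark.ml.feature.VectorIndexer

// Features with at most 4 distinct values are treated as categorical;
// everything else stays continuous.
val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
```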
StringIndexer is a label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, it is cast to string and the string values are indexed. The indices are in [0, numLabels). By default, this is ordered by label frequencies, so the most frequent label gets index 0.
VectorUDT is a user-defined type for Vector which allows easy interaction with SQL via DataFrame.
Well, it is not a very useful operation, but it should be possible to extract the required information using column metadata and a simple UDF. I assume your data has been created by a pipeline similar to this one:
import org.apache.spark.ml.feature.{VectorSlicer, VectorAssembler, StringIndexer}
import org.apache.spark.ml.Pipeline
val df = sc.parallelize(Seq(
  (1L, "a", "foo", 1.0), (2L, "b", "bar", 2.0), (3L, "a", "bar", 3.0)
)).toDF("id", "x1", "x2", "x3")
val featureCols = Array("x1", "x2", "x3")
val featureColsIdx = featureCols.map(c => s"${c}_i")
val indexers = featureCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_i")
)
val assembler = new VectorAssembler()
.setInputCols(featureColsIdx)
.setOutputCol("features")
val slicer = new VectorSlicer()
.setInputCol("features")
.setOutputCol("string_features")
.setNames(featureColsIdx.init)
val transformed = new Pipeline()
.setStages(indexers :+ assembler :+ slicer)
.fit(df)
.transform(df)
First we can extract desired metadata from the features:
val meta = transformed.select($"string_features")
.schema.fields.head.metadata
.getMetadata("ml_attr")
.getMetadata("attrs")
.getMetadataArray("nominal")
and convert it to something easier to use
case class NominalMetadataWrapper(idx: Long, name: String, vals: Array[String])
// In general it could be a good idea to make it a broadcast variable.
val lookup = meta.map(m => NominalMetadataWrapper(
  m.getLong("idx"), m.getString("name"), m.getStringArray("vals")
))
Finally a small UDF:
import scala.util.Try
import org.apache.spark.mllib.linalg.Vector // org.apache.spark.ml.linalg.Vector on Spark 2.x+

val transFeatures = udf((v: Vector) => lookup.map { m =>
  Try(m.vals(v(m.idx.toInt).toInt)).toOption
})
transformed.select(transFeatures($"string_features")).show
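The reverse-mapping logic inside that UDF is plain Scala and can be checked in isolation; here is a self-contained sketch with hypothetical label arrays standing in for the extracted ml_attr metadata:

```scala
import scala.util.Try

case class NominalMeta(idx: Long, name: String, vals: Array[String])

// Hypothetical label arrays standing in for the extracted metadata.
val sampleLookup = Seq(
  NominalMeta(0, "x1_i", Array("a", "b")),
  NominalMeta(1, "x2_i", Array("bar", "foo"))
)

// Map a feature vector (here a plain Array[Double]) back to string labels;
// out-of-range indices become None instead of throwing.
def toLabels(v: Array[Double]): Seq[Option[String]] =
  sampleLookup.map(m => Try(m.vals(v(m.idx.toInt).toInt)).toOption)

// toLabels(Array(0.0, 1.0)) == Seq(Some("a"), Some("foo"))
```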