Context: I have a data frame where all categorical values have been indexed using StringIndexer.
val categoricalColumns = df.schema.collect {
  case StructField(name, StringType, nullable, meta) => name
}
val categoryIndexers = categoricalColumns.map { col =>
  new StringIndexer().setInputCol(col).setOutputCol(s"${col}Indexed")
}
Then I used VectorAssembler to vectorize all feature columns (including the indexed categorical ones).
val assembler = new VectorAssembler()
.setInputCols(dfIndexed.columns.diff(List("label") ++ categoricalColumns))
.setOutputCol("features")
After applying the classifier and a few additional steps I end up with a data frame that has label, features, and prediction. I would like to expand my features vector into separate columns in order to convert the indexed values back to their original String form.
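To make the goal concrete, here is a plain-Scala sketch of what "expanding" means (the feature names and values are illustrative, not from the actual pipeline): each slot of the vector should become its own named value.

```scala
// Illustrative only: hypothetical slot names and a sample feature vector.
val featureNames = Array("statusIndexed", "artistIndexed")
val featureValues = Array(0.0, 3.0)

// "Expanding" the vector means pairing each slot with its feature name.
val expanded = featureNames.zip(featureValues).toMap
// expanded("artistIndexed") == 3.0
```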
val categoryConverters = categoricalColumns.zip(categoryIndexers).map { case (col, indexer) =>
  new IndexToString()
    .setInputCol(s"${col}Indexed")
    .setOutputCol(col)
    .setLabels(indexer.fit(df).labels)
}
Question: Is there a simple way of doing this, or is the best approach to somehow attach the prediction column to the test data frame?
What I have tried:
val featureSlicers = categoricalColumns.map { col =>
  new VectorSlicer()
    .setInputCol("features")
    .setOutputCol(s"${col}Indexed")
    .setNames(Array(s"${col}Indexed"))
}
Applying this gives me the columns that I want, but they are of type Vector (as VectorSlicer is meant to produce) and not of type Double.
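One workaround (my sketch, not part of the original post): each sliced column holds a one-element vector, so extracting index 0 yields the underlying Double. The core logic is plain Scala:

```scala
// Each sliced column carries a single-element vector; slot 0 is the value.
def firstElement(values: Array[Double]): Double = values(0)

// In Spark this would be wrapped as a UDF over the Vector column, e.g.
//   val firstElementUdf = udf((v: Vector) => v(0))
// and applied with withColumn to each "${col}Indexed" column.
```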
Edit: The desired output is the original data frame (i.e. categorical features as String not index) with an additional column indicating the predicted label (which in my case is 0 or 1).
For example, say the output of my classifier looked something like this:
+-----+---------+----------+
|label| features|prediction|
+-----+---------+----------+
|  1.0|[0.0,3.0]|       1.0|
+-----+---------+----------+
By applying VectorSlicer on each feature I would get:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|        [0.0]|        [3.0]|
+-----+---------+----------+-------------+-------------+
Which is great, but I need:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|statusIndexed|artistIndexed|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|          0.0|          3.0|
+-----+---------+----------+-------------+-------------+
To then be able to use IndexToString and convert it to:
+-----+---------+----------+-------------+-------------+
|label| features|prediction|       status|       artist|
+-----+---------+----------+-------------+-------------+
|  1.0|[0.0,3.0]|       1.0|         good|   Pink Floyd|
+-----+---------+----------+-------------+-------------+
or even:
+-----+----------+-------------+-------------+
|label|prediction|       status|       artist|
+-----+----------+-------------+-------------+
|  1.0|       1.0|         good|   Pink Floyd|
+-----+----------+-------------+-------------+
VectorAssembler is a transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
VectorIndexer helps process a dataset of Vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based on a maxCategories parameter: set maxCategories to the maximum number of categories any categorical feature should have.
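As a minimal configuration sketch of the class described above (the column names and the threshold of 4 are illustrative, not from the original pipeline):

```scala
import org.apache.spark.ml.feature.VectorIndexer

// Features with at most 4 distinct values are treated as categorical;
// everything else stays continuous.
val vectorIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
```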
StringIndexer is a label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, it is cast to string and the string values are indexed. The indices are in [0, numLabels). By default, this is ordered by label frequencies, so the most frequent label gets index 0.
VectorUDT is a user-defined type for Vector which allows easy interaction with SQL via DataFrame.
Well, it is not a very useful operation, but it should be possible to extract the required information using column metadata and a simple UDF. I assume your data has been created by a pipeline similar to this one:
import org.apache.spark.ml.feature.{VectorSlicer, VectorAssembler, StringIndexer}
import org.apache.spark.ml.Pipeline
val df = sc.parallelize(Seq(
  (1L, "a", "foo", 1.0), (2L, "b", "bar", 2.0), (3L, "a", "bar", 3.0)
)).toDF("id", "x1", "x2", "x3")
val featureCols = Array("x1", "x2", "x3")
val featureColsIdx = featureCols.map(c => s"${c}_i")
val indexers = featureCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_i")
)
val assembler = new VectorAssembler()
.setInputCols(featureColsIdx)
.setOutputCol("features")
val slicer = new VectorSlicer()
.setInputCol("features")
.setOutputCol("string_features")
.setNames(featureColsIdx.init)
val transformed = new Pipeline()
.setStages(indexers :+ assembler :+ slicer)
.fit(df)
.transform(df)
First we can extract desired metadata from the features:
val meta = transformed.select($"string_features")
.schema.fields.head.metadata
.getMetadata("ml_attr")
.getMetadata("attrs")
.getMetadataArray("nominal")
and convert it to something easier to use
case class NominalMetadataWrapper(idx: Long, name: String, vals: Array[String])
// In general it could be a good idea to make it a broadcast variable.
val lookup = meta.map(m => NominalMetadataWrapper(
  m.getLong("idx"), m.getString("name"), m.getStringArray("vals")
))
Finally a small UDF:
import scala.util.Try
import org.apache.spark.mllib.linalg.Vector // org.apache.spark.ml.linalg.Vector on Spark 2.x+

val transFeatures = udf((v: Vector) => lookup.map { m =>
  Try(m.vals(v(m.idx.toInt).toInt)).toOption
})
transformed.select(transFeatures($"string_features")).show
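The reverse-mapping logic inside that UDF is plain Scala and can be checked in isolation; here is a self-contained sketch with hypothetical label arrays standing in for the extracted ml_attr metadata:

```scala
import scala.util.Try

case class NominalMeta(idx: Long, name: String, vals: Array[String])

// Hypothetical label arrays standing in for the extracted metadata.
val sampleLookup = Seq(
  NominalMeta(0, "x1_i", Array("a", "b")),
  NominalMeta(1, "x2_i", Array("bar", "foo"))
)

// Map a feature vector (here a plain Array[Double]) back to string labels;
// out-of-range indices become None instead of throwing.
def toLabels(v: Array[Double]): Seq[Option[String]] =
  sampleLookup.map(m => Try(m.vals(v(m.idx.toInt).toInt)).toOption)

// toLabels(Array(0.0, 1.0)) == Seq(Some("a"), Some("foo"))
```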