How do I get the mapping out of a trained Spark MLlib StringIndexerModel?
import org.apache.spark.ml.feature.StringIndexer
val stringIndexer = new StringIndexer()
  .setInputCol("myCol")
  .setOutputCol("myColIdx")
val stringIndexerModel = stringIndexer.fit(data)
val res = stringIndexerModel.transform(data)
The code above adds a myColIdx column to my DataFrame, mapping each value in myCol to an index based on the value's frequency, i.e. most frequent value -> 0, second most frequent -> 1, and so on.
How do I retrieve that mapping from the model? And if I serialize and deserialize the model, will the mapping be stable, i.e. am I guaranteed to get the same result after the transform?
StringIndexerModel exposes the mapping through its labels attribute:
stringIndexerModel.labels: Array[String]
where each label's position in the array is the index assigned to it. For example, for:
val data = Seq("foo", "bar", "foo", "bar", "foobar", "bar").toDF("myCol")
you'll get the following labels:
Array(bar, foo, foobar)
with bar indexed as 0.0, foo as 1.0 and foobar as 2.0. This is a property of the model and is preserved when the model is saved.
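To make both points concrete, here is a minimal sketch that turns labels into an explicit label -> index map and then checks that the labels survive a save/load round trip. The object name LabelMappingSketch, the helper labelIndex, the local session and the temporary directory are all assumptions made for illustration; none of them come from the question.

```scala
import java.nio.file.Files
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

// Hypothetical object and helper names, used only for this illustration.
object LabelMappingSketch {
  // The position of a label in the array is the index the model assigns to it.
  def labelIndex(labels: Array[String]): Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("label-mapping-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq("foo", "bar", "foo", "bar", "foobar", "bar").toDF("myCol")
    val model = new StringIndexer()
      .setInputCol("myCol")
      .setOutputCol("myColIdx")
      .fit(data)

    // Explicit label -> index mapping built from the model.
    println(labelIndex(model.labels))

    // Save to a temporary directory (an assumption), load it back, and compare labels.
    val path = Files.createTempDirectory("string-indexer-sketch").resolve("model").toString
    model.write.overwrite().save(path)
    val reloaded = StringIndexerModel.load(path)
    assert(reloaded.labels.sameElements(model.labels), "labels changed after save/load")

    spark.stop()
  }
}
```

Because the mapping is stored in the model as a plain array, the reloaded model should assign the same indices as the original one.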
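When used in a Pipeline, you can also use IndexToString, which uses column metadata to map indices back to labels:
import org.apache.spark.ml.feature.IndexToString
// This definition is missing from the original snippet; the column names are taken from the output below.
val indexToString = new IndexToString().setInputCol("myColIdx").setOutputCol("myColReversed")
indexToString.transform(stringIndexerModel.transform(data)).show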
+------+--------+-------------+
| myCol|myColIdx|myColReversed|
+------+--------+-------------+
| foo| 1.0| foo|
| bar| 0.0| bar|
| foo| 1.0| foo|
| bar| 0.0| bar|
|foobar| 2.0| foobar|
| bar| 0.0| bar|
+------+--------+-------------+
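Since IndexToString reads the labels from column metadata, it may help to see where that metadata lives and what to do when a column of indices arrives without it. The sketch below reads the labels back from the indexed column's schema and then passes them explicitly through setLabels. The object name IndexToStringSketch, the helper labelsFromMetadata and the hand-built frame of bare indices are assumptions for illustration only.

```scala
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical object and helper names, used only for this illustration.
object IndexToStringSketch {
  // Reads the labels stored in a numeric column's ML metadata, if there are any.
  def labelsFromMetadata(df: DataFrame, column: String): Option[Array[String]] =
    Attribute.fromStructField(df.schema(column)) match {
      case nominal: NominalAttribute => nominal.values
      case _                         => None
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("index-to-string-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq("foo", "bar", "foo", "bar", "foobar", "bar").toDF("myCol")
    val indexed = new StringIndexer()
      .setInputCol("myCol")
      .setOutputCol("myColIdx")
      .fit(data)
      .transform(data)

    // The indexed column carries the labels in its metadata.
    val labels = labelsFromMetadata(indexed, "myColIdx")
      .getOrElse(sys.error("no label metadata on myColIdx"))
    println(labels.mkString(", "))

    // A frame of bare indices built by hand, so it has no metadata (an assumption
    // standing in for data that lost its metadata along the way).
    val bare = Seq(1.0, 0.0, 2.0).toDF("myColIdx")

    // With labels set explicitly, IndexToString does not need the metadata.
    new IndexToString()
      .setInputCol("myColIdx")
      .setOutputCol("myColReversed")
      .setLabels(labels)
      .transform(bare)
      .show()

    spark.stop()
  }
}
```

Passing the labels explicitly is a fallback. When the indexed column comes straight from the StringIndexerModel, the metadata-based form shown above is simpler.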