I am using Spark and PySpark, and I have a pipeline set up with a bunch of StringIndexer objects that I use to encode the string columns to columns of indices:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + '_index').setHandleInvalid('skip')
            for column in list(set(data_frame.columns) - ignore_columns)]
pipeline = Pipeline(stages=indexers)
new_data_frame = pipeline.fit(data_frame).transform(data_frame)
The problem is that I need to get the list of labels for each StringIndexer object after it gets fitted. For a single column and a single StringIndexer without a pipeline, it's an easy task; I can just access the labels attribute after fitting the indexer on the DataFrame:
indexer = StringIndexer(inputCol="name", outputCol="name_index")
indexer_fitted = indexer.fit(data_frame)
labels = indexer_fitted.labels
new_data_frame = indexer_fitted.transform(data_frame)
However, when I use the pipeline, this doesn't seem possible, or at least I don't know how to do it.
So I guess my question comes down to: is there a way to access the labels that were used during the indexing process for each individual column?
Or will I have to ditch the pipeline in this use case and, for example, loop through the list of StringIndexer objects and do it manually? (I'm sure that would be possible; however, using the pipeline would just be a lot nicer.)
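For reference, here is a minimal sketch of what that manual loop could look like (assuming the same data_frame and ignore_columns as above):

from pyspark.ml.feature import StringIndexer

# Fit each indexer individually, collect its labels, and chain the transforms.
labels_per_column = {}
new_data_frame = data_frame
for column in set(data_frame.columns) - ignore_columns:
    indexer = StringIndexer(inputCol=column, outputCol=column + '_index').setHandleInvalid('skip')
    fitted = indexer.fit(new_data_frame)
    labels_per_column[column] = fitted.labels          # labels for this column, ordered by index
    new_data_frame = fitted.transform(new_data_frame)  # append the index column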
StringIndexer - A label indexer that maps a string column of labels to an ML column of label indices. If the input column is numeric, we cast it to string and index the string values. The indices are in [0, numLabels). By default, this is ordered by label frequencies, so the most frequent label gets index 0.
VectorIndexer - This helps process a dataset of unknown vectors into a dataset with some continuous features and some categorical features. The choice between continuous and categorical is based upon a maxCategories parameter. Set maxCategories to the maximum number of categories any categorical feature should have.
VectorAssembler - A transformer that combines a given list of columns into a single vector column. It is useful for combining raw features and features generated by different feature transformers into a single feature vector, in order to train ML models like logistic regression and decision trees.
StringIndexer - Use it if you want the machine learning algorithm to identify the column as a categorical variable, or if you want to convert the textual data to numeric data while keeping the categorical context.
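To illustrate how these transformers typically fit together, here is a minimal, hypothetical sketch (the DataFrame and column names are made up for the example) that indexes a string column and assembles it with a numeric column into a single feature vector:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Hypothetical example frame: one string column and one numeric column.
df = spark.createDataFrame([("red", 1.0), ("blue", 2.0), ("red", 3.0)], ("color", "amount"))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol="color", outputCol="color_index"),                    # categorical -> index
    VectorAssembler(inputCols=["color_index", "amount"], outputCol="features"),  # combine into one vector
])
pipeline.fit(df).transform(df).show()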
Example data and Pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, StringIndexerModel

df = spark.createDataFrame([("a", "foo"), ("b", "bar")], ("x1", "x2"))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol=c, outputCol='{}_index'.format(c))
    for c in df.columns
])
model = pipeline.fit(df)
Extract from stages:
# Accessing _java_obj shouldn't be necessary in Spark 2.3+
{x._java_obj.getOutputCol(): x.labels
for x in model.stages if isinstance(x, StringIndexerModel)}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
From metadata of the transformed DataFrame:
indexed = model.transform(df)
{c.name: c.metadata["ml_attr"]["vals"]
for c in indexed.schema.fields if c.name.endswith("_index")}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
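As a small usage sketch (the mappings name and decode helper below are made up for illustration), the position of a label in each list is exactly the index that the StringIndexer assigned, so you can decode indices back to labels:

mappings = {c.name: c.metadata["ml_attr"]["vals"]
            for c in indexed.schema.fields if c.name.endswith("_index")}

def decode(column, index, mappings=mappings):
    # StringIndexer assigns index i to the label at position i in the list.
    return mappings[column][int(index)]

decode("x1_index", 0)  # -> 'a'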