Getting labels from StringIndexer stages within pipeline in Spark (pyspark)

I am using Spark with pyspark, and I have a pipeline set up with a bunch of StringIndexer objects that I use to encode string columns into columns of indices:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + '_index').setHandleInvalid('skip')
            for column in set(data_frame.columns) - ignore_columns]
pipeline = Pipeline(stages=indexers)
new_data_frame = pipeline.fit(data_frame).transform(data_frame)

The problem is that I need to get the list of labels for each StringIndexer after it is fitted. For a single column and a single StringIndexer without a pipeline, this is an easy task: I can just access the labels attribute after fitting the indexer on the DataFrame:

indexer = StringIndexer(inputCol="name", outputCol="name_index")
indexer_fitted = indexer.fit(data_frame)
labels = indexer_fitted.labels
new_data_frame = indexer_fitted.transform(data_frame)

However, when I use the pipeline this doesn't seem possible, or at least I don't know how to do it.

So I guess my question comes down to: Is there a way to access the labels that were used during the indexing process for each individual column?

Or will I have to ditch the pipeline for this use case and, for example, loop through the list of StringIndexer objects and do it manually? (I'm sure that would be possible, but using the pipeline would be a lot nicer.)
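
For reference, I imagine the manual fallback would look something like this sketch (reusing data_frame and ignore_columns from above):

# Hypothetical fallback without a Pipeline: fit each indexer one by one
# and collect its labels along the way
labels_by_column = {}
indexed_frame = data_frame
for column in set(data_frame.columns) - ignore_columns:
    indexer = StringIndexer(inputCol=column,
                            outputCol=column + '_index').setHandleInvalid('skip')
    fitted = indexer.fit(indexed_frame)
    labels_by_column[column] = fitted.labels
    indexed_frame = fitted.transform(indexed_frame)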

asked Aug 25 '17 by ksbg

1 Answer

Example data and Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, StringIndexerModel

df = spark.createDataFrame([("a", "foo"), ("b", "bar")], ("x1", "x2"))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol=c, outputCol='{}_index'.format(c))
    for c in df.columns
])

model = pipeline.fit(df)

Extract from stages:

# Accessing _java_obj shouldn't be necessary in Spark 2.3+
{x._java_obj.getOutputCol(): x.labels
 for x in model.stages if isinstance(x, StringIndexerModel)}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
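
On Spark 2.3 and later, where Params are synced between the Python and JVM sides, the same dictionary should be obtainable without touching _java_obj; a minimal sketch:

# Spark 2.3+: getOutputCol() should work on the fitted model directly
{x.getOutputCol(): x.labels
 for x in model.stages if isinstance(x, StringIndexerModel)}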

From metadata of the transformed DataFrame:

indexed = model.transform(df)

{c.name: c.metadata["ml_attr"]["vals"]
 for c in indexed.schema.fields if c.name.endswith("_index")}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
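
If you also need to map indices back to the original strings later, IndexToString accepts the recovered labels; a small sketch reusing the example columns above:

from pyspark.ml.feature import IndexToString

# Round-trip sketch: decode x1_index back to strings using the labels
# from the first fitted stage (the indexer for column x1)
converter = IndexToString(inputCol="x1_index", outputCol="x1_orig",
                          labels=model.stages[0].labels)
converter.transform(indexed).select("x1", "x1_index", "x1_orig").show()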
answered Oct 27 '22 by zero323