Getting labels from StringIndexer stages within pipeline in Spark (pyspark)

I am using Spark with pyspark, and I have a pipeline set up with a bunch of StringIndexer objects that I use to encode string columns into columns of indices:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer

indexers = [StringIndexer(inputCol=column, outputCol=column + '_index').setHandleInvalid('skip')
            for column in set(data_frame.columns) - ignore_columns]
pipeline = Pipeline(stages=indexers)
new_data_frame = pipeline.fit(data_frame).transform(data_frame)

The problem is that I need to get the list of labels for each StringIndexer after it is fitted. For a single column and a single StringIndexer without a pipeline, this is an easy task: I can just access the labels attribute after fitting the indexer on the DataFrame:

indexer = StringIndexer(inputCol="name", outputCol="name_index")
indexer_fitted = indexer.fit(data_frame)
labels = indexer_fitted.labels
new_data_frame = indexer_fitted.transform(data_frame)

However, when I use the pipeline this doesn't seem possible, or at least I don't know how to do it.

So I guess my question comes down to: Is there a way to access the labels that were used during the indexing process for each individual column?

Or will I have to ditch the pipeline for this use case and, for example, loop through the list of StringIndexer objects and do it manually? (I'm sure that would be possible, but using the pipeline would be a lot nicer.)
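
For reference, I imagine the manual fallback would look something like this sketch (reusing data_frame and ignore_columns from above):

# Hypothetical fallback without a Pipeline: fit each indexer one by one
# and collect its labels along the way
labels_by_column = {}
indexed_frame = data_frame
for column in set(data_frame.columns) - ignore_columns:
    indexer = StringIndexer(inputCol=column,
                            outputCol=column + '_index').setHandleInvalid('skip')
    fitted = indexer.fit(indexed_frame)
    labels_by_column[column] = fitted.labels
    indexed_frame = fitted.transform(indexed_frame)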

asked Aug 25 '17 by ksbg

1 Answer

Example data and Pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, StringIndexerModel

df = spark.createDataFrame([("a", "foo"), ("b", "bar")], ("x1", "x2"))

pipeline = Pipeline(stages=[
    StringIndexer(inputCol=c, outputCol='{}_index'.format(c))
    for c in df.columns
])

model = pipeline.fit(df)

Extract from stages:

# Accessing _java_obj shouldn't be necessary in Spark 2.3+
{x._java_obj.getOutputCol(): x.labels
 for x in model.stages if isinstance(x, StringIndexerModel)}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
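
On Spark 2.3 and later, where Params are synced between the Python and JVM sides, the same dictionary should be obtainable without touching _java_obj; a minimal sketch:

# Spark 2.3+: getOutputCol() should work on the fitted model directly
{x.getOutputCol(): x.labels
 for x in model.stages if isinstance(x, StringIndexerModel)}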

From metadata of the transformed DataFrame:

indexed = model.transform(df)

{c.name: c.metadata["ml_attr"]["vals"]
 for c in indexed.schema.fields if c.name.endswith("_index")}
# {'x1_index': ['a', 'b'], 'x2_index': ['foo', 'bar']}
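
If you also need to map indices back to the original strings later, IndexToString accepts the recovered labels; a small sketch reusing the example columns above:

from pyspark.ml.feature import IndexToString

# Round-trip sketch: decode x1_index back to strings using the labels
# from the first fitted stage (the indexer for column x1)
converter = IndexToString(inputCol="x1_index", outputCol="x1_orig",
                          labels=model.stages[0].labels)
converter.transform(indexed).select("x1", "x1_index", "x1_orig").show()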
answered Oct 27 '22 by zero323