
Spark, ML, StringIndexer: handling unseen labels

My goal is to build a multiclass classifier.

I have built a pipeline for feature extraction. Its first step is a StringIndexer transformer that maps each class name to a label; this label is then used in the classifier training step.

The pipeline is fitted on the training set.

The test set has to be processed by the fitted pipeline in order to extract the same feature vectors.

My test set files have the same structure as the training set, but the test set may contain a class name that was never seen during fitting. In that case the StringIndexer fails to find the label and raises an exception.

Is there a solution for this case? Or how can we avoid it happening?

Rami asked Jan 08 '16 16:01


1 Answer

Since Spark 2.2 (released July 2017) you can use the .setHandleInvalid("keep") option when creating the indexer. With this option, the indexer does not fail on labels it did not see during fitting; it maps all of them to one extra index.

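    import org.apache.spark.ml.feature.StringIndexer

    // Unfitted indexer; call .fit(...) on the training data to get the model.
    val categoryIndexerModel = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("indexedCategory")
      .setHandleInvalid("keep") // options are "keep", "error" or "skip"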

From the documentation: there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:

  • 'error': throws an exception (which is the default)
  • 'skip': skips the rows containing the unseen labels entirely (removes the rows on the output!)
  • 'keep': puts unseen labels in a special additional bucket, at index numLabels
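To make the 'keep' behaviour concrete, here is a minimal sketch that fits the indexer on one dataset and transforms another containing an unseen label. The data, the column names and the local SparkSession are assumptions made up for illustration; it is written as a script (for example pasted into spark-shell) and needs Spark 2.2 or later.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

// Local session for illustration only (assumption: running outside a cluster).
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("StringIndexerUnseenLabels")
  .getOrCreate()
import spark.implicits._

// Hypothetical training data with three known categories.
val train = Seq("a", "b", "c", "a", "a", "c").toDF("category")
// Hypothetical test data: "d" never appears in the training data.
val test = Seq("a", "d").toDF("category")

// Fit on the training data only; "keep" reserves an extra index for unseen labels.
val indexerModel = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("indexedCategory")
  .setHandleInvalid("keep")
  .fit(train)

// Transforming the test data no longer throws on the unseen label "d".
val indexed = indexerModel.transform(test)
indexed.show()
```

With the default frequency-based ordering, "a" is the most frequent label and gets index 0.0, while the unseen "d" lands in the extra bucket at index 3.0 (numLabels is 3 here). A downstream classifier then has to be prepared for that additional index.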

Please see the StringIndexer documentation for examples of how the output looks for the different options.

queise answered Oct 23 '22 13:10