How to get feature vector column length in Spark Pipeline

I have an interesting question.

I am using a Pipeline object to run an ML task.

This is what my Pipeline object looks like:

jpsa_mlp.pipeline.getStages()
Out[244]:
[StringIndexer_479d82259c10308d0587,
 Tokenizer_4c5ca5ea35544bb835cb,
 StopWordsRemover_4641b68e77f00c8fbb91,
 CountVectorizer_468c96c6c714b1000eef,
 IDF_465eb809477c6c986ef9,
 MultilayerPerceptronClassifier_4a67befe93b015d5bd07]

All the estimators and transformers inside this pipeline object are created within class methods, with jpsa being the class instance.
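For context, a minimal sketch of how such a pipeline might be assembled (the column and variable names here are assumptions, not the actual class code):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, Tokenizer, StopWordsRemover, CountVectorizer, IDF
    from pyspark.ml.classification import MultilayerPerceptronClassifier

    indexer = StringIndexer(inputCol="category", outputCol="label")
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    count_vec = CountVectorizer(inputCol="filtered", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    mlp = MultilayerPerceptronClassifier(featuresCol="features", labelCol="label")

    # Stage order matches the getStages() output shown above
    pipeline = Pipeline(stages=[indexer, tokenizer, remover, count_vec, idf, mlp])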

Now I want to add a method for hyperparameter tuning, so I use the following:

 # getStages()[3] is the unfitted CountVectorizer here, so .vocab does not exist yet
 self.paramGrid = ParamGridBuilder() \
     .addGrid(self.pipeline.getStages()[5].layers,
              [len(self.pipeline.getStages()[3].vocab), 10, 3]) \
     .addGrid(self.pipeline.getStages()[5].maxIter, [100, 300]) \
     .build()

The problem is that for a neural network classifier, one of the hyperparameters is the hidden layer size. The layers attribute of the MLP classifier requires the sizes of the input, hidden, and output layers. The input and output sizes are fixed (based on the data we have), so I wanted to set the input layer size to the size of my feature vector. However, I don't know the size of my feature vector, because the estimators inside the pipeline that create the feature vectors (CountVectorizer, IDF) have not yet been fit to the data.
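For reference, this is what the layers parameter looks like (all sizes here are hypothetical):

    from pyspark.ml.classification import MultilayerPerceptronClassifier

    # layers = [input, hidden..., output]; the first entry must equal the
    # feature vector length and the last the number of classes.
    mlp = MultilayerPerceptronClassifier(layers=[5000, 10, 3], maxIter=100)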

The pipeline object will fit the data during cross-validation, using Spark's CrossValidator. Only then would I have a CountVectorizerModel and know the feature vector size.
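A sketch of that cross-validation step, assuming a training DataFrame train_df and the paramGrid from above:

    from pyspark.ml.tuning import CrossValidator
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    cv = CrossValidator(estimator=self.pipeline,
                        estimatorParamMaps=self.paramGrid,
                        evaluator=MulticlassClassificationEvaluator(),
                        numFolds=3)
    cv_model = cv.fit(train_df)  # the pipeline stages are fit only at this point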

If the CountVectorizer were materialized, I could use countVectorizerModel.vocabulary to get the length of the feature vector and use that as the input layer value in the layers attribute of the MLP.
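One way to sketch that idea, assuming a training DataFrame train_df: fit the feature stages alone first so the CountVectorizerModel is materialized, then read the vocabulary size:

    from pyspark.ml import Pipeline

    # Fit only the feature stages (indices 0-4 of the pipeline shown above).
    feature_pipeline = Pipeline(stages=self.pipeline.getStages()[:5])
    feature_model = feature_pipeline.fit(train_df)
    # Stage 3 of the fitted model is now a CountVectorizerModel.
    input_size = len(feature_model.stages[3].vocabulary)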

So how do I add hyperparameters for the MLP's layers (both the input and hidden layer sizes)?

asked Oct 17 '22 by Baktaawar


1 Answer

You can get that information from your DataFrame's schema metadata.

Scala code:

// Vector columns carry "ml_attr" metadata once the pipeline's transformers
// have produced them; "num_attrs" is the feature vector length.
val length = datasetAfterPipe.schema(datasetAfterPipe.schema.fieldIndex("columnName"))
  .metadata.getMetadata("ml_attr").getLong("num_attrs")
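Since the question uses PySpark, a possible Python equivalent (the column name "features" is an assumption, and the "ml_attr" metadata is only present after the transformers have produced the column):

    # datasetAfterPipe is the DataFrame after the feature transformers have run
    num_features = datasetAfterPipe.schema["features"].metadata["ml_attr"]["num_attrs"]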
answered Oct 20 '22 by Fermat's Little Student