How to get feature vector column length in Spark Pipeline

I have an interesting question.

I am using a Pipeline object to run an ML task.

This is what my Pipeline object looks like:

jpsa_mlp.pipeline.getStages()
Out[244]:
[StringIndexer_479d82259c10308d0587,
 Tokenizer_4c5ca5ea35544bb835cb,
 StopWordsRemover_4641b68e77f00c8fbb91,
 CountVectorizer_468c96c6c714b1000eef,
 IDF_465eb809477c6c986ef9,
 MultilayerPerceptronClassifier_4a67befe93b015d5bd07]

All the estimators and transformers inside this pipeline object are created within class methods, with jpsa being the class instance.
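For context, a minimal sketch of how such a pipeline might be assembled (the column and variable names here are assumptions, not the actual class code):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, Tokenizer, StopWordsRemover, CountVectorizer, IDF
    from pyspark.ml.classification import MultilayerPerceptronClassifier

    indexer = StringIndexer(inputCol="category", outputCol="label")
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    count_vec = CountVectorizer(inputCol="filtered", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    mlp = MultilayerPerceptronClassifier(featuresCol="features", labelCol="label")

    # Stage order matches the getStages() output shown above
    pipeline = Pipeline(stages=[indexer, tokenizer, remover, count_vec, idf, mlp])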

Now I want to add a method for hyperparameter tuning, so I use the following:

 # getStages()[3] is the unfitted CountVectorizer here, so .vocab does not exist yet
 self.paramGrid = ParamGridBuilder() \
     .addGrid(self.pipeline.getStages()[5].layers,
              [len(self.pipeline.getStages()[3].vocab), 10, 3]) \
     .addGrid(self.pipeline.getStages()[5].maxIter, [100, 300]) \
     .build()

The problem is that for a neural network classifier, one of the hyperparameters is the hidden layer size. The layers attribute of the MLP classifier requires the sizes of the input, hidden, and output layers. The input and output sizes are fixed (based on the data we have), so I wanted to set the input layer size to the size of my feature vector. However, I don't know the size of my feature vector, because the estimators inside the pipeline that create the feature vectors (CountVectorizer, IDF) have not yet been fit to the data.
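For reference, this is what the layers parameter looks like (all sizes here are hypothetical):

    from pyspark.ml.classification import MultilayerPerceptronClassifier

    # layers = [input, hidden..., output]; the first entry must equal the
    # feature vector length and the last the number of classes.
    mlp = MultilayerPerceptronClassifier(layers=[5000, 10, 3], maxIter=100)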

The pipeline object will fit the data during cross-validation, using Spark's CrossValidator. Only then would I have a CountVectorizerModel and know the feature vector size.
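A sketch of that cross-validation step, assuming a training DataFrame train_df and the paramGrid from above:

    from pyspark.ml.tuning import CrossValidator
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    cv = CrossValidator(estimator=self.pipeline,
                        estimatorParamMaps=self.paramGrid,
                        evaluator=MulticlassClassificationEvaluator(),
                        numFolds=3)
    cv_model = cv.fit(train_df)  # the pipeline stages are fit only at this point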

If the CountVectorizer were materialized, I could use countVectorizerModel.vocabulary to get the length of the feature vector and use that as the input layer value in the layers attribute of the MLP.
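One way to sketch that idea, assuming a training DataFrame train_df: fit the feature stages alone first so the CountVectorizerModel is materialized, then read the vocabulary size:

    from pyspark.ml import Pipeline

    # Fit only the feature stages (indices 0-4 of the pipeline shown above).
    feature_pipeline = Pipeline(stages=self.pipeline.getStages()[:5])
    feature_model = feature_pipeline.fit(train_df)
    # Stage 3 of the fitted model is now a CountVectorizerModel.
    input_size = len(feature_model.stages[3].vocabulary)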

So how do I add hyperparameters for the MLP's layers (both the input and hidden layer sizes)?

asked Oct 17 '22 by Baktaawar


1 Answer

You can get that information from your DataFrame's schema metadata.

Scala code:

// Vector columns carry "ml_attr" metadata once the pipeline's transformers
// have produced them; "num_attrs" is the feature vector length.
val length = datasetAfterPipe.schema(datasetAfterPipe.schema.fieldIndex("columnName"))
  .metadata.getMetadata("ml_attr").getLong("num_attrs")
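Since the question uses PySpark, a possible Python equivalent (the column name "features" is an assumption, and the "ml_attr" metadata is only present after the transformers have produced the column):

    # datasetAfterPipe is the DataFrame after the feature transformers have run
    num_features = datasetAfterPipe.schema["features"].metadata["ml_attr"]["num_attrs"]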
answered Oct 20 '22 by Fermat's Little Student