
Dimension mismatch error in Spark ML

I'm pretty new to both ML and Spark ML, and I'm trying to build a prediction model using neural networks with Spark ML, but I get the error below when I call the .transform method on my trained model. The problem seems to be caused by OneHotEncoder: when I take it out of the pipeline, everything works fine.

My question is: how can I use OneHotEncoder and not get this error?

java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41)
    at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:163)
    at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:482)
    at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:529)

My code:

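import pandas as pd
from pyspark.ml import Pipeline
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler

# header, features, categorical_feats and sqlContext are defined elsewhere (not shown)
test_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.test', names=header, skipinitialspace=True)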
train_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.data', names=header, skipinitialspace=True)
train_df = sqlContext.createDataFrame(train_pandas_df)
test_df = sqlContext.createDataFrame(test_pandas_df)

joined = train_df.union(test_df)

assembler = VectorAssembler().setInputCols(features).setOutputCol("features")

label_indexer = StringIndexer().setInputCol(
    "label").setOutputCol("label_index")

label_indexer_fit = [label_indexer.fit(joined)]

string_indexers = [StringIndexer().setInputCol(
    name).setOutputCol(name + "_index").fit(joined) for name in categorical_feats]

one_hot_pipeline = Pipeline().setStages([OneHotEncoder().setInputCol(
    name + '_index').setOutputCol(name + '_one_hot') for name in categorical_feats])

mlp = MultilayerPerceptronClassifier().setLabelCol(label_indexer.getOutputCol()).setFeaturesCol(
    assembler.getOutputCol()).setLayers([len(features), 20, 10, 2]).setSeed(42).setBlockSize(1000).setMaxIter(500)
pipeline = Pipeline().setStages(label_indexer_fit
                                + string_indexers + [one_hot_pipeline] + [assembler, mlp])

model = pipeline.fit(train_df)

# compute accuracy on the test set
result = model.transform(test_df)

## FAILS ON RESULT

predictionAndLabels = result.select("prediction", "label_index")

evaluator = MulticlassClassificationEvaluator(labelCol="label_index")
print("-------------------------------")
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
print("-------------------------------")

Thanks!

piotrm50 asked Feb 17 '17 15:02


1 Answer

The layers Param in your model is not correct:

setLayers([len(features), 20, 10, 2])

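The first layer should reflect the number of input features, which in general won't be the same as the number of raw columns before encoding. OneHotEncoder turns each indexed categorical column into a vector, so the assembled features vector is usually wider than len(features).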
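To make the mismatch concrete, here is a minimal sketch in plain Python (no Spark needed) of how the width of the assembled vector can differ from the number of raw columns. The helper name assembled_width and the category counts are made up for illustration and are not taken from the Adult dataset. It assumes the default behaviour of Spark's OneHotEncoder, which drops the last category, so a column with k categories becomes a vector of length k - 1.

```python
# Illustrative only: the counts below are hypothetical, not measured on real data.

def assembled_width(n_numeric, category_counts, drop_last=True):
    """Width of the assembled feature vector after one-hot encoding.

    With drop_last=True (the default of Spark's OneHotEncoder), a column
    with k categories becomes a vector of length k - 1.
    """
    per_column = [k - 1 if drop_last else k for k in category_counts.values()]
    return n_numeric + sum(per_column)


category_counts = {"workclass": 4, "sex": 2}  # hypothetical category counts
n_numeric = 2                                 # two numeric columns, also hypothetical

raw_columns = n_numeric + len(category_counts)
encoded_width = assembled_width(n_numeric, category_counts)

print(raw_columns)    # 4, what len(features) would count
print(encoded_width)  # 6, what the first layer actually receives
```

In this toy case a network built with 4 inputs would receive vectors of length 6, which is the kind of disagreement the "A & B Dimension mismatch" check reports.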

If you don't know the total number of features up front, you can, for example, separate feature extraction from model training. Pseudocode:

feature_pipeline_model = (Pipeline()
     .setStages(...)  # Only feature extraction
     .fit(train_df))

train_df_features = feature_pipeline_model.transform(train_df)
layers = [
    train_df_features.schema["features"].metadata["ml_attr"]["num_attrs"],
    20, 10, 2
]
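One possible way to package the lookup above is a small helper that builds the layers list from the column metadata. This is only a sketch: the name mlp_layers and the example metadata value are hypothetical, and the helper assumes the metadata is a plain dict shaped like the one read in the snippet above (an "ml_attr" entry containing "num_attrs").

```python
def mlp_layers(features_metadata, hidden, n_classes):
    """Build a layers list whose input size comes from ML attribute metadata.

    features_metadata is expected to look like the dict obtained from
    df.schema["features"].metadata after the feature pipeline has run.
    """
    try:
        n_inputs = features_metadata["ml_attr"]["num_attrs"]
    except KeyError:
        # Fail early with a readable message instead of a mismatch at predict time.
        raise ValueError("features column carries no ml_attr/num_attrs metadata")
    return [n_inputs] + list(hidden) + [n_classes]


# Stand-in for train_df_features.schema["features"].metadata (hypothetical value).
example_metadata = {"ml_attr": {"num_attrs": 6}}

layers = mlp_layers(example_metadata, hidden=[20, 10], n_classes=2)
print(layers)  # [6, 20, 10, 2]
```

With Spark, you would pass train_df_features.schema["features"].metadata instead of the stand-in dict, give the result to MultilayerPerceptronClassifier().setLayers(layers), and fit the classifier on train_df_features rather than on the raw train_df.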
zero323 answered Sep 29 '22 09:09