Dimension mismatch error in Spark ML

Question

I'm fairly new to both ML and Spark ML. I'm trying to build a prediction model using neural networks with Spark ML, but I get the error below when I call the .transform method on my trained model. The problem seems to be caused by OneHotEncoder: when I take it out of the pipeline, everything works fine.

My question is: how can I use OneHotEncoder and not get this error?

java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41)
    at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:163)
    at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:482)
    at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:529)

My code:

test_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.test', names=header, skipinitialspace=True)
train_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.data', names=header, skipinitialspace=True)
train_df = sqlContext.createDataFrame(train_pandas_df)
test_df = sqlContext.createDataFrame(test_pandas_df)

joined = train_df.union(test_df)

assembler = VectorAssembler().setInputCols(features).setOutputCol("features")

label_indexer = StringIndexer().setInputCol(
    "label").setOutputCol("label_index")

label_indexer_fit = [label_indexer.fit(joined)]

string_indexers = [StringIndexer().setInputCol(
    name).setOutputCol(name + "_index").fit(joined) for name in categorical_feats]

one_hot_pipeline = Pipeline().setStages([OneHotEncoder().setInputCol(
    name + '_index').setOutputCol(name + '_one_hot') for name in categorical_feats])

mlp = MultilayerPerceptronClassifier().setLabelCol(label_indexer.getOutputCol()).setFeaturesCol(
    assembler.getOutputCol()).setLayers([len(features), 20, 10, 2]).setSeed(42L).setBlockSize(1000).setMaxIter(500)
pipeline = Pipeline().setStages(label_indexer_fit
                                + string_indexers + [one_hot_pipeline] + [assembler, mlp])

model = pipeline.fit(train_df)

# compute accuracy on the test set
result = model.transform(test_df)

## FAILS ON RESULT

predictionAndLabels = result.select("prediction", "label_index")

evaluator = MulticlassClassificationEvaluator(labelCol="label_index")
print "-------------------------------"
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
print "-------------------------------"

Thanks!

zero323 · Accepted Answer

The layers Param in your model is not correct:

setLayers([len(features), 20, 10, 2])

The first layer should reflect the number of input features, that is, the length of the assembled feature vector, which in general won't be the same as the number of raw columns before encoding.
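To make the difference concrete, here is a rough sketch in plain Python of how wide the assembled vector becomes. It assumes OneHotEncoder's default dropLast=True, under which a categorical column with k distinct values is encoded as a vector of length k - 1. The helper name assembled_width, the column names, and the category counts are made up for illustration and are not taken from your data.

```python
def assembled_width(numeric_cols, category_counts, drop_last=True):
    """Estimate the length of the vector produced by assembling numeric
    columns together with one-hot encoded categorical columns."""
    width = len(numeric_cols)  # each numeric column contributes one slot
    for k in category_counts.values():
        # With drop_last, a column with k categories yields k - 1 slots.
        width += k - 1 if drop_last else k
    return width


# Hypothetical columns, for illustration only.
numeric_cols = ["age", "hours_per_week"]
category_counts = {"workclass": 9, "sex": 2}

raw_columns = len(numeric_cols) + len(category_counts)
vector_width = assembled_width(numeric_cols, category_counts)
print(raw_columns, vector_width)
```

Under these assumptions the network's first layer would need to match vector_width, not raw_columns.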

If you don't know the total number of features up front, you can, for example, separate feature extraction from model training. Pseudocode:

feature_pipeline_model = (Pipeline()
     .setStages(...)  # Only feature extraction
     .fit(train_df))

train_df_features = feature_pipeline_model.transform(train_df)
layers = [
    train_df_features.schema["features"].metadata["ml_attr"]["num_attrs"],
    20, 10, 2
]

Tags: python, machine-learning, apache-spark, pyspark, apache-spark-ml

piotrm50
