I'm pretty new to both ML and Spark ML, and I'm trying to build a prediction model using neural networks with Spark ML, but I get the error below when I call the .transform
method on my trained model. The problem seems to be caused by OneHotEncoder, because without it everything works fine
(I have tried taking OneHotEncoder out of the pipeline).
My question is: how can I use OneHotEncoder and not get this error?
java.lang.IllegalArgumentException: requirement failed: A & B Dimension mismatch!
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.ann.BreezeUtil$.dgemm(BreezeUtil.scala:41)
    at org.apache.spark.ml.ann.AffineLayerModel.eval(Layer.scala:163)
    at org.apache.spark.ml.ann.FeedForwardModel.forward(Layer.scala:482)
    at org.apache.spark.ml.ann.FeedForwardModel.predict(Layer.scala:529)
My code:
test_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.test', names=header, skipinitialspace=True)
train_pandas_df = pd.read_csv(
    '/home/piotrek/ml/adults/adult.data', names=header, skipinitialspace=True)
train_df = sqlContext.createDataFrame(train_pandas_df)
test_df = sqlContext.createDataFrame(test_pandas_df)
joined = train_df.union(test_df)
assembler = VectorAssembler().setInputCols(features).setOutputCol("features")
label_indexer = StringIndexer().setInputCol(
    "label").setOutputCol("label_index")
label_indexer_fit = [label_indexer.fit(joined)]
string_indexers = [StringIndexer().setInputCol(
    name).setOutputCol(name + "_index").fit(joined) for name in categorical_feats]
one_hot_pipeline = Pipeline().setStages([OneHotEncoder().setInputCol(
    name + '_index').setOutputCol(name + '_one_hot') for name in categorical_feats])
mlp = MultilayerPerceptronClassifier().setLabelCol(label_indexer.getOutputCol()).setFeaturesCol(
    assembler.getOutputCol()).setLayers([len(features), 20, 10, 2]).setSeed(42L).setBlockSize(1000).setMaxIter(500)
pipeline = Pipeline().setStages(label_indexer_fit
                                + string_indexers + [one_hot_pipeline] + [assembler, mlp])
model = pipeline.fit(train_df)
# compute accuracy on the test set
result = model.transform(test_df)
## FAILS ON RESULT
predictionAndLabels = result.select("prediction", "label_index")
evaluator = MulticlassClassificationEvaluator(labelCol="label_index")
print "-------------------------------"
print("Test set accuracy = " + str(evaluator.evaluate(predictionAndLabels)))
print "-------------------------------"
Thanks!
The layers Param in your model is not correct:
setLayers([len(features), 20, 10, 2])
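The first layer should reflect the number of input features, which in general won't be the same as the number of raw columns before encoding. OneHotEncoder turns each indexed column into a vector with one slot per category (minus one with the default dropLast setting), and VectorAssembler concatenates those vectors, so the assembled features vector is usually wider than len(features).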
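To make the size difference concrete, here is a minimal plain-Python sketch of the arithmetic (no Spark needed). The function name, the column names, and the category counts are made up for illustration; the only Spark behaviour assumed is that OneHotEncoder's dropLast defaults to true, so a column with k categories contributes k - 1 slots.

```python
def assembled_feature_count(numeric_cols, category_counts, drop_last=True):
    """Return the expected length of the assembled feature vector.

    numeric_cols: names of numeric columns (one slot each).
    category_counts: dict mapping a categorical column name to its
        number of distinct categories.
    drop_last: mirrors OneHotEncoder's dropLast Param (assumed default: True).
    """
    one_hot_width = sum(
        (k - 1 if drop_last else k) for k in category_counts.values()
    )
    return len(numeric_cols) + one_hot_width


# Hypothetical columns and category counts, for illustration only.
numeric_cols = ["age", "hours_per_week"]
category_counts = {"workclass": 9, "sex": 2}

raw_column_count = len(numeric_cols) + len(category_counts)
n_inputs = assembled_feature_count(numeric_cols, category_counts)

# The first entry must be the vector width, not the raw column count.
layers = [n_inputs, 20, 10, 2]
```

With these made-up numbers there are 4 raw columns but 2 + 8 + 1 = 11 vector slots, so a network whose first layer has 4 units cannot be multiplied against an 11-element input.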
If you don't know the total number of features up front, you can, for example, separate feature extraction from model training. Pseudocode:
feature_pipeline_model = (Pipeline()
    .setStages(...)  # Only feature extraction
    .fit(train_df))
train_df_features = feature_pipeline_model.transform(train_df)
# Read the width of the assembled vector from the column metadata
layers = [
    train_df_features.schema["features"].metadata["ml_attr"]["num_attrs"],
    20, 10, 2
]
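The metadata lookup above can be wrapped in a small helper that fails with a readable message when the attribute information is missing. This is only a sketch: layers_from_metadata is a hypothetical name, it works on a plain dict shaped like schema["features"].metadata, and the example metadata value is invented.

```python
def layers_from_metadata(field_metadata, hidden_layers, num_classes):
    """Build a layers list from a features column's metadata dict.

    field_metadata: dict shaped like schema["features"].metadata.
    hidden_layers: sizes of the hidden layers, e.g. [20, 10].
    num_classes: size of the output layer.
    """
    try:
        num_inputs = field_metadata["ml_attr"]["num_attrs"]
    except KeyError:
        raise ValueError(
            "features column has no ml_attr/num_attrs metadata; "
            "the input size has to be determined another way"
        )
    return [num_inputs] + list(hidden_layers) + [num_classes]


# Invented metadata, standing in for train_df_features.schema["features"].metadata
example_metadata = {"ml_attr": {"num_attrs": 108}}
layers = layers_from_metadata(example_metadata, [20, 10], 2)
```

The resulting list would then be passed to setLayers on the classifier, which is trained on train_df_features rather than on the raw train_df.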