I was trying to build a logistic regression model on some sample data.
The output we can get from the model is the weights of the features used to build it.
I could not find a Spark API for the standard error of estimates, the Wald chi-square statistic, p-values, etc.
I am pasting my code below as an example:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("SparkTest").setMaster("local[*]"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val data: RDD[String] = sc.textFile("C:/Users/user/Documents/spark-1.5.1-bin-hadoop2.4/data/mllib/credit_approval_2_attr.csv")
// Parse each CSV line: the first column is the label, the remaining columns are features
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
//Splitting the data
val splits: Array[RDD[LabeledPoint]] = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training: RDD[LabeledPoint] = splits(0).cache()
val test: RDD[LabeledPoint] = splits(1)
// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(2)
.run(training)
// Clear the prediction threshold so the model will return probabilities
model.clearThreshold
print(model.weights)
The model weights output is
[-0.03335987643613915,0.025215092730373874,0.22617842810253946,0.29415985532104943,-0.0025559467210279694,4.5242237280512646E-4]
i.e. just an array of weights.
I was, however, able to calculate precision, recall, accuracy, sensitivity and other model diagnostics.
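For completeness, this is roughly how I computed those diagnostics (just a sketch using the MulticlassMetrics/BinaryClassificationMetrics classes imported above; the 0.5 cutoff is my own choice since the threshold was cleared):
// Score the test set: with the threshold cleared, predict() returns probabilities
val scoresAndLabels = test.map(p => (model.predict(p.features), p.label))
// Re-apply a 0.5 cutoff to turn probabilities back into class labels
val predictionsAndLabels = scoresAndLabels.map { case (score, label) =>
  (if (score >= 0.5) 1.0 else 0.0, label)
}
val multiMetrics = new MulticlassMetrics(predictionsAndLabels)
println(s"Precision (class 1) = ${multiMetrics.precision(1.0)}")
println(s"Recall / sensitivity (class 1) = ${multiMetrics.recall(1.0)}")
val binMetrics = new BinaryClassificationMetrics(scoresAndLabels)
println(s"Area under ROC = ${binMetrics.areaUnderROC}")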
Is there a way to calculate the standard error of estimates, the Wald chi-square statistic and p-values in Spark?
I ask because these are part of the standard output in R or SAS.
Does this have something to do with the optimization method used in Spark?
Here we use L-BFGS or SGD.
Maybe I am just not aware of the evaluation methodology.
Any suggestion will be highly appreciated.
The Wald test (a.k.a. Wald chi-squared test) is a parametric statistical test used to confirm whether a set of independent variables is collectively significant for a model. It is also used to confirm whether each individual independent variable in a model is significant.
The likelihood-ratio (LR) chi-square is the difference between the null deviance and the model deviance, e.g. LR chi-square = Dev0 − DevM = 41.18 − 25.78 = 15.40. If the null hypothesis is true, i.e. if all coefficients other than the constant equal 0, then the model chi-square statistic has a chi-square distribution with k degrees of freedom (k = the number of coefficients estimated other than the constant).
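As a sketch of that arithmetic (the deviances are the example numbers above, k is an assumed coefficient count, and ChiSquaredDistribution comes from Apache Commons Math, which ships with Spark):
import org.apache.commons.math3.distribution.ChiSquaredDistribution

val dev0 = 41.18  // null-model deviance (example value from above)
val devM = 25.78  // fitted-model deviance (example value from above)
val k = 6         // number of coefficients estimated other than the constant (assumed)
val lrChiSq = dev0 - devM  // = 15.40
val pValue = 1.0 - new ChiSquaredDistribution(k).cumulativeProbability(lrChiSq)
println(s"LR chi-square = $lrChiSq, p-value = $pValue")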
The Wald test, as a multi-variable generalization of Student's t-test, tests the statistical difference of means between groups. The chi-squared test, on the other hand, tests the statistical difference of frequencies between groups. Their calculations are similar; they differ in the denominator: variance (Wald) vs. mean (chi-square).
Regarding the Wald chi-square statistic in logistic regression: chi-square statistic = ((beta − 0) / std error)^2, where beta is the coefficient we are testing against the null hypothesis that it is 0. The (beta − 0) / std error part has the same form as a t-statistic.
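A minimal sketch of that formula (the coefficient and standard error below are made-up placeholder numbers; MLlib does not expose standard errors, so in practice they would have to come from elsewhere):
import org.apache.commons.math3.distribution.ChiSquaredDistribution

val beta = 0.226      // hypothetical coefficient estimate
val stdError = 0.081  // hypothetical standard error of that estimate
val waldChiSq = math.pow((beta - 0.0) / stdError, 2)
// Under the null hypothesis the statistic follows a chi-square distribution with 1 degree of freedom
val pValue = 1.0 - new ChiSquaredDistribution(1).cumulativeProbability(waldChiSq)
println(s"Wald chi-square = $waldChiSq, p-value = $pValue")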
The following method provides the details of a chi-square test:
Statistics.chiSqTest(data)
Input data:
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult

val obs: RDD[LabeledPoint] =
  sc.parallelize(
    Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
      LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))
    )
  )
// Conduct Pearson's independence test for each feature against the label
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
This returns an array containing a ChiSqTestResult for every feature against the label. Each result holds a summary of the test, including the p-value, degrees of freedom, test statistic, the method used, and the null hypothesis.
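For example, to read off the per-feature p-values (a small usage sketch following the pattern in the Spark MLlib documentation):
featureTestResults.zipWithIndex.foreach { case (result, i) =>
  println(s"Column ${i + 1}:")
  println(result)  // summary: method, statistic, degrees of freedom, p-value, null hypothesis
  println(s"p-value = ${result.pValue}")
}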