I was trying to build a logistic regression model on some sample data.
The output we can get from the model is the weights of the features used to build it.
I could not find a Spark API for the standard error of estimates, the Wald chi-square statistic, p-values, etc.
I am pasting my code below as an example:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.{BinaryClassificationMetrics, MulticlassMetrics}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
val sc = new SparkContext(new SparkConf().setAppName("SparkTest").setMaster("local[*]"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val data: RDD[String] = sc.textFile("C:/Users/user/Documents/spark-1.5.1-bin-hadoop2.4/data/mllib/credit_approval_2_attr.csv")
// Parse each CSV line: the first column is the label, the remaining columns are features
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}
//Splitting the data
val splits: Array[RDD[LabeledPoint]] = parsedData.randomSplit(Array(0.7, 0.3), seed = 11L)
val training: RDD[LabeledPoint] = splits(0).cache()
val test: RDD[LabeledPoint] = splits(1)
// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
.setNumClasses(2)
.run(training)
// Clear the prediction threshold so the model will return probabilities
model.clearThreshold
print(model.weights)
The model weights output is
[-0.03335987643613915,0.025215092730373874,0.22617842810253946,0.29415985532104943,-0.0025559467210279694,4.5242237280512646E-4]
i.e. just an array of weights.
I was, however, able to calculate precision, recall, accuracy, sensitivity and other model diagnostics.
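For completeness, this is roughly how I computed those diagnostics (just a sketch using the MulticlassMetrics/BinaryClassificationMetrics classes imported above; the 0.5 cutoff is my own choice since the threshold was cleared):
// Score the test set: with the threshold cleared, predict() returns probabilities
val scoresAndLabels = test.map(p => (model.predict(p.features), p.label))
// Re-apply a 0.5 cutoff to turn probabilities back into class labels
val predictionsAndLabels = scoresAndLabels.map { case (score, label) =>
  (if (score >= 0.5) 1.0 else 0.0, label)
}
val multiMetrics = new MulticlassMetrics(predictionsAndLabels)
println(s"Precision (class 1) = ${multiMetrics.precision(1.0)}")
println(s"Recall / sensitivity (class 1) = ${multiMetrics.recall(1.0)}")
val binMetrics = new BinaryClassificationMetrics(scoresAndLabels)
println(s"Area under ROC = ${binMetrics.areaUnderROC}")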
Is there a way to calculate the standard error of estimates, the Wald chi-square statistic and p-values in Spark?
I ask because these are part of the standard output in R or SAS.
Does this have something to do with the optimization method used in Spark?
Here we use L-BFGS or SGD.
Maybe I am just not aware of the evaluation methodology.
Any suggestion will be highly appreciated.
The Wald test (a.k.a. Wald chi-squared test) is a parametric statistical test used to confirm whether a set of independent variables is collectively significant for a model. It is also used to confirm whether each individual independent variable in a model is significant.
The likelihood-ratio (LR) chi-square is the difference between the null deviance and the model deviance, e.g. LR chi-square = Dev0 − DevM = 41.18 − 25.78 = 15.40. If the null hypothesis is true, i.e. if all coefficients other than the constant equal 0, then the model chi-square statistic has a chi-square distribution with k degrees of freedom (k = the number of coefficients estimated other than the constant).
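As a sketch of that arithmetic (the deviances are the example numbers above, k is an assumed coefficient count, and ChiSquaredDistribution comes from Apache Commons Math, which ships with Spark):
import org.apache.commons.math3.distribution.ChiSquaredDistribution

val dev0 = 41.18  // null-model deviance (example value from above)
val devM = 25.78  // fitted-model deviance (example value from above)
val k = 6         // number of coefficients estimated other than the constant (assumed)
val lrChiSq = dev0 - devM  // = 15.40
val pValue = 1.0 - new ChiSquaredDistribution(k).cumulativeProbability(lrChiSq)
println(s"LR chi-square = $lrChiSq, p-value = $pValue")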
The Wald test, as a multi-variable generalization of Student's t-test, tests the statistical difference of means between groups. The chi-squared test, on the other hand, tests the statistical difference of frequencies between groups. Their calculations are similar; they differ in the denominator: variance (Wald) vs. mean (chi-square).
Regarding the Wald chi-square statistic in logistic regression: chi-square statistic = ((beta − 0) / std error)^2, where beta is the coefficient we are testing against the null hypothesis that it is 0. The (beta − 0) / std error part has the same form as a t-statistic.
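A minimal sketch of that formula (the coefficient and standard error below are made-up placeholder numbers; MLlib does not expose standard errors, so in practice they would have to come from elsewhere):
import org.apache.commons.math3.distribution.ChiSquaredDistribution

val beta = 0.226      // hypothetical coefficient estimate
val stdError = 0.081  // hypothetical standard error of that estimate
val waldChiSq = math.pow((beta - 0.0) / stdError, 2)
// Under the null hypothesis the statistic follows a chi-square distribution with 1 degree of freedom
val pValue = 1.0 - new ChiSquaredDistribution(1).cumulativeProbability(waldChiSq)
println(s"Wald chi-square = $waldChiSq, p-value = $pValue")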
The following method provides the details of a chi-square test:
Statistics.chiSqTest(data)
Input data:
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.test.ChiSqTestResult

val obs: RDD[LabeledPoint] =
  sc.parallelize(
    Seq(
      LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0)),
      LabeledPoint(1.0, Vectors.dense(1.0, 2.0, 0.0)),
      LabeledPoint(-1.0, Vectors.dense(-1.0, 0.0, -0.5))
    )
  )
// Conduct Pearson's independence test for each feature against the label
val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)
This returns an array containing a ChiSqTestResult for every feature against the label. Each result holds a summary of the test, including the p-value, degrees of freedom, test statistic, the method used, and the null hypothesis.
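For example, to read off the per-feature p-values (a small usage sketch following the pattern in the Spark MLlib documentation):
featureTestResults.zipWithIndex.foreach { case (result, i) =>
  println(s"Column ${i + 1}:")
  println(result)  // summary: method, statistic, degrees of freedom, p-value, null hypothesis
  println(s"p-value = ${result.pValue}")
}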