Use the result of crosstab (Spark DataFrame) for a chi-square test in Spark MLlib

Question

I've generated a DataFrame by calling crosstab on a Spark DataFrame and want to perform a chi-squared test on it.

It seems that Statistics.chiSqTest can only be applied to a matrix. My DataFrame looks like the one below, and I want to see whether the level distribution is the same across the three groups: true, false, and Undefined.

from pyspark.mllib.stat import Statistics


+-----------------------------+-------+--------+----------+
|levels                       |   true|   false|Undefined |
+-----------------------------+-------+--------+----------+
|                         1   |32783  |634460  |2732340   |
|                         2   | 2139  | 41248  |54855     |
|                         3   |28837  |573746  |5632147   |
|                         4   |16473  |320529  |8852552   |
+-----------------------------+-------+--------+----------+

Is there an easy way to transform this so that it can be used for a chi-squared test?

zero323 · Accepted Answer

One way to handle this without using mllib.Statistics:

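import scipy.stats

crosstab = ...
# Drop the label column, then convert the counts to a NumPy array.
# (DataFrame.as_matrix() was removed in pandas 1.0; .values works instead.)
scipy.stats.chi2_contingency(
  crosstab.drop(crosstab.columns[0]).toPandas().values
)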
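To illustrate what the SciPy call returns, here is a minimal sketch that skips Spark entirely. It assumes the counts from the table in the question are typed in by hand as a NumPy array named observed (a name chosen here for illustration); in practice the array would come from the toPandas() conversion above.

```python
import numpy as np
import scipy.stats

# Counts copied by hand from the crosstab in the question
# (rows: levels 1-4; columns: true, false, Undefined).
observed = np.array([
    [32783, 634460, 2732340],
    [2139, 41248, 54855],
    [28837, 573746, 5632147],
    [16473, 320529, 8852552],
])

# chi2_contingency returns the test statistic, the p-value, the degrees
# of freedom, and the table of expected counts under independence.
chi2, p_value, dof, expected = scipy.stats.chi2_contingency(observed)

# Degrees of freedom are (rows - 1) * (columns - 1) = 3 * 2 = 6.
print(dof)
print(expected.shape)
```

The expected array has the same shape as the input, so it can be compared cell by cell with the observed counts.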

If you really want Spark statistics:

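from itertools import chain

from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.stat import Statistics

# DenseMatrix expects values in column-major order, so transpose the
# collected rows with zip(*...) before flattening them with chain.
Statistics.chiSqTest(DenseMatrix(
    numRows=crosstab.count(), numCols=len(crosstab.columns) - 1,
    values=list(chain(*zip(*crosstab.drop(crosstab.columns[0]).collect())))
))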
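The values argument above is built with chain(*zip(*rows)), which can be hard to read at first. The sketch below shows the same idiom on plain tuples, which stand in for the Row objects that collect() would return (the small 2x3 table is made up for illustration). The point is that the rows are transposed before flattening, which yields the column-major order that DenseMatrix stores its values in.

```python
from itertools import chain

# Two rows of a small table, standing in for the Row objects
# returned by collect().
rows = [(1, 2, 3),
        (4, 5, 6)]

# zip(*rows) transposes the rows into columns: (1, 4), (2, 5), (3, 6).
# chain(*...) then flattens those columns into a single sequence.
column_major = list(chain(*zip(*rows)))
print(column_major)  # [1, 4, 2, 5, 3, 6]
```

Flattening the rows directly, without the transpose, would give row-major order and silently scramble the contingency table.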


Tags:

python

apache-spark

apache-spark-sql

pyspark

apache-spark-mllib

MYjx

1 Answers

zero323
