I've generated a DataFrame, data, from crosstab in Spark and want to perform a chi-squared test on it. It seems that Statistics.chiSqTest can only be applied to a matrix. My DataFrame looks like the one below, and I want to see whether the level distribution is the same across the three groups: true, false, and Undefined.
from pyspark.mllib.stat import Statistics
+------+-----+------+---------+
|levels| true| false|Undefined|
+------+-----+------+---------+
|     1|32783|634460|  2732340|
|     2| 2139| 41248|    54855|
|     3|28837|573746|  5632147|
|     4|16473|320529|  8852552|
+------+-----+------+---------+
Is there an easy way to transform this so that it can be used for a chi-squared test?
One way to handle this without using mllib.Statistics:
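import scipy.stats
crosstab = ...
# Drop the label column, then convert the counts to a NumPy array.
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it.
scipy.stats.chi2_contingency(
    crosstab.drop(crosstab.columns[0]).toPandas().to_numpy()
)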
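To sanity-check the result, here is a minimal dependency-free sketch of what the test computes: Pearson's chi-squared statistic and its degrees of freedom for a contingency table. The helper name pearson_chi2 is made up for illustration, and the counts are copied from the table in the question. It does not compute a p-value, and it applies no continuity correction (as far as I know, scipy only applies one by default when there is a single degree of freedom, which is not the case for a 4x3 table).

```python
# Observed counts from the question's table: rows are levels 1-4,
# columns are the groups true, false, Undefined.
observed = [
    [32783, 634460, 2732340],
    [2139, 41248, 54855],
    [28837, 573746, 5632147],
    [16473, 320529, 8852552],
]


def pearson_chi2(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    Hypothetical helper for illustration; not part of scipy or Spark.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


statistic, dof = pearson_chi2(observed)
print(statistic, dof)
```

The statistic and degrees of freedom printed here should agree with the first and third values returned by chi2_contingency for the same counts.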
If you really want Spark statistics:
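from itertools import chain
from pyspark.mllib.linalg import DenseMatrix
from pyspark.mllib.stat import Statistics
# DenseMatrix stores values in column-major order, hence the zip(*...) transpose.
Statistics.chiSqTest(DenseMatrix(
    numRows=crosstab.count(), numCols=len(crosstab.columns) - 1,
    values=list(chain(*zip(*crosstab.drop(crosstab.columns[0]).collect())))
))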
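The chain(*zip(*...)) part is the only non-obvious step: DenseMatrix expects its values in column-major order, while collect() returns the data row by row. A small sketch of that flattening in plain Python, using made-up rows as a stand-in for the collected Spark rows:

```python
from itertools import chain

# Stand-in for crosstab.drop(...).collect(): one tuple per row.
# These small values are hypothetical, chosen so the order is easy to follow.
rows = [(1, 2, 3), (4, 5, 6)]

# zip(*rows) transposes the rows into columns: (1, 4), (2, 5), (3, 6).
# chain(*...) then concatenates the columns into one flat sequence.
column_major = list(chain(*zip(*rows)))
print(column_major)  # [1, 4, 2, 5, 3, 6]
```

Passing the rows flattened without the transpose would still give a matrix of the right shape, but with the counts in the wrong cells.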