I have a dataframe with multiple categorical columns. I'm trying to compute the chi-squared statistic between two columns using the built-in function:
from pyspark.ml.stat import ChiSquareTest
r = ChiSquareTest.test(df, 'feature1', 'feature2')
However, it gives me the error:
IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'
The datatype for feature1 is:
feature1: double (nullable = true)
Could you please help me with this?
spark-ml is not a typical statistics library; it is very ML-oriented, so it assumes that you will want to run a test between a label and a feature or a group of features.
Therefore, just as when you train a model, you need to assemble the features you want to test against the label.
In your case, you can just assemble feature1 as follows:
from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.feature import VectorAssembler
data = [(1, 2), (3, 4), (2, 1), (4, 3)]
df = spark.createDataFrame(data, ['feature1', 'feature2'])
# Wrap feature1 into a single vector column, as ChiSquareTest expects.
assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')
ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)
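The result is a single-row DataFrame with pValues, degreesOfFreedom, and statistics columns; if you prefer to read them out programmatically rather than with show(), a minimal sketch (field names per the ChiSquareTest result schema):
r = ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').head()
print('pValues: ' + str(r.pValues))
print('degreesOfFreedom: ' + str(r.degreesOfFreedom))
print('statistics: ' + str(r.statistics))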
Just in case, here is the equivalent code in Scala:
import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.ml.feature.VectorAssembler

import spark.implicits._

val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
  .toDF("feature1", "feature2", "feature3")

// Wrap feature1 into a vector column, as ChiSquareTest expects.
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1"))
  .setOutputCol("features")

ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)
To expand on Oli's answer, Spark ML expects features to be stored in instances of pyspark.ml.linalg.Vector. There are two kinds of vectors:
- dense vectors, which are plain arrays (array<T>) holding all elements of the vector, zeros included;
- sparse vectors, which consist of a size that indicates the full dimension of the vector, an indices array that holds the positions of the non-zero elements, and a values array that holds the values of the non-zero elements.
Both vector types are actually represented using the structure for sparse vectors, whereas for dense vectors the indices array goes unused and values stores all of the values. The first structure element, type, is used to distinguish between the two kinds.
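To make the difference concrete, here is a minimal sketch that builds the same vector both ways using the pyspark.ml.linalg.Vectors factory:
from pyspark.ml.linalg import Vectors
# Dense: stores every element, zeros included.
dense = Vectors.dense([1.0, 0.0, 3.0])
# Sparse: size 3, non-zero positions [0, 2], non-zero values [1.0, 3.0].
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])
print(dense)   # [1.0,0.0,3.0]
print(sparse)  # (3,[0,2],[1.0,3.0])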
So, if you see an error saying that something expects struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, it means you are supposed to pass instances of pyspark.ml.linalg.Vector and not plain numbers.
In order to produce Vectors, you can either use pyspark.ml.feature.VectorAssembler to assemble one or more independent feature columns into a single vector column, or construct them manually using the factory methods Vectors.dense() (for dense vectors) and Vectors.sparse() (for sparse vectors) of the factory object pyspark.ml.linalg.Vectors. Using VectorAssembler is probably easier and also faster, since it's implemented in Scala. For an example of explicit vector creation, consult the ChiSquareTest example in the PySpark documentation.
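As a sketch of the manual route, adapted from the pattern in the PySpark documentation and assuming an active SparkSession named spark:
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest
data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ['label', 'features'])
# The features column already holds Vectors, so no assembler is needed.
r = ChiSquareTest.test(df, 'features', 'label').head()
print('pValues: ' + str(r.pValues))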