IllegalArgumentException: Column must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.

I have a dataframe with multiple categorical columns. I'm trying to compute the chi-squared statistic between two columns using the built-in function:

from pyspark.ml.stat import ChiSquareTest

r = ChiSquareTest.test(df, 'feature1', 'feature2')

However, it gives me the error:

IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'

The datatype for feature1 is:

feature1: double (nullable = true)

Could you please help me with this?

asked Apr 06 '20 by Pratham Solanki

2 Answers

Spark ML is not a typical statistics library; it is very ML-oriented, so it assumes that you will want to run a test between a label and a feature or a group of features.

Therefore, just as when you train a model, you need to assemble the features you want to test against the label.

In your case, you can just assemble feature1 as follows:

from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.feature import VectorAssembler

# Toy data: two integer-valued categorical columns
data = [(1, 2), (3, 4), (2, 1), (4, 3)]
df = spark.createDataFrame(data, ['feature1', 'feature2'])

# Wrap feature1 in a vector column, as ChiSquareTest expects
assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')

ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)
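
The result is a DataFrame with pValues, degreesOfFreedom, and statistics columns, each holding one entry per assembled feature.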

Just in case, here is the same code in Scala:

import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._  // for toDF

// Toy data: three integer-valued categorical columns
val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
    .toDF("feature1", "feature2", "feature3")

// Wrap feature1 in a vector column, as ChiSquareTest expects
val assembler = new VectorAssembler()
    .setInputCols(Array("feature1"))
    .setOutputCol("features")

ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)
answered Oct 23 '22 by Oli

To expand on Oli's answer, Spark ML expects features to be stored in instances of pyspark.ml.linalg.Vector. There are two kinds of vectors:

  • dense vectors - simply arrays that hold all elements of the vector, including any zeros
  • sparse vectors - more complex data structures that store only the non-zero elements of a vector, allowing compact storage of huge vectors with just a few non-zeros. A sparse vector has three components:
    • an integer size that gives the full dimension of the vector
    • an indices array that holds the positions of the non-zero elements
    • a values array that holds the values of those non-zero elements

Both vector kinds are actually stored using the structure for sparse vectors; for dense vectors the indices array simply goes unused and values holds all of the elements. The first structure field, type, distinguishes between the two kinds (0 for sparse, 1 for dense).
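
A minimal sketch of the two representations, using the pyspark.ml.linalg.Vectors factory (both vectors below describe the same 3-dimensional vector):

from pyspark.ml.linalg import Vectors

dense = Vectors.dense([1.0, 0.0, 3.0])
# sparse(size, indices of the non-zero elements, their values)
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])

print(dense.toArray())   # [1. 0. 3.]
print(sparse.toArray())  # [1. 0. 3.] - identical once expanded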

So, if you see an error saying that something expects struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, it means you are supposed to pass instances of pyspark.ml.linalg.Vector and not mere numbers.

In order to produce Vectors, you can either use pyspark.ml.feature.VectorAssembler to assemble one or more independent feature columns into a single vector column, or construct them manually with the factory methods Vectors.dense() (for dense vectors) and Vectors.sparse() (for sparse vectors) of the factory object pyspark.ml.linalg.Vectors. Using VectorAssembler is usually easier and also faster, since it is implemented in Scala. For explicit vector creation, consult the example for ChiSquareTest in the PySpark documentation, sketched below.
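
This is essentially the ChiSquareTest example from the PySpark documentation, with the feature vectors built by hand (it assumes an active spark session):

from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

# Each row pairs a label with a hand-built dense feature vector
data = [(0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (0.0, Vectors.dense(3.5, 40.0)),
        (1.0, Vectors.dense(3.5, 40.0))]
df = spark.createDataFrame(data, ["label", "features"])

r = ChiSquareTest.test(df, "features", "label").head()
print("pValues: " + str(r.pValues))
print("degreesOfFreedom: " + str(r.degreesOfFreedom))
print("statistics: " + str(r.statistics))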

answered Oct 23 '22 by Hristo Iliev