 

Check whether boolean column contains only True values

Working in Databricks, I've got a dataframe which looks like this:

columns = ["a", "b", "c"]
data = [(True, True, True), (True, True, True), (True, False, True)]
df = spark.createDataFrame(data).toDF(*columns)
df.display()

+----+-----+----+
|   a|    b|   c|
+----+-----+----+
|true| true|true|
|true| true|true|
|true|false|true|
+----+-----+----+

I'd like to select only those columns of the dataframe in which not all values are True.
In pandas, I would use df['a'].all() to check whether all values of column "a" are True. Unfortunately, I can't find an equivalent in PySpark. I have found a solution to the problem, but it seems much too complicated:

df.select(*[column for column in df.columns 
            if df.select(column).distinct().collect() != 
            spark.createDataFrame([True], 'boolean').toDF(column).collect()])

The solution returns what I want:

+-----+
|    b|
+-----+
| true|
| true|
|false|
+-----+

Is there an easier way of doing this in PySpark?

asked Sep 13 '25 by the_economist

1 Answer

Here is one approach that follows a similar operational model to pandas:

from pyspark.sql import functions as F

def is_all(c):
    # the sum of the column cast to int equals the (non-null) row count
    # only when every value in column c is True
    return (F.sum(F.col(c).astype('int')) == F.count(c)).alias(c)

# create a boolean mask;
# in pandas this would be similar to df.all(axis=0)
mask = df.agg(*[is_all(c) for c in columns]).collect()[0]

# Use the boolean mask to filter the columns
result = df.select(*[c for c in columns if not mask[c]])

Result

+-----+
|    b|
+-----+
| true|
| true|
|false|
+-----+
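
A shorter variant (a sketch, not part of the original answer): since Spark orders booleans with false < true, F.min of a boolean column is True exactly when every non-null value is True, so the whole mask can be built from a single min aggregate per column:

from pyspark.sql import functions as F

# F.min over a boolean column yields True only when every
# (non-null) value in the column is True (false < true in Spark)
mask = df.agg(*[F.min(F.col(c)).alias(c) for c in df.columns]).collect()[0]

# keep only the columns that are not all-True
result = df.select(*[c for c in df.columns if not mask[c]])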
answered Sep 14 '25 by Shubham Sharma