I'm trying to filter the rows of a PySpark DataFrame so that only rows in which all column values are non-zero are kept.
I was hoping to use something like this (analogous to NumPy's np.all()):
from pyspark.sql.functions import col
df.filter(all([(col(c) != 0) for c in df.columns]))
But I get the following ValueError:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
Is there any way to perform a logical AND over a list of conditions? What is the equivalent of np.all in PySpark?
Just reduce the list of predicates:
from functools import reduce
from operator import and_
from pyspark.sql.functions import col, lit

# Fold the per-column predicates into a single Column with &
df.where(reduce(and_, (col(c) != 0 for c in df.columns)))
or
df.where(reduce(and_, (col(c) != 0 for c in df.columns), lit(True)))
if you expect that the list of predicates might be empty.
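If you would rather not build the combined Column in Python, a similar filter can be expressed as a single SQL predicate string passed to expr(). This is only an alternative sketch, assuming the same df as above and at least one column:
from pyspark.sql.functions import expr

# Alternative sketch: build a SQL predicate such as
# "`x` != 0 AND `y` != 0 AND `z` != 0" and let Spark parse it.
# Assumes df.columns is non-empty; expr("") would fail.
predicate = " AND ".join("`{}` != 0".format(c) for c in df.columns)
df.where(expr(predicate))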
For example, if the data looks like this:
df = sc.parallelize([
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)
]).toDF(["x", "y", "z"])
the result will be:
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  1|  1|
+---+---+---+
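For completeness, here is a minimal end-to-end sketch that does not rely on an existing sc; it assumes a local SparkSession, and the names spark and result are my own:
from functools import reduce
from operator import and_

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.createDataFrame(
    [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)],
    ["x", "y", "z"],
)

# Keep only rows where every column is non-zero; lit(True) guards the
# empty-column edge case, as in the reduce version above.
result = df.where(reduce(and_, (col(c) != 0 for c in df.columns), lit(True)))
result.show()  # only the (1, 1, 1) row remains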