 

PySpark DataFrame filter using logical AND over a list of conditions -- NumPy all() equivalent

I'm trying to filter the rows of a PySpark DataFrame based on whether all of the column values are zero.

I was hoping to use something like this (the equivalent of the NumPy function np.all()):

from pyspark.sql.functions import col
df.filter(all([(col(c) != 0) for c in df.columns]))

But this raises a ValueError:

ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

Is there any way to perform a logical AND over a list of conditions? What is the equivalent of np.all in PySpark?

Asked Dec 20 '16 by MarkNS

1 Answer

Just reduce the list of predicates:

from functools import reduce
from operator import and_
from pyspark.sql.functions import col, lit

df.where(reduce(and_, (col(c) != 0 for c in df.columns)))

or

df.where(reduce(and_, (col(c) != 0 for c in df.columns), lit(True)))

if you expect that the list of predicates might be empty.
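
(Without an initial value, Python's functools.reduce raises a TypeError on an empty sequence; passing lit(True) turns that edge case into a filter that keeps every row. A minimal sketch, assuming an empty column list named cols purely for illustration:

from functools import reduce
from operator import and_
from pyspark.sql.functions import col, lit

cols = []  # hypothetical: no columns to check

# reduce(and_, (col(c) != 0 for c in cols))  # TypeError if cols is empty
predicate = reduce(and_, (col(c) != 0 for c in cols), lit(True))  # literal true column, keeps all rows
df.where(predicate)
)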

For example, if the data looks like this:

df = sc.parallelize([
    (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)
]).toDF(["x", "y", "z"])

the result will be:

+---+---+---+
|  x|  y|  z|
+---+---+---+
|  1|  1|  1|
+---+---+---+
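
For a self-contained run on a current Spark version, a minimal sketch (assuming a SparkSession named spark, since sc.parallelize uses the older RDD entry point):

from functools import reduce
from operator import and_
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1)],
    ["x", "y", "z"],
)

# Keep only the rows where every column is non-zero (the np.all analogue)
df.where(reduce(and_, (col(c) != 0 for c in df.columns), lit(True))).show()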
Answered by zero323