I have a dataframe with a structure similar to the following:
col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,C,A,D
A,F,A,A
A,V,B,A
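For reference, this input can be built like so (a minimal sketch; only the hard-coded sample values shown above are assumed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hard-coded copy of the sample data above
df = spark.createDataFrame(
    [("A", "A", "A", "A"),
     ("A", "B", "C", "D"),
     ("B", "C", "A", "D"),
     ("A", "C", "A", "D"),
     ("A", "F", "A", "A"),
     ("A", "V", "B", "A")],
    ["col1", "col2", "col3", "col4"],
)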
What I want is to 'drop' the rows where conditions are met for all columns at the same time. For example, drop rows where col1 == A and col2 == C at the same time. Note that, in this case, the only row that should be dropped would be "A,C,A,D", as it's the only one where both conditions are met at the same time. Hence, the dataframe should look like this:
col1, col2, col3, col4
A,A,A,A
A,B,C,D
B,C,A,D
A,F,A,A
A,V,B,A
What I've tried so far is:
# spark library import
import pyspark.sql.functions as F
df = df.filter(
    (F.col("col1") != "A") & (F.col("col2") != "C")
)
This one doesn't filter as I want, because it removes every row where either condition is met on its own, like col1 == "A" or col2 == "C", returning:
col1, col2, col3, col4
B,C,A,D
Can anybody please help me out with this?
Thanks
Combine both conditions and negate the result with ~ (NOT):
cond = (F.col('col1') == 'A') & (F.col('col2') == 'C')
df.filter(~cond)
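Note that filter returns a new dataframe rather than modifying df in place, so assign the result if you want to keep it (a quick sketch, assuming df holds the sample data from the question):
df = df.filter(~cond)   # with the sample data, this drops only the "A,C,A,D" row
df.show()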
from pyspark.sql.functions import when

# Flag a row as soon as either condition fails; rows where both col1 == 'A' and col2 == 'C' stay null and are filtered out
df.withColumn('Result', when(df.col1 != 'A', True).when(df.col2 != 'C', True)) \
    .filter('Result == True').drop('Result').show()
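This works because the chained when flags a row as soon as either col1 != 'A' or col2 != 'C' holds; only rows where both col1 == 'A' and col2 == 'C' fall through to null and get filtered out, which by De Morgan's law is the same condition the ~ expresses in the answer above. An equivalent, slightly more direct sketch (assuming pyspark.sql.functions is imported as F, as in the question):
df.withColumn(
    'Result',
    when((F.col('col1') == 'A') & (F.col('col2') == 'C'), False).otherwise(True)
).filter(F.col('Result')).drop('Result').show()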