I am new to PySpark and DataFrames. What I am trying to do is get the subset of all the rows that have a null value in any column.
Most examples I see online show a filter on a specific column. Is it possible to filter the entire DataFrame and show all the rows that contain at least one null value?
If you don't care about which columns are null, you can use a loop to create a filtering condition:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

q1_df = spark.createDataFrame(
    [(None, 1, 2), (3, None, 4), (5, 6, None), (7, 8, 9)],
    ['a', 'b', 'c']
)
q1_df.show(5, False)
+----+----+----+
|a |b |c |
+----+----+----+
|null|1 |2 |
|3 |null|4 |
|5 |6 |null|
|7 |8 |9 |
+----+----+----+
condition = func.lit(False)
for col in q1_df.columns:
    condition = condition | func.col(col).isNull()

q1_df.filter(condition).show(3, False)
+----+----+----+
|a |b |c |
+----+----+----+
|null|1 |2 |
|3 |null|4 |
|5 |6 |null|
+----+----+----+
Since you're looking for rows where any one of the columns is null, you can chain the per-column checks with OR.
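If you prefer not to seed the loop with a literal, the same OR chain can also be built with functools.reduce; a minimal sketch, equivalent to the loop above:

from functools import reduce
from operator import or_

# (a IS NULL) OR (b IS NULL) OR (c IS NULL), with no lit(False) base needed
condition = reduce(or_, [func.col(c).isNull() for c in q1_df.columns])
q1_df.filter(condition).show(5, False)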
Edit on: 2022-08-01
The reason I first declare condition as func.lit(False) is simply to keep my code simple: I just want a "base" condition to build on. This false literal has no effect on the filtering itself. When you inspect condition, you will see:
Column<'(((false OR (a IS NULL)) OR (b IS NULL)) OR (c IS NULL))'>
In fact, you can use other methods to create the condition. For example:
for idx, col in enumerate(q1_df.columns):
    if idx == 0:
        condition = func.col(col).isNull()
    else:
        condition = condition | func.col(col).isNull()
condition
Column<'(((a IS NULL) OR (b IS NULL)) OR (c IS NULL))'>
Alternatively, if you want to keep only the rows where ALL columns are non-null (i.e. drop any row containing a null), I would write:
condition = func.lit(True)
for col in q1_df.columns:
    condition = condition & func.col(col).isNotNull()
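Applying this condition to the example DataFrame above is a quick way to check it (a short usage sketch; with the sample data, only the fully populated row remains):

q1_df.filter(condition).show(5, False)
# Only the row (7, 8, 9) is kept, since it is the only row
# with no null in any column of the sample data.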
As long as you can create the complete filtering condition some other way, you can eliminate the func.lit(False). Just a reminder: if you create a "base" condition like I did, please don't use the Python built-in bool as shown below, since a Python boolean and a Spark Column are not the same type:
condition = False
for col in q1_df.columns:
    condition = condition | func.col(col).isNull()
Try this. df[df.columns] selects all the columns (df.columns gives their names). The last line returns all rows that contain at least one null across those columns. The code should also still work if you replace any None in data with np.nan.
import pandas as pd
data = {'a': [10, 20, 30, 40],
        'b': ['a', None, 'b', 'c'],
        'c': [None, 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df)
print()
print(df[df[df.columns].isnull().any(axis=1)])
Output
    a     b     c
0  10     a  None
1  20  None     b
2  30     b     c
3  40     c     d

    a     b     c
0  10     a  None
1  20  None     b
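As noted above, the same filter works when the missing entries are np.nan instead of None. A minimal sketch (reusing the same column names) to illustrate:

import numpy as np
import pandas as pd

data_nan = {'a': [10, 20, 30, 40],
            'b': ['a', np.nan, 'b', 'c'],
            'c': [np.nan, 'b', 'c', 'd']}

df_nan = pd.DataFrame(data_nan)
# isnull() flags both None and np.nan as missing, so the same row filter applies
print(df_nan[df_nan.isnull().any(axis=1)])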