
Show a dataframe with all rows that have null values

I am new to PySpark and its DataFrames. What I am trying to do is get the subset of rows that contain null value(s) in any column.

Most examples I see online show a filter on a specific column. Is it possible to filter the entire DataFrame and show all the rows that contain at least one null value?

asked Sep 16 '25 by Murtaza Mohsin

2 Answers

If you don't care about which columns are null, you can use a loop to create a filtering condition:

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

q1_df = spark.createDataFrame(
    [(None, 1, 2), (3, None, 4), (5, 6, None), (7, 8, 9)],
    ['a', 'b', 'c'])
q1_df.show(5, False)
+----+----+----+
|a   |b   |c   |
+----+----+----+
|null|1   |2   |
|3   |null|4   |
|5   |6   |null|
|7   |8   |9   |
+----+----+----+


# start from an always-false "base" condition, then OR in a null check per column
condition = func.lit(False)
for col in q1_df.columns:
    condition = condition | func.col(col).isNull()
q1_df.filter(condition).show(3, False)
+----+----+----+
|a   |b   |c   |
+----+----+----+
|null|1   |2   |
|3   |null|4   |
|5   |6   |null|
+----+----+----+

Since you want the rows where any single column is null, you can combine the per-column checks with OR.


Edit on: 2022-08-01

The reason I first declare condition as func.lit(False) is simply to keep the code short: it acts as a "base" condition to build on and has no effect on the actual filtering. When you inspect condition, you will see:

Column<'(((false OR (a IS NULL)) OR (b IS NULL)) OR (c IS NULL))'>

You can of course build the condition another way. For example:

for idx, col in enumerate(q1_df.columns):
    if idx == 0:
        condition = (func.col(col).isNull())
    else:
        condition = condition | (func.col(col).isNull())

condition
Column<'(((a IS NULL) OR (b IS NULL)) OR (c IS NULL))'>
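
Another equivalent way, if you prefer to avoid the explicit loop, is to fold the per-column checks together with functools.reduce; a minimal sketch, assuming the q1_df defined above:

from functools import reduce
from pyspark.sql import functions as func

# OR together one isNull() check per column
condition = reduce(lambda acc, c: acc | func.col(c).isNull(),
                   q1_df.columns[1:],
                   func.col(q1_df.columns[0]).isNull())
q1_df.filter(condition).show(3, False)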

Alternatively, if you want the opposite, i.e. keep only the rows where every column is NOT null, I would write:

condition = (func.lit(True)) 
for col in q1_df.columns:
    condition = condition & (func.col(col).isNotNull())

As long as you can build the full filtering condition some other way, you can drop the func.lit(False). Just remember that if you create a "base" condition like I did, don't use the Python built-in bool as below, since a Python boolean and a Spark Column are different types and cannot be combined with |:

condition = False

for col in q1_df.columns:
    condition = condition | (func.col(col).isNull())
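
As a side note, if the goal is simply to keep the rows with no nulls at all, the built-in dropna() should give the same result as the AND condition above; a minimal sketch, assuming the same q1_df:

# drop every row that contains at least one null, keep only fully populated rows
q1_df.dropna().show()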
answered Sep 18 '25 by Jonathan


Try this. df.columns gives the names of all columns, and df[df.columns] selects all of them. The last line returns all rows that contain at least one null across the columns. The code should also still work if you replace any None in data with np.nan (after importing numpy).

import pandas as pd

data = {'a': [10, 20, 30, 40],
        'b': ['a', None, 'b', 'c'],
        'c': [None, 'b', 'c', 'd']}
df = pd.DataFrame(data)
print(df)
print()
# keep the rows where any column is null
print(df[df[df.columns].isnull().any(axis=1)])

Output

    a     b     c
0  10     a  None
1  20  None     b
2  30     b     c
3  40     c     d

    a     b     c
0  10     a  None
1  20  None     b
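
For what it's worth, since isnull() already applies to every column, the df[df.columns] selection can be dropped for a slightly shorter equivalent (assuming the same df):

# same result: rows with at least one null in any column
print(df[df.isnull().any(axis=1)])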
answered Sep 18 '25 by DrCorgi