I have a DataFrame which contains a lot of NA values. I want to write a query which returns rows where a particular column is not NA but all other columns are NA.
I can easily enough get a DataFrame where that particular column is not NA:
df[df.interesting_column.notna()]
However, I can't figure out how to then say "from that DataFrame, return only rows where every column other than 'interesting_column' is NA". I can't use .dropna, as every row and column will contain at least one NA value.
I realise this is probably embarrassingly simple. I have tried lots of .loc variations and join/merges in various configurations, and I am not getting anywhere.
Any pointers before I just do a for loop over this thing would be appreciated.
You can simply use a conjunction of the conditions:
df[df.interesting_column.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)]
df.interesting_column.notna()
checks that the column is non-null.
df.isnull().sum(axis=1) == len(df.columns) - 1
checks that the number of nulls in the row is the number of columns minus 1.
Both conditions together mean that the entry in the column is the only one that is non-null.
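To see the conjunction at work on a small frame (the column names here are invented for illustration, with b playing the role of interesting_column):

```python
import numpy as np
import pandas as pd

# Hypothetical sample: only row 1 has b set while every other column is NaN
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [2.0, 3.0, np.nan],
                   'c': [np.nan, np.nan, 4.0]})

# b is non-null AND the row's null count equals (number of columns - 1)
mask = df.b.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)

print(df[mask])  # row 1: a and c are NaN, b is 3.0
```

Row 0 fails because only one of its three columns is null; row 2 fails because b itself is null.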
The & operator lets you "and" together two boolean columns row by row. Right now, you are using df.interesting_column.notna() to give you a column of True or False values. You could repeat this for all columns, using notna() or isna() as desired, and use the & operator to combine the results.
For example, if you have columns a, b, and c, and you want to find rows where the value in column a is not NaN and the values in the other columns are NaN, then do the following:
df[df.a.notna() & df.b.isna() & df.c.isna()]
This is clear and simple when you have a small number of columns that you know about ahead of time. But if you have many columns, or if you don't know the column names, you would want a solution that loops over all columns, checking notna() for the interesting_column and isna() for the other columns. The solution by @AmiTavory is a clever way to achieve this. But if you didn't know about that solution, here is a simpler approach:
for colName in df.columns:
    if colName == "interesting_column":
        df = df[df[colName].notna()]
    else:
        df = df[df[colName].isna()]
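A quick sanity check of the loop on a hypothetical two-column frame (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data: only row 0 has interesting_column set and the other column NA
df = pd.DataFrame({'interesting_column': [1.0, 2.0, np.nan],
                   'other': [np.nan, 5.0, 6.0]})

# Successively filter: keep rows passing each column's condition
for colName in df.columns:
    if colName == "interesting_column":
        df = df[df[colName].notna()]
    else:
        df = df[df[colName].isna()]

print(df)  # only row 0 remains
```

Note that this filters df by reassignment, so keep a copy (e.g. df.copy()) if you still need the original frame afterwards.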
You can use:
rows = df.drop('interesting_column', axis=1).isna().all(1) & df['interesting_column'].notna()
Example (suppose c is the interesting column):
In [99]: df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c':[4, 5, np.nan]})
In [100]: df
Out[100]:
a b c
0 1.0 1.0 4.0
1 NaN NaN 5.0
2 2.0 3.0 NaN
In [101]: rows = df.drop('c', axis=1).isna().all(1) & df.c.notna()
In [102]: rows
Out[102]:
0 False
1 True
2 False
dtype: bool
In [103]: df[rows]
Out[103]:
a b c
1 NaN NaN 5.0