 

Pandas: Find rows where a particular column is not NA but all other columns are

Tags: python, pandas

I have a DataFrame which contains a lot of NA values. I want to write a query which returns rows where a particular column is not NA but all other columns are NA.

I can easily get a DataFrame where the interesting column is not NA:

df[df.interesting_column.notna()]

However, I can't figure out how to then say "from that DataFrame, return only rows where every column other than 'interesting_column' is NA". I can't use .dropna() as every row and column contains at least one NA value.

I realise this is probably embarrassingly simple. I have tried lots of .loc variations and join/merges in various configurations, and I am not getting anywhere.

Any pointers before I just do a for loop over this thing would be appreciated.

Asked May 17 '18 by Tom Cooper

3 Answers

You can simply use a conjunction of the conditions:

df[df.interesting_column.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)]
  • df.interesting_column.notna() checks that the column is non-null.

  • df.isnull().sum(axis=1) == len(df.columns) - 1 checks that the number of nulls in the row equals the number of columns minus 1.

Both conditions together mean that the entry in the column is the only one that is non-null.
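As a quick sketch of how the conjunction behaves, here is the mask applied to a small illustrative frame (column names and values are made up for the demo):

```python
import numpy as np
import pandas as pd

# Toy frame; 'interesting_column' is the one we want populated.
df = pd.DataFrame({
    'interesting_column': [1.0, 2.0, np.nan],
    'other_a': [np.nan, 5.0, np.nan],
    'other_b': [np.nan, np.nan, 7.0],
})

# A row qualifies iff interesting_column is non-null AND the row's
# null count equals len(df.columns) - 1 (i.e. every other column is null).
mask = df.interesting_column.notna() & (df.isnull().sum(axis=1) == len(df.columns) - 1)

print(df[mask])  # only row 0 survives
```

Row 1 fails because other_a is also populated; row 2 fails because interesting_column itself is null.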

Answered Nov 17 '22 by Ami Tavory


The & operator performs an element-wise (row-by-row) "and" of two boolean Series. Right now, you are using df.interesting_column.notna() to give you a column of True/False values. You could repeat this for all columns, using notna() or isna() as desired, and use the & operator to combine the results.

For example, if you have columns a, b, and c, and you want to find rows where the value in columns a is not NaN and the values in the other columns are NaN, then do the following:

df[df.a.notna() & df.b.isna() & df.c.isna()]

This is clear and simple when you have a small number of columns that you know about ahead of time. But, if you have many columns, or if you don't know the column names, you would want a solution that loops over all columns and checks notna() for the interesting_column and isna() for the other columns. The solution by @AmiTavory is a clever way to achieve this. But, if you didn't know about that solution, here is a simpler approach.

for colName in df.columns:
    if colName == "interesting_column":
        df = df[ df[colName].notna() ]
    else:
        df = df[ df[colName].isna() ]
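For instance, on a toy two-column frame (the data values are illustrative), the loop whittles df down to the single row where 'interesting_column' alone is populated:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'interesting_column': [1.0, np.nan, 3.0],
    'other': [np.nan, 2.0, 4.0],
})

# Filter one column at a time: keep notna() for the target column,
# isna() for every other column.
for colName in df.columns:
    if colName == "interesting_column":
        df = df[df[colName].notna()]
    else:
        df = df[df[colName].isna()]

print(df)  # only the first row remains
```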
Answered Nov 17 '22 by Tim Johns


You can use:

rows = df.drop('interesting_column', axis=1).isna().all(1) & df['interesting_column'].notna()

Example (suppose c is the interesting column):

In [99]: df = pd.DataFrame({'a': [1, np.nan, 2], 'b': [1, np.nan, 3], 'c':[4, 5, np.nan]})

In [100]: df
Out[100]: 
     a    b    c
0  1.0  1.0  4.0
1  NaN  NaN  5.0
2  2.0  3.0  NaN

In [101]: rows = df.drop('c', axis=1).isna().all(1) & df.c.notna()

In [102]: rows
Out[102]: 
0    False
1     True
2    False
dtype: bool

In [103]: df[rows]
Out[103]: 
    a   b    c
1 NaN NaN  5.0
Answered Nov 17 '22 by llllllllll