I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
df[list_of_cols[0]].notnull() |
df[list_of_cols[1]].notnull() |
...
df[list_of_cols[6]].notnull() |
]
I can also iterate over the list of cols and create a mask which then I would apply to df
, but his looks too tedious. Knowing how powerful is pandas with respect to dealing with nan, I would expect that there is a way easier way to achieve what I want.
Use the thresh
parameter in the dropna()
method. By setting thresh=1
, you specify that if there is at least 1 non null item, don't drop it.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
Starting with this:
data = {'a' : [np.nan,0,0,0,0,0,np.nan,0,0, 0,0,0, 9,9,],
'b' : [np.nan,np.nan,1,1,1,1,1,1,1, 2,2,2, 1,7],
'c' : [np.nan,np.nan,1,1,2,2,3,3,3, 1,1,1, 1,1],
'd' : [np.nan,np.nan,7,9,6,9,7,np.nan,6, 6,7,6, 9,6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are nulls. (Removing row index 0)
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression df[list_of_cols]).all(axis=1)
returns a boolean array that is applied as a filter to the dataframe:
isnull()
applied to df[list_of_cols]
creates a boolean mask for the dataframe df[list_of_cols]
with True
values for the null elements in df[list_of_cols]
, False
otherwise
all()
returns True
if all of the elements are True
(row-wise axis=1
)
So, by negation ~
(not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df=pd.DataFrame({'A':[11,22,33,np.NaN],
'B':['x',np.NaN,np.NaN,'w'],
'C':['2016-03-13',np.NaN,'2016-03-14','2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
isnull
mask:
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
apply all(axis=1)
row-wise:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With