Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select rows where at least one value from the list of columns is not null

Tags:

python

pandas

I have a big dataframe with many columns (like 1000). I have a list of columns (generated by a script ~10). And I would like to select all the rows in the original dataframe where at least one of my list of columns is not null.

So if I would know the number of my columns in advance, I could do something like this:

list_of_cols = ['col1', ...]
df[
  df[list_of_cols[0]].notnull() |
  df[list_of_cols[1]].notnull() |
  ...
  df[list_of_cols[6]].notnull() |
]

I can also iterate over the list of cols and create a mask which then I would apply to df, but his looks too tedious. Knowing how powerful is pandas with respect to dealing with nan, I would expect that there is a way easier way to achieve what I want.

like image 449
Salvador Dali Avatar asked Aug 20 '16 04:08

Salvador Dali


3 Answers

Use the thresh parameter in the dropna() method. By setting thresh=1, you specify that if there is at least 1 non null item, don't drop it.

df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))

df[list_of_cols].dropna(thresh=1).head()

enter image description here

like image 182
piRSquared Avatar answered Oct 19 '22 20:10

piRSquared


Starting with this:

data = {'a' :      [np.nan,0,0,0,0,0,np.nan,0,0,  0,0,0,    9,9,],
    'b' :      [np.nan,np.nan,1,1,1,1,1,1,1,  2,2,2,    1,7],
    'c' :      [np.nan,np.nan,1,1,2,2,3,3,3,  1,1,1,    1,1],
    'd' :      [np.nan,np.nan,7,9,6,9,7,np.nan,6,  6,7,6,    9,6]}

df = pd.DataFrame(data, columns=['a','b','c','d'])
df
      a    b    c    d
0   NaN  NaN  NaN  NaN
1   0.0  NaN  NaN  NaN
2   0.0  1.0  1.0  7.0
3   0.0  1.0  1.0  9.0
4   0.0  1.0  2.0  6.0
5   0.0  1.0  2.0  9.0
6   NaN  1.0  3.0  7.0
7   0.0  1.0  3.0  NaN
8   0.0  1.0  3.0  6.0
9   0.0  2.0  1.0  6.0
10  0.0  2.0  1.0  7.0
11  0.0  2.0  1.0  6.0
12  9.0  1.0  1.0  9.0
13  9.0  7.0  1.0  6.0

Rows where not all values are nulls. (Removing row index 0)

df[~df.isnull().all(axis=1)]

      a    b    c    d
1   0.0  NaN  NaN  NaN
2   0.0  1.0  1.0  7.0
3   0.0  1.0  1.0  9.0
4   0.0  1.0  2.0  6.0
5   0.0  1.0  2.0  9.0
6   NaN  1.0  3.0  7.0
7   0.0  1.0  3.0  NaN
8   0.0  1.0  3.0  6.0
9   0.0  2.0  1.0  6.0
10  0.0  2.0  1.0  7.0
11  0.0  2.0  1.0  6.0
12  9.0  1.0  1.0  9.0
13  9.0  7.0  1.0  6.0
like image 25
Merlin Avatar answered Oct 19 '22 18:10

Merlin


One can use boolean indexing

df[~pd.isnull(df[list_of_cols]).all(axis=1)]

Explanation:

The expression df[list_of_cols]).all(axis=1) returns a boolean array that is applied as a filter to the dataframe:

  • isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols] with True values for the null elements in df[list_of_cols], False otherwise

  • all() returns True if all of the elements are True (row-wise axis=1)

So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.

An example:

Dataframe:

>>> df=pd.DataFrame({'A':[11,22,33,np.NaN],
                     'B':['x',np.NaN,np.NaN,'w'],
                     'C':['2016-03-13',np.NaN,'2016-03-14','2016-03-15']})
>>> df
    A    B           C
0  11    x  2016-03-13
1  22  NaN         NaN
2  33  NaN  2016-03-14
3 NaN    w  2016-03-15

isnull mask:

>>> ~pd.isnull(df[list_of_cols])
       B      C
0   True   True
1  False  False
2  False   True
3   True   True

apply all(axis=1) row-wise:

>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0     True
1    False
2     True
3     True
dtype: bool

Boolean selection from dataframe:

>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
    A    B           C
0  11    x  2016-03-13
2  33  NaN  2016-03-14
3 NaN    w  2016-03-15
like image 1
user2314737 Avatar answered Oct 19 '22 20:10

user2314737