I want to select columns from a DataFrame according to a particular condition. I know it can be done with a loop, but my df is very large so efficiency is crucial. The condition for column selection is having either only non-nan entries or a sequence of only nans followed by a sequence of only non-nan entries.
Here is an example. Consider the following DataFrame:
df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan], [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])
   0    1    2    3
0  1  NaN  2.0  NaN
1  2  NaN  5.0  NaN
2  4  8.0  NaN  1.0
3  3  2.0  NaN  2.0
4  3  2.0  5.0  NaN
From it, I would like to select only columns 0 and 1. Any advice on how to do this efficiently without looping?
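Logic: count the NaNs in each column and find each column's first valid index. If every NaN sits at the top of the column, the index label at position cnull is exactly that first valid index.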
cnull = df.isnull().sum()
fvald = df.apply(pd.Series.first_valid_index)
cols = df.index[cnull] == fvald
df.loc[:, cols]
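To see why the comparison works, here is a small sketch that runs the same steps on the question's DataFrame and exposes the intermediate values. The `.to_numpy()` calls are an addition here, so that both sides of the comparison are plain arrays. A column with no valid entries at all is not covered by this sketch.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

# Number of NaNs in each column.
cnull = df.isnull().sum()
# Index label of the first non-NaN entry in each column.
fvald = df.apply(pd.Series.first_valid_index)

# If all NaNs sit on top, the label at position `cnull` is the first valid one.
cols = df.index[cnull.to_numpy()] == fvald.to_numpy()
selected = df.loc[:, cols]

print(cnull.tolist())   # NaN counts per column
print(fvald.tolist())   # first valid label per column
print(list(selected.columns))
```

Columns 2 and 3 fail the test because their NaN count does not match the position of their first valid entry: some of their NaNs come after valid data.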

Edited with speed improvements
Old answer:
def pir1(df):
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    return df.loc[:, cols]
Much faster answer using the same logic:
def pir2(df):
    nulls = np.isnan(df.values)               # boolean array, True where NaN
    null_count = nulls.sum(0)                 # NaNs per column
    first_valid = nulls.argmin(0)             # position of first non-NaN per column
    null_on_top = null_count == first_valid   # True when all NaNs precede the data
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)
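A quick check of the NumPy version on the question's data, with one extra all-NaN column (column 4, an edge case added here that is not in the question). The function body is copied from the answer; only the test frame is new.

```python
import numpy as np
import pandas as pd

def pir2(df):
    nulls = np.isnan(df.values)
    null_count = nulls.sum(0)
    first_valid = nulls.argmin(0)
    null_on_top = null_count == first_valid
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)

# Question's data plus a hypothetical fifth column that is entirely NaN.
df = pd.DataFrame([[1, np.nan, 2, np.nan, np.nan],
                   [2, np.nan, 5, np.nan, np.nan],
                   [4, 8, np.nan, 1, np.nan],
                   [3, 2, np.nan, 2, np.nan],
                   [3, 2, 5, np.nan, np.nan]])

result = pir2(df)
print(list(result.columns))
print(result.dtypes)
```

For the all-NaN column, `argmin` returns 0 while the NaN count is 5, so the column is dropped. Because the function works on `df.values`, the result comes back as float64 even for the integer column 0, and `np.isnan` will raise on object-dtype data.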

Consider a DataFrame that has NaNs in various possible locations:

1. NaNs allowed on either side (top or bottom):
Create a mask by replacing all NaNs with 0s and finite values with 1s:
mask = np.where(np.isnan(df), 0, 1)
Take the element-wise difference down each column, then take the absolute value of the result. Each nonzero entry marks a switch between NaN and finite values. A column whose differences contain all three values (-1, 1 and 0) switches at least twice, so its sequence is broken and the column should be discarded.
Summing the absolute differences counts the switches, so keep the columns whose sum is less than 2. A sum of 2 or more means the NaN run and the finite run are not contiguous, and those columns must be discarded.
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).abs().sum().lt(2)
Finally, use this condition to select the columns. The result keeps only columns with NaNs in one portion and finite values in the other (or with no switch at all).
df.loc[:, criteria]
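The transition-counting idea can be tried on a small frame. The DataFrame and its column names below are made up for illustration and are not from the original answer. Each name describes where the NaNs sit.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Hypothetical test frame: one column per NaN layout.
df = pd.DataFrame({
    "all_valid":     [1.0, 2.0, 3.0, 4.0, 5.0],
    "nan_top":       [nan, nan, 3.0, 4.0, 5.0],
    "nan_bottom":    [1.0, 2.0, 3.0, nan, nan],
    "nan_middle":    [1.0, nan, nan, 4.0, 5.0],
    "nan_both_ends": [nan, 2.0, 3.0, 4.0, nan],
})

# 0 where NaN, 1 where finite.
mask = np.where(np.isnan(df), 0, 1)

# Number of NaN/finite switches in each column.
transitions = pd.DataFrame(mask, columns=df.columns).diff(1).abs().sum()

# Keep columns with at most one switch.
criteria = transitions.lt(2)
selected = df.loc[:, criteria]

print(transitions)
print(list(selected.columns))
```

Note that a column that is entirely NaN has no switches either, so this test would keep it. If that is not wanted, it needs a separate check.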

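2. NaNs present only on top:
Here a column is rejected as soon as its mask drops from 1 to 0, that is, as soon as a difference of -1 appears.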
mask = np.where(np.isnan(df), 0, 1)
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()
df.loc[:, criteria]
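A sketch of the "NaNs on top" test on the question's DataFrame. It builds the mask with `df.notna().astype(int)`, which is a substitution made here for the `np.where` mask and should give the same 0/1 values. A column qualifies when its mask never steps from 1 down to 0.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

# 1 where finite, 0 where NaN, then row-to-row differences per column.
steps = df.notna().astype(int).diff(1)

# A step of -1 means a finite value is followed by a NaN.
# The reduction must be all(): the first row of diff is NaN, which is
# never equal to -1, so any() would accept every column.
criteria = steps.ne(-1).all()
selected = df.loc[:, criteria]

print(list(selected.columns))
```

This matches the selection asked for in the question: columns 2 and 3 each contain a finite value followed by a NaN and are dropped.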
