Conditional column selection in pandas

Question

I want to select columns from a DataFrame according to a particular condition. I know it can be done with a loop, but my df is very large, so efficiency is crucial. A column should be selected if it has either only non-NaN entries, or a run of only NaNs followed by a run of only non-NaN entries.

Here is an example. Consider the following DataFrame:

df = pd.DataFrame([[1, np.nan, 2, np.nan], [2, np.nan, 5, np.nan], [4, 8, np.nan, 1], [3, 2, np.nan, 2], [3, 2, 5, np.nan]])

   0    1    2    3
0  1  NaN  2.0  NaN
1  2  NaN  5.0  NaN
2  4  8.0  NaN  1.0
3  3  2.0  NaN  2.0
4  3  2.0  5.0  NaN

From it, I would like to select only columns 0 and 1. Any advice on how to do this efficiently without looping?

piRSquared · Accepted Answer

logic

Count the nulls in each column. If the only nulls are at the beginning, then the number of nulls in the column should equal the position of the first valid index.
Get the first valid index of each column.
Slice the index by the null counts and compare against the first valid indices. Where they are equal, that is a good column.

cnull = df.isnull().sum()
fvald = df.apply(pd.Series.first_valid_index)
cols = df.index[cnull] == fvald
df.loc[:, cols]
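To make the comparison concrete, here is a small sketch that runs the same three steps on the DataFrame from the question and keeps the intermediate values. It assumes the default 0..n-1 row index (so labels and positions coincide) and that no column is entirely NaN. The name `expected_first` is introduced here for illustration only, and the final comparison is done on plain NumPy arrays to keep the intermediate values easy to inspect.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

# Number of NaNs in each column: 0, 2, 2, 3
cnull = df.isnull().sum()

# Row label of the first non-NaN entry in each column: 0, 2, 0, 2
fvald = df.apply(pd.Series.first_valid_index)

# If all NaNs sit on top, the first valid row is exactly the row at
# position "number of NaNs" (illustrative helper name).
expected_first = df.index[cnull.to_numpy()]

# Columns where both agree are the ones to keep.
cols = expected_first.to_numpy() == fvald.to_numpy()
selected = df.loc[:, cols]

print(selected.columns.tolist())  # [0, 1]
```

Columns 2 and 3 drop out because their first valid row (0 and 2) does not match their NaN count (2 and 3).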


Edited with speed improvements

old answer

def pir1(df):
    cnull = df.isnull().sum()
    fvald = df.apply(pd.Series.first_valid_index)
    cols = df.index[cnull] == fvald
    return df.loc[:, cols]

much faster answer using same logic

def pir2(df):
    nulls = np.isnan(df.values)
    null_count = nulls.sum(0)
    first_valid = nulls.argmin(0)
    null_on_top = null_count == first_valid
    filtered_data = df.values[:, null_on_top]
    filtered_columns = df.columns.values[null_on_top]
    return pd.DataFrame(filtered_data, df.index, filtered_columns)
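The NumPy version relies on `argmin` over a boolean array returning the position of the first `False`, that is, the first non-NaN row in each column. The sketch below shows the intermediate arrays on the DataFrame from the question, plus one edge case. The `all_nan` array is a made-up illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

nulls = np.isnan(df.values)      # True where the entry is NaN
null_count = nulls.sum(0)        # NaNs per column: [0, 2, 2, 3]
first_valid = nulls.argmin(0)    # first False per column: [0, 2, 0, 2]
null_on_top = null_count == first_valid

print(null_on_top.tolist())      # [True, True, False, False]

# Edge case (illustrative): in an all-NaN column argmin returns 0 while
# the NaN count is the column length, so the column is not selected.
all_nan = np.isnan(np.full((5, 1), np.nan))
all_nan_kept = bool((all_nan.sum(0) == all_nan.argmin(0))[0])
print(all_nan_kept)              # False
```

Since `df.values` is a single float array for this frame, the result of `pir2` comes back with float columns even where the input column held integers.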


Nickil Maveli · Answer

Consider a DataFrame that has NaNs in various possible locations:

1. NaNs present on either side:

Create a mask by replacing all NaNs with 0 and finite values with 1:

mask = np.where(np.isnan(df), 0, 1)

Take the element-wise difference of the mask down each column, then take the absolute value of the result. Each difference is -1, 0 or 1, and after taking the absolute value every 1 marks a switch between NaN and finite values. A column whose differences contain all three values (-1, 1 and 0) has a break in the sequence and should be discarded.

The idea is to sum the absolute differences and keep the columns where the sum is less than 2. In the extreme case (absolute differences 1, 1, 0) the sum is 2, so those columns are certainly disjoint and must be discarded.

criteria = pd.DataFrame(mask, columns=df.columns).diff(1).abs().sum().lt(2)

Finally, use this condition to select the columns, which gives the desired result: columns with NaNs only in one portion and finite values in the other.

df.loc[:, criteria]
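As a rough illustration of the difference-based criterion, the sketch below applies it to the DataFrame from the question and keeps the per-column sums. The names `steps` and `transitions` are introduced here for readability and do not appear in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

# 0 where NaN, 1 where finite
mask = np.where(np.isnan(df.to_numpy()), 0, 1)

# Absolute row-to-row difference: 1 wherever the column switches
# between NaN and finite, 0 otherwise (first row is NaN and is skipped by sum)
steps = pd.DataFrame(mask, columns=df.columns).diff(1).abs()

# Number of switches per column: 0, 1, 2, 2
transitions = steps.sum()

# Keep columns with at most one switch
criteria = transitions.lt(2)
print(df.loc[:, criteria].columns.tolist())  # [0, 1]
```

Columns 2 and 3 each switch twice, so their sums reach 2 and they are dropped.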

2. NaNs present on top:

mask = np.where(np.isnan(df), 0, 1)
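# A difference of -1 means a finite value is followed by a NaN, so keep only columns with no -1 at all
criteria = pd.DataFrame(mask, columns=df.columns).diff(1).ne(-1).all()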
df.loc[:, criteria]
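The two cases differ only for columns whose NaNs sit at the bottom. The sketch below adds a hypothetical fifth column (label 4, finite values followed by NaNs) to the DataFrame from the question to show this. A column has its NaNs only on top exactly when the mask never steps from 1 down to 0, that is, when no difference equals -1. The names `either_side` and `on_top` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2, np.nan],
                   [2, np.nan, 5, np.nan],
                   [4, 8, np.nan, 1],
                   [3, 2, np.nan, 2],
                   [3, 2, 5, np.nan]])

# Hypothetical extra column: finite values first, NaNs at the bottom
df2 = df.copy()
df2[4] = [7, 7, np.nan, np.nan, np.nan]

mask = np.where(np.isnan(df2.to_numpy()), 0, 1)
steps = pd.DataFrame(mask, columns=df2.columns).diff(1)

# Case 1: at most one switch between NaN and finite, on either side
either_side = steps.abs().sum().lt(2)

# Case 2: no step from finite (1) down to NaN (0) anywhere in the column
on_top = ~steps.eq(-1).any()

print(either_side.tolist())  # [True, True, False, False, True]
print(on_top.tolist())       # [True, True, False, False, False]
```

Column 4 passes the first test because it has a single switch, but fails the second because that switch goes from finite to NaN.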
