I have a pandas DataFrame with at least 4 non-NaN values in each row, but they are located in different columns:
Index Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
1991-12-31 100.000 100.000 100.000 89.123 NaN NaN NaN NaN
1992-01-31 98.300 101.530 100.000 NaN 92.342 NaN NaN NaN
1992-02-29 NaN 100.230 98.713 97.602 NaN NaN NaN NaN
1992-03-31 NaN NaN 102.060 93.473 98.123 NaN NaN NaN
1992-04-30 NaN 102.205 107.755 94.529 94.529 NaN NaN NaN
(Only the first 8 columns are shown.) I would like to turn this into a DataFrame with 4 columns per row, where each row contains only the first four (reading from left to right) non-NaN values for that date. The order within each row matters.
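For reference, the frame above can be rebuilt with something like this (a minimal sketch; the all-NaN Col6-Col8 stand in for the columns not shown, and Index is kept as a regular column):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Index': pd.to_datetime(['1991-12-31', '1992-01-31', '1992-02-29',
                             '1992-03-31', '1992-04-30']),
    'Col1': [100.000, 98.300, np.nan, np.nan, np.nan],
    'Col2': [100.000, 101.530, 100.230, np.nan, 102.205],
    'Col3': [100.000, 100.000, 98.713, 102.060, 107.755],
    'Col4': [89.123, np.nan, 97.602, 93.473, 94.529],
    'Col5': [np.nan, 92.342, np.nan, 98.123, 94.529],
    'Col6': [np.nan] * 5,
    'Col7': [np.nan] * 5,
    'Col8': [np.nan] * 5,
})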
Approach #1: Here's a NumPy solution using justify -
pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
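Note that justify isn't a NumPy or pandas built-in. A minimal sketch of such a helper, matching the call signature used above (push the valid entries of each row to one side, in order, and pad the rest with invalid_val), could look like this:
def justify(a, invalid_val=0, axis=1, side='left'):
    # Mask of valid entries
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    # Sorting the mask pushes True (valid) to the end; flip for left/up justification
    justified_mask = np.sort(mask, axis=axis)
    if side in ('up', 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]        # row-major assignment keeps row order
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out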
Sample run -
In [211]: df
Out[211]:
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8
Index
1991-12-31 100.0 100.000 100.000 89.123 NaN NaN NaN NaN
1992-01-31 98.3 101.530 100.000 NaN 92.342 NaN NaN NaN
1992-02-29 NaN 100.230 98.713 97.602 NaN NaN NaN NaN
1992-03-31 NaN NaN 102.060 93.473 98.123 NaN NaN NaN
1992-04-30 NaN 102.205 107.755 94.529 94.529 NaN NaN NaN
In [212]: pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
Out[212]:
0 1 2 3
0 100.000 100.000 100.000 89.123
1 98.300 101.530 100.000 92.342
2 100.230 98.713 97.602 NaN
3 102.060 93.473 98.123 NaN
4 102.205 107.755 94.529 94.529
Approach #2: Using a tailor-made function based on masks -
def app2(df, N=4):
    a = df.values
    # Mask of NaN positions
    mask = df.isnull().values
    # Sorting each row of the mask pushes False (valid) to the front,
    # so ~mask_sorted marks the left-justified target slots
    mask_sorted = np.sort(mask, axis=1)
    # Start from an all-NaN output so short rows stay NaN-padded
    out = np.full(a.shape, np.nan)
    out[~mask_sorted] = a[~mask]
    return pd.DataFrame(out[:, :N])
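To see why the sorted-mask trick keeps the left-to-right order, here's a toy example (values invented purely for illustration):
a = np.array([[np.nan, 1., np.nan, 2.],
              [3., np.nan, 4., 5.]])
mask = np.isnan(a)
mask_sorted = np.sort(mask, axis=1)   # False (valid) first in every row
out = np.full(a.shape, np.nan)
out[~mask_sorted] = a[~mask]          # row-major assignment preserves each row's order
# out is now [[1., 2., nan, nan],
#             [3., 4., 5., nan]]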
Runtime test for working solutions that keep order -
# Using df from the posted question to recreate a bigger one:
df = df.set_index('Index')
df = pd.concat([df] * 10000, ignore_index=True)
In [298]: %timeit app2(df)
100 loops, best of 3: 4.06 ms per loop
In [299]: %timeit pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=1, side='left')[:,:4])
100 loops, best of 3: 4.78 ms per loop
In [300]: %timeit df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
1 loop, best of 3: 4.05 s per loop
If order isn't important, you can sort each row with np.sort (axis=1); NaNs sort to the end.
df = df.set_index('Index') # ignore if `Index` already is the index
pd.DataFrame(np.sort(df.values, axis=1)[:, :4],
             columns=np.arange(1, 5)).add_prefix('Col')
Col1 Col2 Col3 Col4
0 89.123 100.000 100.000 100.000
1 92.342 98.300 100.000 101.530
2 97.602 98.713 100.230 NaN
3 93.473 98.123 102.060 NaN
4 94.529 94.529 102.205 107.755
This is much faster than my second solution below, so if losing the order is acceptable, definitely consider it.
If order matters, call sorted via apply with key=np.isnan (NaNs sort after valid values, and since Python's sort is stable the valid values keep their relative order), then take the first 4 columns of the result.
df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
Col1 Col2 Col3 Col4
Index
1991-12-31 100.000 100.000 100.000 89.123
1992-01-31 98.300 101.530 100.000 92.342
1992-02-29 100.230 98.713 97.602 NaN
1992-03-31 102.060 93.473 98.123 NaN
1992-04-30 102.205 107.755 94.529 94.529
Timings
Here are timings for just my answers -
df = pd.concat([df] * 10000, ignore_index=True)
%timeit df.apply(sorted, key=np.isnan, axis=1).iloc[:, :4]
1 loop, best of 3: 8.45 s per loop
%timeit pd.DataFrame(np.sort(df.values, axis=1)[:, :4], columns=np.arange(1, 5)).add_prefix('Col')
100 loops, best of 3: 4.76 ms per loop