I have some data from an experiment, and within each trial there is a single value, surrounded by NAs, that I want to fill out to the entire trial:
import numpy as np
import pandas as pd

df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'cs_name': [np.nan, 'A1', np.nan, np.nan, np.nan, np.nan, 'B2',
                               np.nan, 'A1', np.nan, np.nan, np.nan]})
Out[177]:
cs_name trial
0 NaN 1
1 A1 1
2 NaN 1
3 NaN 1
4 NaN 2
5 NaN 2
6 B2 2
7 NaN 2
8 A1 3
9 NaN 3
10 NaN 3
11 NaN 3
I'm able to fill these values within the whole trial by using both bfill() and ffill(), but I'm wondering if there is a better way to achieve this.
df['cs_name'] = df.groupby('trial')['cs_name'].ffill()
df['cs_name'] = df.groupby('trial')['cs_name'].bfill()
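The two calls can also be chained inside a single groupby, which is essentially the same approach in one pass:

df['cs_name'] = df.groupby('trial')['cs_name'].transform(lambda s: s.ffill().bfill())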
Expected output:
cs_name trial
0 A1 1
1 A1 1
2 A1 1
3 A1 1
4 B2 2
5 B2 2
6 B2 2
7 B2 2
8 A1 3
9 A1 3
10 A1 3
11 A1 3
An alternative approach is to use first_valid_index and a transform:
In [11]: g = df.groupby('trial')
In [12]: g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
Out[12]:
0 A1
1 A1
2 A1
3 A1
4 B2
5 B2
6 B2
7 B2
8 A1
9 A1
10 A1
11 A1
Name: cs_name, dtype: object
This ought to be more efficient than using ffill followed by a bfill...
And use this to change the cs_name column:
df['cs_name'] = g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
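As a side note, if each trial has at most one non-null label (as in the example), a fully vectorized sketch of the same idea uses GroupBy.first, which skips NaN, and maps the result back onto the trial column:

# First non-null cs_name per trial (GroupBy.first skips NaN), broadcast back to every row
df['cs_name'] = df['trial'].map(df.groupby('trial')['cs_name'].first())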
Note: I think it would be a nice enhancement to have a method in pandas to grab the first non-null object; in numpy it's an open request, and I don't think there is currently such a method (I could be wrong!)...
If you want to avoid the error that appears when some groups contain only NaN, you could do the following (note that I changed the df so there are only NaNs for the group having trial=1):
df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1],
                   'cs_name': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'B2', np.nan,
                               'A3', np.nan, np.nan, np.nan, np.nan, np.nan]})
g = df.groupby('trial')
df['cs_name'] = g['cs_name'].transform(lambda s: 'No values to aggregate'
                                       if s.isnull().all()
                                       else s.loc[s.first_valid_index()])
This way you fill in 'No values to aggregate' (or whatever you want) when all values are NaN for a particular group, instead of getting an error.
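If you would rather keep NaN for those all-NaN groups instead of a placeholder string, a small variation on the same transform (returning np.nan when first_valid_index finds nothing) should also work:

# Leave trials with no recorded cs_name as NaN rather than a placeholder string
df['cs_name'] = g['cs_name'].transform(
    lambda s: s.loc[s.first_valid_index()] if s.first_valid_index() is not None else np.nan)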
Hope this helps :)
Federico