
pandas: Filling missing values within a group

Tags: python, pandas

I have some data from an experiment, and within each trial there are some single values, surrounded by NA's, that I want to fill out to the entire trial:

import numpy as np
import pandas as pd

df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'cs_name': [np.nan, 'A1', np.nan, np.nan, np.nan, np.nan, 'B2',
                               np.nan, 'A1', np.nan, np.nan, np.nan]})
Out[177]: 
   cs_name  trial
0      NaN      1
1       A1      1
2      NaN      1
3      NaN      1
4      NaN      2
5      NaN      2
6       B2      2
7      NaN      2
8       A1      3
9      NaN      3
10     NaN      3
11     NaN      3

I'm able to fill these values within the whole trial by using both bfill() and ffill(), but I'm wondering if there is a better way to achieve this.

df['cs_name'] = df.groupby('trial')['cs_name'].ffill()
df['cs_name'] = df.groupby('trial')['cs_name'].bfill()

Expected output:

   cs_name  trial
0       A1      1
1       A1      1
2       A1      1
3       A1      1
4       B2      2
5       B2      2
6       B2      2
7       B2      2
8       A1      3
9       A1      3
10      A1      3
11      A1      3
Marius, asked Aug 16 '13 04:08



2 Answers

An alternative approach is to use first_valid_index and a transform:

In [11]: g = df.groupby('trial')

In [12]: g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
Out[12]: 
0     A1
1     A1
2     A1
3     A1
4     B2
5     B2
6     B2
7     B2
8     A1
9     A1
10    A1
11    A1
Name: cs_name, dtype: object

This ought to be more efficient than using ffill followed by bfill.

To change the cs_name column, assign the result back:

df['cs_name'] = g['cs_name'].transform(lambda s: s.loc[s.first_valid_index()])
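
One caveat that may be worth checking against your own data: the two approaches only agree when each trial holds a single distinct non-null value, as in the question. A minimal sketch, using a made-up frame (demo is hypothetical, not from the question) where one trial contains both 'A1' and 'B2':

```python
import numpy as np
import pandas as pd

# Hypothetical data: trial 1 holds two different non-null values.
demo = pd.DataFrame({'trial': [1, 1, 1, 1],
                     'cs_name': [np.nan, 'A1', np.nan, 'B2']})
g = demo.groupby('trial')['cs_name']

# ffill then bfill within each group: each gap takes the nearest value
# above it, and leading gaps take the first value below.
filled = g.ffill()
filled = filled.groupby(demo['trial']).bfill()

# Broadcast the first valid value of each group to every row.
first = g.transform(lambda s: s.loc[s.first_valid_index()])

print(filled.tolist())  # ['A1', 'A1', 'A1', 'B2']
print(first.tolist())   # ['A1', 'A1', 'A1', 'A1']
```

So the transform is a drop-in replacement for ffill plus bfill only under the question's assumption of one value per trial.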

Note: I think it would be a nice enhancement to have a method that grabs the first non-null object in pandas. In numpy it's an open request, and I don't think there is currently such a method (I could be wrong!).
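
For what it's worth, in reasonably recent pandas versions the groupby first() aggregation skips nulls by default, so passing 'first' to transform appears to give the same result without a lambda. A short sketch on the question's data, assuming a pandas version where first() behaves this way:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
                   'cs_name': [np.nan, 'A1', np.nan, np.nan, np.nan, np.nan, 'B2',
                               np.nan, 'A1', np.nan, np.nan, np.nan]})

# first() returns the first non-null entry of each group;
# transform('first') broadcasts it back onto the original rows.
df['cs_name'] = df.groupby('trial')['cs_name'].transform('first')

print(df['cs_name'].tolist())
```

Because this goes through a built-in aggregation rather than a Python-level lambda, it is likely to be faster on large frames, though that is worth timing on your own data.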

Andy Hayden, answered Oct 16 '22 17:10


If you want to avoid the error raised when some groups contain only NaN, you could do the following (note that I changed df so that the group with trial=1 contains only NaN):

df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1],
                   'cs_name': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'B2', np.nan,
                               'A3', np.nan, np.nan, np.nan, np.nan, np.nan]})

g = df.groupby('trial')

# Use a placeholder string for any group that has no non-null value
g['cs_name'].transform(lambda s: 'No values to aggregate' if
    pd.isnull(s).all() else s.loc[s.first_valid_index()])

# Assign the result back to the column
df['cs_name'] = g['cs_name'].transform(lambda s: 'No values to aggregate' if
    pd.isnull(s).all() else s.loc[s.first_valid_index()])

This way, any group that contains only NaN gets 'No values to aggregate' (or whatever placeholder you choose) instead of raising an error.
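
If you would rather leave all-NaN groups as NaN than write a placeholder string into the column, the same idea can be wrapped in a small helper. This is only a sketch: first_valid and its default parameter are names made up here, not part of pandas.

```python
import numpy as np
import pandas as pd

def first_valid(s, default=np.nan):
    """Return the first non-null value of s, or default if s is all null."""
    idx = s.first_valid_index()  # None when the group has no valid value
    return default if idx is None else s.loc[idx]

df = pd.DataFrame({'trial': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1],
                   'cs_name': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 'B2', np.nan,
                               'A3', np.nan, np.nan, np.nan, np.nan, np.nan]})

# The scalar returned for each group is broadcast to all of its rows.
df['cs_name'] = df.groupby('trial')['cs_name'].transform(first_valid)

print(df)
```

Keeping NaN means later calls such as isna() or dropna() still see those trials as missing, which a string placeholder would hide.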

Hope this helps :)

Federico

Federico De Cillia, answered Oct 16 '22 17:10