Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Element-wise mean of lists in a dataframe conditioned on another column

I have a dataframe looking like this with three columns (10 different stimuli, 16 trials and a data column containing lists of equal lengths). I would simply like to get the element-wise mean of the data column based on the stimulus. As I have 10 different stimuli, it should result in 10 arrays for each stimulus which also are the mean of all data arrays over trials.

enter image description here

I thought about something like this but it gives me somewthing really weird.

df.groupby('stimulus').apply(np.mean)
>> IndexError: tuple index out of range

enter image description here


Build my dataframe

trial_vec       = np.tile(np.arange(16)+1, 10)     
stimulus_vec    = np.repeat([-2., -1.75, -1., -0.75, -0.5,  0.5,  1.,  1.25,  1.75,  2.5 ], 16)                  
data_vec        = np.random.randint(0, 16, size=160)
df              = pd.DataFrame({'trial': trial_vec, 'stimulus': stimulus_vec, 'data': data_vec}).astype('object')
df["data"]      = [np.random.rand(4).tolist() for i in range(160)]
df
like image 418
Svenno Nito Avatar asked Jan 29 '23 12:01

Svenno Nito


2 Answers

You can convert data in each group to a 2d list which makes sure the object can be converted to a 2d numpy array when the number of elements in each cell of the data column are the same, and then take mean over axis=0 (column-wise mean):

df.groupby('stimulus').data.apply(lambda g: np.mean(g.values.tolist(), axis=0))

#stimulus
#-2.00    [0.641834320107, 0.427639804593, 0.42733812964...
#-1.75    [0.622484839138, 0.529860126072, 0.63310754064...
#-1.00    [0.546323060494, 0.465573022088, 0.54947320390...
#-0.75    [0.431675052484, 0.367636755052, 0.45263194597...
#-0.50    [0.423135952819, 0.544110613089, 0.55496058720...
# 0.50    [0.421858616927, 0.439204977418, 0.43153540636...
# 1.00    [0.612239664017, 0.499305567037, 0.46284515082...
# 1.25    [0.498544756769, 0.481073640317, 0.43564801829...
# 1.75    [0.51821909334, 0.44904063908, 0.358509374567,...
# 2.50    [0.465606275355, 0.516448419224, 0.33715002349...
#Name: data, dtype: object

Or stack data as a 2d array, and then take mean over axis=0:

df.groupby('stimulus').data.apply(lambda g: np.mean(np.stack(g), axis=0))

Edit: if you have nans in the data column, you can use np.nanmean to calculate mean without nans:

df.groupby('stimulus').data.apply(lambda g: np.nanmean(np.stack(g), axis=0))
like image 89
Psidom Avatar answered Feb 14 '23 16:02

Psidom


An update

This is actually a rare usecase for a grouper not in the current DataFrame.

df['data'].apply(pd.Series).groupby(df['stimulus']).mean()

Original

I'm not sure what exactly you are trying to do but you typically should not have lists in your dataframe. I would properly format your data first and then take the mean of each column by group.

data_proper = df['data'].apply(pd.Series)
df_new = pd.concat([df.drop('data',axis=1), data_proper], axis=1)
df_new.head()

  stimulus trial         0         1         2         3
0       -2     1  0.046361  0.967723  0.707726  0.708462
1       -2     2  0.270566  0.778324  0.638878  0.276983
2       -2     3  0.261356  0.563411  0.639114  0.111150
3       -2     4  0.124745  0.532362  0.869781  0.142513
4       -2     5  0.707596  0.137417  0.493232  0.098975

df_new.groupby('stimulus').mean()

                 0         1         2         3
stimulus                                        
-2.00     0.516795  0.458579  0.527230  0.360560
-1.75     0.418950  0.497287  0.442577  0.518487
-1.00     0.569175  0.350724  0.429025  0.562950
-0.75     0.474533  0.517560  0.472101  0.658333
-0.50     0.481185  0.426829  0.414059  0.571252
 0.50     0.432719  0.563101  0.421617  0.531289
 1.00     0.478947  0.412383  0.458543  0.590503
 1.25     0.596648  0.520953  0.515184  0.513206
 1.75     0.492729  0.524673  0.567336  0.465172
 2.50     0.369798  0.540603  0.499210  0.605297

Or in one continuous line inspired by @Scott Boston

df.drop('data', axis=1)\
  .assign(**df.data.apply(pd.Series).add_prefix('col'))\
  .groupby('stimulus').mean()
like image 34
Ted Petrou Avatar answered Feb 14 '23 16:02

Ted Petrou