I have a dataframe like the one below, with three columns (10 different stimuli, 16 trials, and a data column containing lists of equal length). I would simply like to get the element-wise mean of the data column, grouped by stimulus. Since I have 10 different stimuli, this should result in 10 arrays, one per stimulus, each being the mean of the data arrays over all trials.
I thought about something like this, but it gives me something really weird.
df.groupby('stimulus').apply(np.mean)
>> IndexError: tuple index out of range
import numpy as np
import pandas as pd

trial_vec = np.tile(np.arange(16) + 1, 10)
stimulus_vec = np.repeat([-2., -1.75, -1., -0.75, -0.5, 0.5, 1., 1.25, 1.75, 2.5], 16)
data_vec = np.random.randint(0, 16, size=160)
df = pd.DataFrame({'trial': trial_vec, 'stimulus': stimulus_vec, 'data': data_vec}).astype('object')
df["data"] = [np.random.rand(4).tolist() for i in range(160)]
df
You can convert data in each group to a 2d list, which ensures the object can be converted to a 2d numpy array as long as every cell of the data column has the same number of elements, and then take the mean over axis=0 (column-wise mean):
df.groupby('stimulus').data.apply(lambda g: np.mean(g.values.tolist(), axis=0))
#stimulus
#-2.00 [0.641834320107, 0.427639804593, 0.42733812964...
#-1.75 [0.622484839138, 0.529860126072, 0.63310754064...
#-1.00 [0.546323060494, 0.465573022088, 0.54947320390...
#-0.75 [0.431675052484, 0.367636755052, 0.45263194597...
#-0.50 [0.423135952819, 0.544110613089, 0.55496058720...
# 0.50 [0.421858616927, 0.439204977418, 0.43153540636...
# 1.00 [0.612239664017, 0.499305567037, 0.46284515082...
# 1.25 [0.498544756769, 0.481073640317, 0.43564801829...
# 1.75 [0.51821909334, 0.44904063908, 0.358509374567,...
# 2.50 [0.465606275355, 0.516448419224, 0.33715002349...
#Name: data, dtype: object
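As a quick sanity check (a sketch that rebuilds the random setup from the question), the result should contain one length-4 array per stimulus, each matching a mean computed by hand for that group:

```python
import numpy as np
import pandas as pd

# rebuild the question's setup
trial_vec = np.tile(np.arange(16) + 1, 10)
stimulus_vec = np.repeat([-2., -1.75, -1., -0.75, -0.5, 0.5, 1., 1.25, 1.75, 2.5], 16)
df = pd.DataFrame({'trial': trial_vec, 'stimulus': stimulus_vec})
df['data'] = [np.random.rand(4).tolist() for _ in range(160)]

result = df.groupby('stimulus').data.apply(lambda g: np.mean(g.values.tolist(), axis=0))

# 10 groups, each reduced to a single length-4 array
assert len(result) == 10
assert result.iloc[0].shape == (4,)

# cross-check one group against a hand-computed mean over its 16 trials
manual = np.mean(df.loc[df['stimulus'] == -2.0, 'data'].tolist(), axis=0)
assert np.allclose(result.loc[-2.0], manual)
```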
Or stack the data as a 2d array, and then take the mean over axis=0:
df.groupby('stimulus').data.apply(lambda g: np.mean(np.stack(g), axis=0))
Edit: if you have NaNs in the data column, you can use np.nanmean to calculate the mean while ignoring NaNs:
df.groupby('stimulus').data.apply(lambda g: np.nanmean(np.stack(g), axis=0))
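The difference matters because np.mean propagates NaN: a single NaN in any trial makes that element of the group mean NaN, while np.nanmean averages over the remaining values. A minimal illustration:

```python
import numpy as np

a = np.array([[1.0, np.nan],
              [3.0, 4.0]])

# np.mean propagates the NaN in the second column
print(np.mean(a, axis=0))     # [ 2. nan]

# np.nanmean averages over the non-NaN values instead
print(np.nanmean(a, axis=0))  # [2. 4.]
```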
This is actually a rare use case for a grouper that is not a column in the current DataFrame.
df['data'].apply(pd.Series).groupby(df['stimulus']).mean()
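This works because groupby also accepts an index-aligned Series as the grouper: the expanded frame produced by df['data'].apply(pd.Series) no longer contains a stimulus column, but df['stimulus'] still lines up with it by index. A small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'stimulus': [-2., -2., 0.5, 0.5]})
df['data'] = [[1., 2.], [3., 4.], [5., 6.], [7., 8.]]

# expand each list into its own column (columns 0 and 1)
expanded = df['data'].apply(pd.Series)

# group the expanded frame by the index-aligned Series from the original frame
means = expanded.groupby(df['stimulus']).mean()
print(means)
#             0    1
# stimulus
# -2.0      2.0  3.0
#  0.5      6.0  7.0
```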
I'm not sure exactly what you are trying to do, but you typically should not store lists inside a dataframe. I would properly format your data first and then take the mean of each column by group.
data_proper = df['data'].apply(pd.Series)
df_new = pd.concat([df.drop('data',axis=1), data_proper], axis=1)
df_new.head()
stimulus trial 0 1 2 3
0 -2 1 0.046361 0.967723 0.707726 0.708462
1 -2 2 0.270566 0.778324 0.638878 0.276983
2 -2 3 0.261356 0.563411 0.639114 0.111150
3 -2 4 0.124745 0.532362 0.869781 0.142513
4 -2 5 0.707596 0.137417 0.493232 0.098975
df_new.groupby('stimulus').mean()
0 1 2 3
stimulus
-2.00 0.516795 0.458579 0.527230 0.360560
-1.75 0.418950 0.497287 0.442577 0.518487
-1.00 0.569175 0.350724 0.429025 0.562950
-0.75 0.474533 0.517560 0.472101 0.658333
-0.50 0.481185 0.426829 0.414059 0.571252
0.50 0.432719 0.563101 0.421617 0.531289
1.00 0.478947 0.412383 0.458543 0.590503
1.25 0.596648 0.520953 0.515184 0.513206
1.75 0.492729 0.524673 0.567336 0.465172
2.50 0.369798 0.540603 0.499210 0.605297
Or in one chained expression, inspired by @Scott Boston:
df.drop('data', axis=1)\
.assign(**df.data.apply(pd.Series).add_prefix('col'))\
.groupby('stimulus').mean()
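If you then need the separate per-stimulus arrays the question asks for, the rows of the grouped result can be pulled back out as NumPy arrays. A sketch with a smaller made-up setup; note that trial is dropped first so only the data columns are averaged:

```python
import numpy as np
import pandas as pd

# a smaller version of the setup: 2 stimuli, 2 trials, lists of length 3
df = pd.DataFrame({'stimulus': [-2., -2., 0.5, 0.5], 'trial': [1, 2, 1, 2]})
df['data'] = [[1., 2., 3.], [3., 4., 5.], [0., 0., 0.], [2., 2., 2.]]

data_proper = df['data'].apply(pd.Series)
df_new = pd.concat([df.drop('data', axis=1), data_proper], axis=1)

# drop 'trial' so it is not averaged along with the data columns
means = df_new.drop('trial', axis=1).groupby('stimulus').mean()

# one array per stimulus
arrays = {stim: row.to_numpy() for stim, row in means.iterrows()}
print(arrays[-2.0])  # [2. 3. 4.]
```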