I have a movie dataframe with movie names, their respective genre, and vector representation (numpy arrays).
ID  Year    Title                         Genre             Word Vector
1   2003.0  Dinosaur Planet               Documentary       [-0.55423898, -0.72544044, 0.33189204, -0.1720...
2   2004.0  Isle of Man TT 2004 Review    Sports & Fitness  [-0.373265237, -1.07549703, -0.469254494, -0.4...
3   1997.0  Character                     Foreign           [-1.57682264, -0.91265768, 2.43038678, -0.2114...
4   1994.0  Paula Abdul's Get Up & Dance  Sports & Fitness  [0.3096168, -0.57186663, 0.39008939, 0.2868615...
5   2004.0  The Rise and Fall of ECW      Sports & Fitness  [0.17175879, -2.38005066, -0.45771399, 1.32608...
I'd like to group by genre and get each genre's average vector representation (the component-wise average of all the movie vectors in that genre).
I first tried:
movie_df.groupby(['Genre']).mean()
But the built-in mean function isn't able to take the mean of numpy arrays.
I tried creating my own function to do so and then apply it to each group, but I'm not sure this is using apply correctly:
def vector_average(group):
    series_to_array = np.array(group.tolist())
    return np.mean(series_to_array, axis=0)
movie_df.groupby(['Genre']).apply(vector_average)
Any pointers would be appreciated!
Pandas Groupby Mean: To get the average (or mean) value in each group, apply the pandas mean() function directly to the selected columns of the groupby result.
To calculate mean values grouped on another column in pandas, use groupby and then apply the mean() method, which calculates the average of the values passed to it.
To get the average (mean) of a column in a pandas DataFrame, use either the mean() or the describe() method. DataFrame.mean() returns the mean of the values along the requested axis.
Grouping by Multiple Columns: You can do this by passing a list of column names to groupby instead of a single string value.
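As a rough illustration of the notes above, here is a minimal sketch of the built-in mean() on a scalar numeric column, grouped first by one column and then by a list of columns. The `ratings` frame, its `Rating` column, and all of its values are made up for this example and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy data; the "Rating" column and all values are assumptions.
ratings = pd.DataFrame({
    "Genre": ["Documentary", "Sports", "Sports", "Foreign"],
    "Year": [2003, 2004, 1994, 1997],
    "Rating": [3.5, 4.0, 2.0, 4.5],
})

# Mean of one scalar column per genre.
mean_by_genre = ratings.groupby("Genre")["Rating"].mean()

# Grouping by several columns: pass a list of column names.
mean_by_genre_year = ratings.groupby(["Genre", "Year"])["Rating"].mean()

print(mean_by_genre)
print(mean_by_genre_year)
```

This works because each cell holds a single number. A column whose cells hold whole numpy arrays, as in the question, needs a different approach.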
If I understand correctly, to get the component-wise averages you can simply apply np.mean to the 'Word Vector' SeriesGroupBy explicitly.
df.groupby('Genre')['Word Vector'].apply(np.mean)
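If passing np.mean directly feels too implicit, one possible alternative is to stack each group's vectors into a 2-D array and average along axis 0. The sketch below assumes every vector has the same length; the small frame and its values are invented for illustration, and `vector_average` here is a variant of the asker's helper, not a pandas function.

```python
import numpy as np
import pandas as pd

def vector_average(vectors: pd.Series) -> np.ndarray:
    """Component-wise mean of a Series whose cells are equal-length arrays."""
    # Stack the arrays into shape (n_movies, n_dims), then average the rows.
    stacked = np.stack(vectors.to_list())
    return stacked.mean(axis=0)

# Hypothetical toy data with 3-dimensional vectors.
df = pd.DataFrame({
    "Genre": ["A", "B", "B"],
    "Word Vector": [
        np.array([1.0, 2.0, 3.0]),
        np.array([0.0, 4.0, 8.0]),
        np.array([2.0, 6.0, 10.0]),
    ],
})

# Select the vector column first so the function receives a Series per group.
genre_means = df.groupby("Genre")["Word Vector"].apply(vector_average)
print(genre_means)
```

The key difference from the attempt in the question is the column selection before apply: without it, each group is passed in as a whole DataFrame rather than as the Series of vectors.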
Demo
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'Title': list('ABCDEFGHIJ'),
...                    'Genre': list('ABCBBDCDED'),
...                    'Word Vector': [np.random.randint(0, 10, 10)
...                                    for _ in range(len('ABCDEFGHIJ'))]})
>>> df
Genre Title Word Vector
0 A A [3, 6, 8, 0, 4, 8, 1, 4, 0, 1]
1 B B [5, 4, 4, 4, 8, 7, 4, 3, 7, 2]
2 C C [1, 7, 6, 7, 3, 3, 8, 1, 8, 1]
3 B D [0, 4, 6, 7, 1, 5, 5, 0, 6, 7]
4 B E [8, 2, 1, 4, 1, 2, 0, 4, 9, 1]
5 D F [7, 9, 7, 8, 8, 7, 2, 9, 1, 3]
6 C G [0, 7, 1, 9, 6, 2, 1, 0, 3, 7]
7 D H [4, 7, 9, 4, 1, 5, 0, 3, 0, 6]
8 E I [5, 1, 5, 1, 8, 1, 1, 4, 5, 6]
9 D J [7, 9, 0, 1, 8, 3, 8, 8, 1, 0]
>>> df.groupby('Genre')['Word Vector'].apply(np.mean)
Genre
A [3.0, 6.0, 8.0, 0.0, 4.0, 8.0, 1.0, 4.0, 0.0, ...
B [4.33333333333, 3.33333333333, 3.66666666667, ...
C [0.5, 7.0, 3.5, 8.0, 4.5, 2.5, 4.5, 0.5, 5.5, ...
D [6.0, 8.33333333333, 5.33333333333, 4.33333333...
E [5.0, 1.0, 5.0, 1.0, 8.0, 1.0, 1.0, 4.0, 5.0, ...
Name: Word Vector, dtype: object
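The result above is an object-dtype Series with one array per genre. If a numeric table with one column per vector component is more convenient, a possible variation is to expand the vectors into columns first and then use the ordinary groupby mean. This is only a sketch under the assumption that all vectors share the same length; the data below is invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with 3-dimensional vectors.
df = pd.DataFrame({
    "Genre": ["A", "B", "B"],
    "Word Vector": [
        np.array([1.0, 2.0, 3.0]),
        np.array([0.0, 4.0, 8.0]),
        np.array([2.0, 6.0, 10.0]),
    ],
})

# One row per movie, one numeric column per vector component.
components = pd.DataFrame(np.stack(df["Word Vector"].to_list()), index=df.index)

# Group the component columns by the matching genre labels and average them.
genre_means = components.groupby(df["Genre"]).mean()
print(genre_means)
```

Each row of `genre_means` is then a genre's average vector, and `genre_means.loc["B"].to_numpy()` recovers it as a plain array.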