Group by column in pandas dataframe and average arrays

Tags: python, arrays, pandas, numpy, mean

I have a movie dataframe with movie names, their respective genres, and vector representations (numpy arrays).

ID  Year    Title   Genre   Word Vector
1   2003.0  Dinosaur Planet Documentary [-0.55423898, -0.72544044, 0.33189204, -0.1720...
2   2004.0  Isle of Man TT 2004 Review  Sports & Fitness    [-0.373265237, -1.07549703, -0.469254494, -0.4...
3   1997.0  Character   Foreign [-1.57682264, -0.91265768, 2.43038678, -0.2114...
4   1994.0  Paula Abdul's Get Up & Dance    Sports & Fitness    [0.3096168, -0.57186663, 0.39008939, 0.2868615...
5   2004.0  The Rise and Fall of ECW    Sports & Fitness    [0.17175879, -2.38005066, -0.45771399, 1.32608...

I'd like to group by genre and get each genre's average vector representation (the component-wise average of the movie vectors in that genre).

I first tried:

movie_df.groupby(['Genre']).mean()

But the built-in mean function isn't able to take the mean of numpy arrays.

I tried writing my own function and applying it to each group, but I'm not sure I'm using apply correctly:

def vector_average(group):
   series_to_array = np.array(group.tolist())
   return np.mean(series_to_array, axis = 0)

movie_df.groupby(['Genre']).apply(vector_average)

Any pointers would be appreciated!

asked Aug 17 '17 04:08

Matt

1 Answer

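If I understand correctly, you can get the component-wise averages by selecting the 'Word Vector' column from the groupby and applying np.mean to that SeriesGroupBy, so each group is passed to the function as a Series of arrays rather than as a whole DataFrame.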

df.groupby('Genre')['Word Vector'].apply(np.mean)

Demo

>>> df = pd.DataFrame({'Title': list('ABCDEFGHIJ'), 
                       'Genre': list('ABCBBDCDED'), 
                       'Word Vector': [np.random.randint(0, 10, 10) 
                                       for _ in range(len('ABCDEFGHIJ'))]})

>>> df

  Genre Title                     Word Vector
0     A     A  [3, 6, 8, 0, 4, 8, 1, 4, 0, 1]
1     B     B  [5, 4, 4, 4, 8, 7, 4, 3, 7, 2]
2     C     C  [1, 7, 6, 7, 3, 3, 8, 1, 8, 1]
3     B     D  [0, 4, 6, 7, 1, 5, 5, 0, 6, 7]
4     B     E  [8, 2, 1, 4, 1, 2, 0, 4, 9, 1]
5     D     F  [7, 9, 7, 8, 8, 7, 2, 9, 1, 3]
6     C     G  [0, 7, 1, 9, 6, 2, 1, 0, 3, 7]
7     D     H  [4, 7, 9, 4, 1, 5, 0, 3, 0, 6]
8     E     I  [5, 1, 5, 1, 8, 1, 1, 4, 5, 6]
9     D     J  [7, 9, 0, 1, 8, 3, 8, 8, 1, 0]

>>> df.groupby('Genre')['Word Vector'].apply(np.mean)

Genre
A    [3.0, 6.0, 8.0, 0.0, 4.0, 8.0, 1.0, 4.0, 0.0, ...
B    [4.33333333333, 3.33333333333, 3.66666666667, ...
C    [0.5, 7.0, 3.5, 8.0, 4.5, 2.5, 4.5, 0.5, 5.5, ...
D    [6.0, 8.33333333333, 5.33333333333, 4.33333333...
E    [5.0, 1.0, 5.0, 1.0, 8.0, 1.0, 1.0, 4.0, 5.0, ...
Name: Word Vector, dtype: object
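
If you would rather not rely on how np.mean treats a Series of arrays, a more explicit variant is to stack each group's vectors into a 2-D array and average along the first axis. This is only a minimal sketch: the helper name mean_vector and the tiny two-genre frame are made up for illustration, and it assumes every vector in a group has the same length.

```python
import numpy as np
import pandas as pd


def mean_vector(vectors: pd.Series) -> np.ndarray:
    """Component-wise mean of a Series whose elements are equal-length 1-D arrays."""
    # Stack the per-movie vectors into shape (n_movies, n_dims),
    # then average over the movie axis.
    return np.stack(vectors.tolist()).mean(axis=0)


# Hypothetical toy data, just to make the arithmetic easy to check by hand.
df = pd.DataFrame({
    "Genre": ["A", "B", "A", "B"],
    "Word Vector": [
        np.array([1.0, 2.0]),
        np.array([0.0, 4.0]),
        np.array([3.0, 6.0]),
        np.array([2.0, 0.0]),
    ],
})

# Selecting the column first means each group arrives as a Series of arrays.
genre_means = df.groupby("Genre")["Word Vector"].apply(mean_vector)
```

Here genre_means is a Series indexed by genre, with one averaged array per genre. Stacking first also fails loudly if the vectors in a group have different lengths, which may be preferable to a silent broadcasting surprise.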

answered Oct 11 '22 17:10

miradulo
