numpy sort acting weirdly when sorting on a pandas DataFrame

Tags: python, sorting, pandas, dataframe, numpy

When I run data[genres].sum(), I get the following result:

Action        1891
Adult            9
Adventure     1313
Animation      314
Biography      394
Comedy        3922
Crime         1867
Drama         5697
Family         754
Fantasy        916
Film-Noir       40
History        358
Horror        1215
Music          371
Musical        260
Mystery       1009
News             1
Reality-TV       1
Romance       2441
Sci-Fi         897
Sport          288
Thriller      2832
War            512
Western        235
dtype: int64

But when I try to sort the sums using np.sort:

genre_count = np.sort(data[genres].sum())[::-1]
pd.DataFrame({'Genre Count': genre_count})

I get the following result:

Out[19]:
    Genre Count
0   5697
1   3922
2   2832
3   2441
4   1891
5   1867
6   1313
7   1215
8   1009
9   916
10  897
11  754
12  512
13  394
14  371
15  358
16  314
17  288
18  260
19  235
20  40
21  9
22  1
23  1

The expected result should be like this:

            Genre Count
Drama              5697
Comedy             3922
Thriller           2832
Romance            2441
Action             1891
Crime              1867
Adventure          1313
Horror             1215
Mystery            1009
Fantasy             916
Sci-Fi              897
Family              754
War                 512
Biography           394
Music               371
History             358
Animation           314
Sport               288
Musical             260
Western             235
Film-Noir            40
Adult                 9
News                  1
Reality-TV            1

It seems like numpy is ignoring the genre column.

Could somebody help me understand where I am going wrong?

asked Mar 15 '15 10:03

function

2 Answers

data[genres].sum() returns a Series. The genre column isn't actually a column - it's an index.

np.sort just looks at the values of the DataFrame or Series, not at the index, and it returns a new NumPy array with the sorted data[genres].sum() values. The index information is lost.
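
To make the loss of the index concrete, here is a minimal sketch. The Series genre_sums is a hypothetical stand-in for data[genres].sum(), built from a few of the counts in the question; it is not the asker's real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data[genres].sum(), using four counts from the question
genre_sums = pd.Series({"Action": 1891, "Adult": 9, "Comedy": 3922, "Drama": 5697})

# np.sort converts the Series to a bare array before sorting
sorted_values = np.sort(genre_sums)[::-1]

print(type(sorted_values))  # a NumPy ndarray, not a pandas Series
print(sorted_values)        # only the numbers survive; the genre labels are gone
```

Because the result is a plain array, passing it to pd.DataFrame produces the default 0, 1, 2, ... index seen in the question.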

The way to sort data[genres].sum() and keep the index information would be to do something like:

genre_count = data[genres].sum()
genre_count.sort(ascending=False) # in-place sort of genre_count, high to low (older pandas only; Series.sort was removed in pandas 0.20 in favor of sort_values)

You can then turn the sorted genre_count Series back into a DataFrame if you like:

pd.DataFrame({'Genre Count': genre_count})
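
Put together, the approach might look like the sketch below on a current pandas release. It assumes sort_values in place of the older Series.sort call, and genre_sums is again a hypothetical stand-in for data[genres].sum().

```python
import pandas as pd

# Hypothetical stand-in for data[genres].sum()
genre_sums = pd.Series({"Action": 1891, "Adult": 9, "Comedy": 3922, "Drama": 5697})

# Sort the Series itself so each count stays attached to its genre label
genre_count = genre_sums.sort_values(ascending=False)

# The Series index (the genre names) becomes the DataFrame index
result = pd.DataFrame({"Genre Count": genre_count})
print(result)
```

The genre names now appear as row labels, which matches the shape of the expected result in the question.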

answered Oct 13 '22 03:10

Alex Riley

data[genres].sum() returns a Series.

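If you're using pandas 0.20 or later, the command changes slightly, because Series.sort was replaced by sort_values. Note that sort_values returns a new Series rather than sorting in place, so assign the result:

    genre_count = data[genres].sum()
    genre_count = genre_count.sort_values(ascending=False)

You can find the reference in the pandas documentation.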
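
One detail worth checking for yourself: by default sort_values returns a new Series and leaves the original untouched, so the result has to be assigned to a name. A small sketch with made-up sample values taken from the question:

```python
import pandas as pd

# Three counts from the question, deliberately out of order
genre_count = pd.Series({"News": 1, "Drama": 5697, "Comedy": 3922})

# sort_values returns a sorted copy; genre_count itself is unchanged
ordered = genre_count.sort_values(ascending=False)

print(list(genre_count.index))  # original order is preserved
print(list(ordered.index))      # sorted high to low, labels intact
```

If in-place behaviour is preferred, sort_values also accepts inplace=True, though reassigning the result is the more common style.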

answered Oct 13 '22 01:10

Yugo Gautomo
