
numpy sort acting weirdly when sorting on a pandas DataFrame

When I do a data[genres].sum() I get the following result

Action        1891
Adult            9
Adventure     1313
Animation      314
Biography      394
Comedy        3922
Crime         1867
Drama         5697
Family         754
Fantasy        916
Film-Noir       40
History        358
Horror        1215
Music          371
Musical        260
Mystery       1009
News             1
Reality-TV       1
Romance       2441
Sci-Fi         897
Sport          288
Thriller      2832
War            512
Western        235
dtype: int64

But when I try to sort on the sum using np.sort

genre_count = np.sort(data[genres].sum())[::-1]
pd.DataFrame({'Genre Count': genre_count})

I get the following result

Out[19]:
    Genre Count
0   5697
1   3922
2   2832
3   2441
4   1891
5   1867
6   1313
7   1215
8   1009
9   916
10  897
11  754
12  512
13  394
14  371
15  358
16  314
17  288
18  260
19  235
20  40
21  9
22  1
23  1

The expected result should be like this:

       Genre Count
Drama         5697
Comedy        3922
Thriller      2832
Romance       2441
Action        1891
Crime         1867
Adventure     1313
Horror        1215
Mystery       1009
Fantasy        916
Sci-Fi         897
Family         754
War            512
Biography      394
Music          371
History        358
Animation      314
Sport          288
Musical        260
Western        235
Film-Noir       40
Adult            9
News             1
Reality-TV       1

It seems like numpy is ignoring the genre column.

Could somebody help me understand where I am going wrong?

function asked Mar 15 '15 10:03



2 Answers

data[genres].sum() returns a Series. The genre column isn't actually a column - it's an index.

np.sort just looks at the values of the DataFrame or Series, not at the index, and it returns a new NumPy array with the sorted data[genres].sum() values. The index information is lost.
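A minimal sketch of that behaviour, using three made-up counts as a stand-in for data[genres].sum() (the name genre_sums is only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data[genres].sum(): a Series indexed by genre name
genre_sums = pd.Series({"Action": 1891, "Adult": 9, "Drama": 5697})

# np.sort works on the underlying values and returns a plain ndarray,
# so the genre labels in the index do not survive
sorted_values = np.sort(genre_sums)[::-1]

print(type(sorted_values))
print(sorted_values)
```

Since the result is an ndarray rather than a Series, there is no index left for pd.DataFrame to use, which is why the output in the question is labelled 0, 1, 2, ...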

The way to sort data[genres].sum() and keep the index information would be to do something like:

genre_count = data[genres].sum()
genre_count.sort(ascending=False)  # in-place sort, high to low (Series.sort was removed in pandas 0.20; newer versions use sort_values)

You can then turn the sorted genre_count Series back into a DataFrame if you like:

pd.DataFrame({'Genre Count': genre_count})
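
To illustrate why this last step keeps the genre names, here is a small sketch comparing a DataFrame built from a Series with one built from a bare array. The counts are made up, and from_series and from_array are hypothetical names:

```python
import pandas as pd

# Hypothetical, already-sorted counts indexed by genre
genre_count = pd.Series({"Drama": 5697, "Action": 1891, "Adult": 9})

# Built from a Series: the rows are labelled with the Series index
from_series = pd.DataFrame({"Genre Count": genre_count})

# Built from a bare array: the rows get a default 0..n-1 index
from_array = pd.DataFrame({"Genre Count": genre_count.to_numpy()})

print(from_series)
print(from_array)
```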

Alex Riley answered Oct 13 '22 03:10


data[genres].sum() returns a Series.

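If you're using pandas 0.20 or later, the command has changed slightly: Series.sort() was removed, so use sort_values() instead. Note that sort_values() returns a new sorted Series rather than sorting in place, so assign the result back.

    genre_count = data[genres].sum()
    genre_count = genre_count.sort_values(ascending=False)

You can find the reference in the pandas documentation.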
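
Putting the pieces together, a runnable sketch might look like the following. The tiny data frame and the genres list are invented stand-ins for the asker's data, and to_frame is just one way to get the single-column table:

```python
import pandas as pd

# Hypothetical stand-ins for the asker's `data` and `genres`:
# one row per title, one 0/1 indicator column per genre
data = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "Drama": [1, 1, 1, 0],
    "Comedy": [0, 1, 1, 0],
    "News": [0, 0, 0, 1],
})
genres = ["Drama", "Comedy", "News"]

# Sum each genre column, then sort the resulting Series high to low;
# sort_values keeps each count attached to its genre label
genre_count = data[genres].sum().sort_values(ascending=False)

# Turn the sorted Series into a one-column DataFrame
genre_table = genre_count.to_frame(name="Genre Count")
print(genre_table)
```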

Yugo Gautomo answered Oct 13 '22 01:10