When I run data[genres].sum()
I get the following result:
Action 1891
Adult 9
Adventure 1313
Animation 314
Biography 394
Comedy 3922
Crime 1867
Drama 5697
Family 754
Fantasy 916
Film-Noir 40
History 358
Horror 1215
Music 371
Musical 260
Mystery 1009
News 1
Reality-TV 1
Romance 2441
Sci-Fi 897
Sport 288
Thriller 2832
War 512
Western 235
dtype: int64
But when I try to sort on the sum using np.sort
genre_count = np.sort(data[genres].sum())[::-1]
pd.DataFrame({'Genre Count': genre_count})
I get the following result
Out[19]:
Genre Count
0 5697
1 3922
2 2832
3 2441
4 1891
5 1867
6 1313
7 1215
8 1009
9 916
10 897
11 754
12 512
13 394
14 371
15 358
16 314
17 288
18 260
19 235
20 40
21 9
22 1
23 1
The expected result should be like this:
Genre Count
Drama 5697
Comedy 3922
Thriller 2832
Romance 2441
Action 1891
Crime 1867
Adventure 1313
Horror 1215
Mystery 1009
Fantasy 916
Sci-Fi 897
Family 754
War 512
Biography 394
Music 371
History 358
Animation 314
Sport 288
Musical 260
Western 235
Film-Noir 40
Adult 9
News 1
Reality-TV 1
It seems like numpy is ignoring the genre column.
Could somebody help me understand where I am going wrong?
Python lists are better optimized for "plain Python" code: reading or writing a single list element is faster than doing the same on a NumPy array. The benefit of NumPy arrays comes from whole-array (vectorized) operations and from compiled extensions.
To sort a DataFrame by the values in a single column, use .sort_values(). By default, it returns a new DataFrame sorted in ascending order and does not modify the original DataFrame.
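As a small illustration of that behaviour, here is a sketch on made-up data (the frame `df` and its column names are assumptions for the example, not the asker's dataset):

```python
import pandas as pd

# Hypothetical toy data, not the asker's dataset.
df = pd.DataFrame({"genre": ["Drama", "Comedy", "War"],
                   "count": [5697, 3922, 512]})

# sort_values returns a new DataFrame; df itself is left untouched.
by_count = df.sort_values("count")                         # ascending by default
by_count_desc = df.sort_values("count", ascending=False)   # largest first

print(by_count)
print(by_count_desc)
```

Note that the row labels travel with the rows, so nothing gets separated from its count.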
np.sort() sorts a NumPy array in ascending order. To get descending order (largest to smallest), sort first and then reverse the result with the slice syntax array[::-1].
Pandas tends to perform better when the number of rows is 500K or more, while NumPy tends to perform better when the number of rows is 50K or less. Indexing a pandas Series is much slower than indexing a NumPy array.
data[genres].sum()
returns a Series. The genre column isn't actually a column - it's an index.
np.sort
just looks at the values of the DataFrame or Series, not at the index, and it returns a new NumPy array with the sorted data[genres].sum()
values. The index information is lost.
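To see the index being dropped, here is a minimal sketch. The Series `genre_sum` is a hypothetical stand-in for data[genres].sum(), holding just three of the counts from the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data[genres].sum(): counts indexed by genre name.
genre_sum = pd.Series({"Action": 1891, "Adult": 9, "Drama": 5697})

# np.sort only sees the values and hands back a plain ndarray.
sorted_values = np.sort(genre_sum)[::-1]

print(type(sorted_values))   # numpy.ndarray, with no genre labels attached
print(sorted_values)
print(genre_sum.index)       # the labels still live on the original Series
```

Wrapping `sorted_values` in a DataFrame therefore gives the default 0, 1, 2, ... index seen in the question.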
The way to sort data[genres].sum()
and keep the index information would be to do something like:
genre_count = data[genres].sum()
genre_count.sort(ascending=False) # in-place sort of genre_count, high to low (older pandas only; later versions dropped Series.sort in favor of sort_values)
You can then turn the sorted genre_count
Series back into a DataFrame if you like:
pd.DataFrame({'Genre Count': genre_count})
data[genres].sum()
returns a Series.
If you're using a newer pandas version (0.20, for example), the command changes slightly:
genre_count = data[genres].sum()
genre_count = genre_count.sort_values(ascending=False) # sort_values returns a new Series, so assign it back
You can find the reference in the pandas documentation.
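Putting the pieces together, a self-contained sketch might look like the following. The toy `data` frame and the `genres` list are assumptions made up for the example (one 0/1 indicator column per genre); substitute your own:

```python
import pandas as pd

# Hypothetical toy frame: one 0/1 indicator column per genre.
data = pd.DataFrame({
    "Action": [1, 0, 1, 0],
    "Comedy": [0, 1, 1, 1],
    "Drama":  [1, 1, 1, 1],
    "News":   [0, 0, 0, 1],
})
genres = ["Action", "Comedy", "Drama", "News"]

# Sum each genre column, then sort the resulting Series high to low.
# The genre names stay attached because they are the Series index.
genre_count = data[genres].sum().sort_values(ascending=False)

# Turn the sorted Series into a one-column DataFrame.
result = genre_count.to_frame(name="Genre Count")
print(result)
```

Using `to_frame` here is just one option; `pd.DataFrame({'Genre Count': genre_count})` from the answer above gives the same layout.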