
Find the top 5 values based on the sum in the last column and last row

I would like to find the highest and the lowest 5 values based on the sums in the last column and the last row of a table that has more than 20,000 rows and 200 columns (it is a multilabel problem). The original table has no row or column sums; I added them myself. Here is a toy dataset:

    import pandas as pd

    # Toy dataset: one label column ('index') and nine binary factor columns
    data = {'index': ['0001','0002','0003','0004','0005','0006',
                      '0007','0008','0009','0010','0011'],
        'factor1': [0,1,0,1,0,0,1,0,0,0,1],
        'factor2': [1,0,0,1,0,0,0,1,1,1,1],
        'factor3': [1,1,1,1,0,0,0,1,1,0,1],
        'factor4': [0,1,1,1,0,0,1,1,0,0,1],
        'factor5': [1,1,1,1,0,0,0,1,1,1,1],
        'factor6': [1,0,0,0,0,0,0,1,1,1,1],
        'factor7': [0,1,1,1,1,0,1,1,0,0,1],
        'factor8': [1,1,1,1,1,1,0,1,1,1,1],
        'factor9': [1,0,0,0,0,0,0,0,0,0,0],
        }

    df = pd.DataFrame(data,columns=['index','factor1','factor2','factor3','factor4','factor5','factor6','factor7','factor8','factor9'])
    count_row = df.count(axis=1)
    df

Here is the generated table:

    index  factor1  factor2  factor3  factor4  factor5  factor6  factor7  factor8  factor9
0    0001        0        1        1        0        1        1        0        1        1
1    0002        1        0        1        1        1        0        1        1        0
2    0003        0        0        1        1        1        0        1        1        0
3    0004        1        1        1        1        1        0        1        1        0
4    0005        0        0        0        0        0        0        1        1        0
5    0006        0        0        0        0        0        0        0        1        0
6    0007        1        0        0        1        0        0        1        0        0
7    0008        0        1        1        1        1        1        1        1        0
8    0009        0        1        1        0        1        1        0        1        0
9    0010        0        1        0        0        1        1        0        1        0
10   0011        1        1        1        1        1        1        1        1        0

Using this code, I computed the sum of each column and each row:

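# Column totals, taken before the "sum" column exists; numeric_only skips the string 'index' column
classSum = df.sum(axis=0, numeric_only=True)
# Row totals stored in a new "sum" column
df["sum"] = df.sum(axis=1, numeric_only=True)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the totals as a new last row
df = pd.concat([df, classSum.to_frame().T], ignore_index=True)
rowSum = df.sum(axis=1, numeric_only=True)
df.at[11, 'index'] = 'Nan'
df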

Table with sums in columns and rows:

    index   factor1 factor2 factor3 factor4 factor5 factor6 factor7 factor8 factor9 sum
  0  0001     0        1       1       0       1       1       0       1       1    6.0
  1  0002     1        0       1       1       1       0       1       1       0    6.0
  2  0003     0        0       1       1       1       0       1       1       0    5.0
  3  0004     1        1       1       1       1       0       1       1       0    7.0
  4  0005     0        0       0       0       0       0       1       1       0    2.0
  5  0006     0        0       0       0       0       0       0       1       0    1.0
  6  0007     1        0       0       1       0       0       1       0       0    3.0
  7  0008     0        1       1       1       1       1       1       1       0    7.0
  8  0009     0        1       1       0       1       1       0       1       0    5.0
  9  0010     0        1       0       0       1       1       0       1       0    4.0
  10 0011     1        1       1       1       1       1       1       1       0    8.0
  11 Nan      4        6       7       6       8       5       7       10      1    NaN

Note: row 11 is the sum row

I would like to have a result like this:

Based on the sum row (the column totals), the top five values would look like this:

  factor8 : 10
  factor5 : 8
  factor3 : 7
  factor7 : 7
  factor4 : 6

Based on the sum column (the row totals), the top five values would look like this:

 0011 : 8
 0008 : 7
 0004 : 7
 0001 : 6
 0002 : 6

Some of the sums are tied. The order among tied values does not matter.

So how can I do it? Thank you!

asked Sep 14 '20 14:09 by almo



1 Answer

Starting from your raw data (without the added sum row and sum column), we can use DataFrame.sum to get the sum per column (the default, axis=0) or per row (axis=1), then chain the result with Series.nlargest to get the top 5.

df = df.set_index('index')

Top 5 columns:

df.sum().nlargest(5)

factor8    10
factor5     8
factor3     7
factor7     7
factor2     6
dtype: int64

Top 5 rows:

df.sum(axis=1).nlargest(5)

index
0011     8
0004     7
0008     7
0001     6
0002     6
dtype: int64
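
The question also asks for the lowest five values, which the snippets above do not show. The following is a minimal sketch rather than part of the original answer: it rebuilds the toy data from the question so that it runs on its own, uses Series.nsmallest as the mirror image of nlargest, and shows keep='all' as one optional way to return every value tied with the fifth instead of cutting the ties off. The variable names are illustrative.

```python
import pandas as pd

# Toy data from the question, rebuilt so this block is self-contained.
data = {
    'index': ['0001', '0002', '0003', '0004', '0005', '0006',
              '0007', '0008', '0009', '0010', '0011'],
    'factor1': [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    'factor2': [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1],
    'factor3': [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    'factor4': [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    'factor5': [1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    'factor6': [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    'factor7': [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1],
    'factor8': [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    'factor9': [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
df = pd.DataFrame(data).set_index('index')

col_sums = df.sum()        # one total per factor column
row_sums = df.sum(axis=1)  # one total per row label

# Lowest five totals in each direction
lowest_cols = col_sums.nsmallest(5)
lowest_rows = row_sums.nsmallest(5)

# keep='all' returns every entry tied with the 5th value,
# so the result can contain more than five entries.
top_cols_with_ties = col_sums.nlargest(5, keep='all')

print(lowest_cols)
print(lowest_rows)
print(top_cols_with_ties)
```

Because both methods work on the Series returned by sum, the appended sum row and sum column from the question are not needed at all.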

If you actually want dictionaries, chain the result with to_dict:

df.sum().nlargest(5).to_dict()

{'factor8': 10, 'factor5': 8, 'factor3': 7, 'factor7': 7, 'factor2': 6}
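
If the goal is the plain "label : value" listing shown in the question rather than a dictionary, the entries can be formatted directly. This is a small sketch under assumptions: top_n_lines is a hypothetical helper (not a pandas function), and the totals are copied by hand from the sum row of the question's table so the block runs on its own.

```python
import pandas as pd


def top_n_lines(totals, n=5):
    """Return 'label : value' strings for the n largest entries of a Series.

    Hypothetical helper for illustration; not part of pandas.
    """
    return [f"{label} : {value}" for label, value in totals.nlargest(n).items()]


# Column totals copied from the sum row in the question's table.
col_totals = pd.Series({
    'factor1': 4, 'factor2': 6, 'factor3': 7, 'factor4': 6, 'factor5': 8,
    'factor6': 5, 'factor7': 7, 'factor8': 10, 'factor9': 1,
})

lines = top_n_lines(col_totals)
print("\n".join(lines))
```

The same helper works for the row totals by passing df.sum(axis=1) instead.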

To plot your result, use Series.plot.bar (the result of sum and nlargest is a Series):

df.sum().nlargest(5).plot.bar(figsize=(12,8))

[Image: bar plot of the top five column sums]

answered Oct 22 '22 21:10 by Erfan