
Find the top 5 values based on the sum in the last column and last row

Tags:

python

pandas

dataframe

I would like to find the 5 highest and the 5 lowest values based on the sums in the last column and the last row of a table that has more than 20,000 rows and 200 columns (it is a multilabel problem). The original table does not contain the row and column sums; I added them myself. See the toy dataset here:

import pandas as pd

data = {
    'index': ['0001 ','0002 ','0003 ','0004 ','0005 ','0006 ','0007','0008','0009','0010','0011'],
    'factor1': [0,1,0,1,0,0,1,0,0,0,1],
    'factor2': [1,0,0,1,0,0,0,1,1,1,1],
    'factor3': [1,1,1,1,0,0,0,1,1,0,1],
    'factor4': [0,1,1,1,0,0,1,1,0,0,1],
    'factor5': [1,1,1,1,0,0,0,1,1,1,1],
    'factor6': [1,0,0,0,0,0,0,1,1,1,1],
    'factor7': [0,1,1,1,1,0,1,1,0,0,1],
    'factor8': [1,1,1,1,1,1,0,1,1,1,1],
    'factor9': [1,0,0,0,0,0,0,0,0,0,0],
}

df = pd.DataFrame(data, columns=['index','factor1','factor2','factor3','factor4','factor5','factor6','factor7','factor8','factor9'])
count_row = df.count(axis=1)
df

Here is the generated table:

    index  factor1  factor2  factor3  factor4  factor5  factor6  factor7  factor8  factor9
0    0001        0        1        1        0        1        1        0        1        1
1    0002        1        0        1        1        1        0        1        1        0
2    0003        0        0        1        1        1        0        1        1        0
3    0004        1        1        1        1        1        0        1        1        0
4    0005        0        0        0        0        0        0        1        1        0
5    0006        0        0        0        0        0        0        0        1        0
6    0007        1        0        0        1        0        0        1        0        0
7    0008        0        1        1        1        1        1        1        1        0
8    0009        0        1        1        0        1        1        0        1        0
9    0010        0        1        0        0        1        1        0        1        0
10   0011        1        1        1        1        1        1        1        1        0

Using this code, I got the sum of each column and each row:

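# Column totals; numeric_only=True skips the string 'index' column
classSum = df.sum(axis=0, numeric_only=True)
df["sum"] = df.sum(axis=1, numeric_only=True)
# DataFrame.append was removed in pandas 2.0; pd.concat adds the sum row instead
df = pd.concat([df, classSum.to_frame().T], ignore_index=True)
rowSum = df.sum(axis=1, numeric_only=True)
df.at[11, 'index'] = 'Nan'
df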

Table with sums in columns and rows:

    index   factor1 factor2 factor3 factor4 factor5 factor6 factor7 factor8 factor9 sum
  0  0001     0        1       1       0       1       1       0       1       1    6.0
  1  0002     1        0       1       1       1       0       1       1       0    6.0
  2  0003     0        0       1       1       1       0       1       1       0    5.0
  3  0004     1        1       1       1       1       0       1       1       0    7.0
  4  0005     0        0       0       0       0       0       1       1       0    2.0
  5  0006     0        0       0       0       0       0       0       1       0    1.0
  6  0007     1        0       0       1       0       0       1       0       0    3.0
  7  0008     0        1       1       1       1       1       1       1       0    7.0
  8  0009     0        1       1       0       1       1       0       1       0    5.0
  9  0010     0        1       0       0       1       1       0       1       0    4.0
  10 0011     1        1       1       1       1       1       1       1       0    8.0
  11 Nan      4        6       7       6       8       5       7       10      1    NaN

Note: row 11 is the sum row

I would like to have a result like this:

Based on the sum row, the top five values should look like this:

  factor 8 :10
  factor 5 : 8
  factor 3 : 7
  factor 7 : 7
  factor 4 : 6

Based on the sum column, the top five values should look like this:

  0011 :8
  0008 :7
  0004 :7
  0001 :6
  0002 :6

Some of the sums are tied. The ties can be ignored.

So how can I do it? Thank you!


asked Sep 14 '20 14:09

almo

1 Answer

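Starting from your raw data (that is, without the added sum row and sum column), use DataFrame.sum to get the sum per column (the default, axis=0) or per row (axis=1), then chain Series.nlargest to get the top 5. First move the 'index' column into the DataFrame index so that it is not included in the sums: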

df = df.set_index('index')
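As a side note on why the labels are moved out of the way first: the sums should only cover the numeric factor columns. The sketch below uses a small made-up frame (three rows, two factors, not the question's data) to compare set_index with the alternative of passing numeric_only=True to DataFrame.sum. Both give the same totals, but only set_index keeps the sample labels on the row sums.

```python
import pandas as pd

# Small hypothetical frame, only for illustration
df = pd.DataFrame({
    "index": ["0001", "0002", "0003"],
    "factor1": [0, 1, 0],
    "factor2": [1, 0, 0],
})

# Option 1: move the labels into the index, then sum everything that is left
by_label = df.set_index("index")
col_sums = by_label.sum()          # one total per factor
row_sums = by_label.sum(axis=1)    # one total per sample, labelled by 'index'

# Option 2: keep the string column and ask sum() to skip non-numeric columns
col_sums_alt = df.sum(numeric_only=True)
row_sums_alt = df.sum(axis=1, numeric_only=True)  # labelled 0, 1, 2 instead

print(row_sums)
print(row_sums_alt)
```

With option 2 the row sums carry the positional labels 0, 1, 2, so the sample names would have to be looked up afterwards.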

Top 5 columns:

df.sum().nlargest(5)

factor8    10
factor5     8
factor3     7
factor7     7
factor2     6
dtype: int64

Top 5 rows:

df.sum(axis=1).nlargest(5)

index
0011     8
0004     7
0008     7
0001     6
0002     6
dtype: int64
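The question also asks for the lowest 5 values, and some of the sums are tied. A minimal sketch of both points, assuming the toy data from the question with the index labels written without the stray trailing spaces: Series.nsmallest mirrors nlargest, and keep="all" returns every entry tied with the last selected value, so the result can be longer than 5.

```python
import pandas as pd

# Toy data from the question; the labels are generated here without the
# trailing spaces that appear in the original (an assumption for readability).
factors = {
    "factor1": [0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1],
    "factor2": [1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1],
    "factor3": [1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    "factor4": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    "factor5": [1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
    "factor6": [1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "factor7": [0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1],
    "factor8": [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
    "factor9": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
labels = [f"{i:04d}" for i in range(1, 12)]  # '0001' ... '0011'
df = pd.DataFrame(factors, index=pd.Index(labels, name="index"))

col_sums = df.sum()        # total per factor
row_sums = df.sum(axis=1)  # total per sample

# Lowest 5 per axis
lowest_cols = col_sums.nsmallest(5)
lowest_rows = row_sums.nsmallest(5)

# keep="all" also returns entries tied with the 5th value
lowest_rows_with_ties = row_sums.nsmallest(5, keep="all")
top_cols_with_ties = col_sums.nlargest(5, keep="all")

print(lowest_cols)
print(lowest_rows_with_ties)
print(top_cols_with_ties)
```

This also explains the small difference from the expected output in the question: factor2 and factor4 both sum to 6, so with the default keep="first" only one of them makes it into the top 5.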

If you actually want a dictionary, chain the result with to_dict:

df.sum().nlargest(5).to_dict()

{'factor8': 10, 'factor5': 8, 'factor3': 7, 'factor7': 7, 'factor2': 6}
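If the goal is plain "name : value" lines like the ones in the question, iterating over Series.items() may be enough. This is only a sketch: the column totals are copied from the sum row of the question's table, and the exact output format is an assumption.

```python
import pandas as pd

# Column totals copied from the sum row (row 11) of the question's table
col_sums = pd.Series({
    "factor1": 4, "factor2": 6, "factor3": 7, "factor4": 6, "factor5": 8,
    "factor6": 5, "factor7": 7, "factor8": 10, "factor9": 1,
})

top5 = col_sums.nlargest(5)

# One "name : value" line per entry, largest first
lines = [f"{name} : {total}" for name, total in top5.items()]
print("\n".join(lines))
```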

To plot your result, use Series.plot.bar (the result of nlargest is a Series):

df.sum().nlargest(5).plot.bar(figsize=(12,8))

[Image: bar plot of the top 5 column sums]


answered Oct 22 '22 21:10

Erfan
