given a data frame with one descriptive column and X numeric columns, for each row I'd like to identify the top N columns with the higher values and save it as rows on a new dataframe. For example, consider the following data frame: <pre class="prettyprint"><code>df = pd.DataFrame() df['index'] = ['A', 'B', 'C', 'D','E', 'F'] df['option1'] = [1,5,3,7,9,3] df['option2'] = [8,4,5,6,9,2] df['option3'] = [9,9,1,3,9,5] df['option4'] = [3,8,3,5,7,0] df['option5'] = [2,3,4,9,4,2] </code></pre> <img src="https://i.stack.imgur.com/9aqbI.png" alt="enter image description here"> I'd like to output (lets say N is 3, so I want the top 3): <pre class="prettyprint"><code>A,option3 A,option2 A,option4 B,option3 B,option4 B,option1 C,option2 C,option5 C,option4 (or option1 - ties arent really a problem) D,option5 D,option1 D,option2 and so on.... </code></pre> any idea how that can be easily achieved? Thanks

Let's assume <pre class="prettyprint"><code>N = 3 </code></pre> First of all I will create matrix of input fields and for each field remember what was original option for this cell: <pre class="prettyprint"><code>matrix = [[(j, 'option' + str(i)) for j in df['option' + str(i)]] for i in range(1,6)] </code></pre> The result of this line will be: <pre class="prettyprint"><code>[ [(1, 'option1'), (5, 'option1'), (3, 'option1'), (7, 'option1'), (9, 'option1'), (3, 'option1')], [(8, 'option2'), (4, 'option2'), (5, 'option2'), (6, 'option2'), (9, 'option2'), (2, 'option2')], [(9, 'option3'), (9, 'option3'), (1, 'option3'), (3, 'option3'), (9, 'option3'), (5, 'option3')], [(3, 'option4'), (8, 'option4'), (3, 'option4'), (5, 'option4'), (7, 'option4'), (0, 'option4')], [(2, 'option5'), (3, 'option5'), (4, 'option5'), (9, 'option5'), (4, 'option5'), (2, 'option5')] ] </code></pre> Then we can easly transform matrix using zip function, sort result rows by first element of tuple and take N first items: <pre class="prettyprint"><code>transformed = [sorted(l, key=lambda x: x[0], reverse=True)[:N] for l in zip(*matrix)] </code></pre> List transformed will look like: <pre class="prettyprint"><code>[ [(9, 'option3'), (8, 'option2'), (3, 'option4')], [(9, 'option3'), (8, 'option4'), (5, 'option1')], [(5, 'option2'), (4, 'option5'), (3, 'option1')], [(9, 'option5'), (7, 'option1'), (6, 'option2')], [(9, 'option1'), (9, 'option2'), (9, 'option3')], [(5, 'option3'), (3, 'option1'), (2, 'option2')] ] </code></pre> The last step will be joining column index and result tuple by: <pre class="prettyprint"><code>for id, top in zip(df['index'], transformed): for option in top: print id + ',' + option[1] print '' </code></pre>

Finding top N columns for each row in data frame

Tags:

python

pandas

dataframe

python-2.7

top-n

given a data frame with one descriptive column and X numeric columns, for each row I'd like to identify the top N columns with the higher values and save it as rows on a new dataframe.

For example, consider the following data frame:

df = pd.DataFrame()
df['index'] = ['A', 'B', 'C', 'D','E', 'F']
df['option1'] = [1,5,3,7,9,3]
df['option2'] = [8,4,5,6,9,2]
df['option3'] = [9,9,1,3,9,5]
df['option4'] = [3,8,3,5,7,0]
df['option5'] = [2,3,4,9,4,2]

enter image description here

I'd like to output (lets say N is 3, so I want the top 3):

A,option3
A,option2
A,option4

B,option3
B,option4
B,option1

C,option2
C,option5
C,option4 (or option1 - ties arent really a problem)

D,option5
D,option1
D,option2

and so on....

any idea how that can be easily achieved? Thanks

862

asked Dec 15 '15 19:12

Diego

3 Answers

If you just want pairings:

from operator import itemgetter as it
from itertools import repeat
n = 3

 # sort_values = order pandas < 0.17
new_d = (zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems())))
                 for _, row in df.iterrows())
for row in new_d:
    print(list(row))

Output:

[('B', 'option3'), ('B', 'option4'), ('B', 'option1')]
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')]
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')]
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')]
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]

Which also maintains the order.

If you want a list of lists:

from operator import itemgetter as it
from itertools import repeat
n = 3

new_d = [list(zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems()))))
                 for _, row in df.iterrows()]

Output:

[[('A', 'option3'), ('A', 'option2'), ('A', 'option4')],
[('B', 'option3'), ('B', 'option4'), ('B', 'option1')], 
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')], 
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')], 
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')],
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]]

Or using pythons sorted:

new_d = [list(zip(repeat(row["index"]), map(it(0), sorted(row[1:].iteritems(), key=it(1) ,reverse=1)[:n])))
                     for _, row in df.iterrows()]

Which is actually the fastest, if you really want strings, it is pretty trivial to format the output however you want.

104

answered Sep 21 '22 22:09

Padraic Cunningham

Let's assume

N = 3

First of all I will create matrix of input fields and for each field remember what was original option for this cell:

matrix = [[(j, 'option' + str(i)) for j in df['option' + str(i)]] for i in range(1,6)]

The result of this line will be:

[
 [(1, 'option1'), (5, 'option1'), (3, 'option1'), (7, 'option1'), (9, 'option1'), (3, 'option1')],
 [(8, 'option2'), (4, 'option2'), (5, 'option2'), (6, 'option2'), (9, 'option2'), (2, 'option2')],
 [(9, 'option3'), (9, 'option3'), (1, 'option3'), (3, 'option3'), (9, 'option3'), (5, 'option3')],
 [(3, 'option4'), (8, 'option4'), (3, 'option4'), (5, 'option4'), (7, 'option4'), (0, 'option4')],
 [(2, 'option5'), (3, 'option5'), (4, 'option5'), (9, 'option5'), (4, 'option5'), (2, 'option5')]
]

Then we can easly transform matrix using zip function, sort result rows by first element of tuple and take N first items:

transformed = [sorted(l, key=lambda x: x[0], reverse=True)[:N] for l in zip(*matrix)]

List transformed will look like:

[
 [(9, 'option3'), (8, 'option2'), (3, 'option4')],
 [(9, 'option3'), (8, 'option4'), (5, 'option1')],
 [(5, 'option2'), (4, 'option5'), (3, 'option1')],
 [(9, 'option5'), (7, 'option1'), (6, 'option2')],
 [(9, 'option1'), (9, 'option2'), (9, 'option3')],
 [(5, 'option3'), (3, 'option1'), (2, 'option2')]
]

The last step will be joining column index and result tuple by:

for id, top in zip(df['index'], transformed):
    for option in top:
        print id + ',' + option[1]
    print ''

answered Sep 18 '22 22:09

kubked

dfc = df.copy()
result = {}

#First, I would effectively transpose this

for key in dfc:
    if key != 'index':
        for i in xrange(0,len(dfc['index'])):
            if dfc['index'][i] not in result:
                result[dfc['index'][i]] = []
            result[dfc['index'][i]] += [(key,dfc[key][i])]


def get_topn(result,n):
    #Use this to get the top value for each option
    return [x[0] for x in sorted(result,key=lambda x:-x[1])[0:min(len(result),n)]]


#Lastly, print the output in your desired format.
n = 3
keys = sorted([k for k in result])
for key in keys:
      for option in get_topn(result[key],n):
         print str(key) + ',' + str(option)
      print