Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding top N columns for each row in data frame

given a data frame with one descriptive column and X numeric columns, for each row I'd like to identify the top N columns with the higher values and save it as rows on a new dataframe.

For example, consider the following data frame:

df = pd.DataFrame()
df['index'] = ['A', 'B', 'C', 'D','E', 'F']
df['option1'] = [1,5,3,7,9,3]
df['option2'] = [8,4,5,6,9,2]
df['option3'] = [9,9,1,3,9,5]
df['option4'] = [3,8,3,5,7,0]
df['option5'] = [2,3,4,9,4,2]

enter image description here

I'd like to output (lets say N is 3, so I want the top 3):

A,option3
A,option2
A,option4

B,option3
B,option4
B,option1

C,option2
C,option5
C,option4 (or option1 - ties arent really a problem)

D,option5
D,option1
D,option2

and so on....

any idea how that can be easily achieved? Thanks

like image 862
Diego Avatar asked Dec 15 '15 19:12

Diego


People also ask

How many rows and columns are in a data frame?

As you can see based on Table 1, our example data is a data frame containing 15 rows and two columns. The variable group contains three different group indicators and the variable value contains the corresponding values.

How to get top values for all columns in a Dataframe?

If you like to get top values for all columns you can use the next syntax: This will return the top values per column as a new DataFrame: It's possible to use methods nsmallest and nlargest with datetime. You need to be sure that the target column is from type: datetime64 [ns, UTC]

How to group the columns in a pandas Dataframe?

Here we will use Groupby () function of pandas to group the columns. So we can do it as follows: Firstly, we created a pandas dataframe: Now, we will get topmost N values of each group of the ‘Variables’ column. Here reset_index () is used to provide a new index according to the grouping of data.

How to get the highest and lowest values in pandas Dataframe?

You can use functions: nsmallest - return the first n rows ordered by columns in ascending order nlargest - return the first n rows ordered by columns in descending order to get the top N highest or lowest values in Pandas DataFrame.


3 Answers

If you just want pairings:

from operator import itemgetter as it
from itertools import repeat
n = 3

 # sort_values = order pandas < 0.17
new_d = (zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems())))
                 for _, row in df.iterrows())
for row in new_d:
    print(list(row))

Output:

[('B', 'option3'), ('B', 'option4'), ('B', 'option1')]
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')]
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')]
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')]
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]

Which also maintains the order.

If you want a list of lists:

from operator import itemgetter as it
from itertools import repeat
n = 3

new_d = [list(zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems()))))
                 for _, row in df.iterrows()]

Output:

[[('A', 'option3'), ('A', 'option2'), ('A', 'option4')],
[('B', 'option3'), ('B', 'option4'), ('B', 'option1')], 
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')], 
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')], 
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')],
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]]

Or using pythons sorted:

new_d = [list(zip(repeat(row["index"]), map(it(0), sorted(row[1:].iteritems(), key=it(1) ,reverse=1)[:n])))
                     for _, row in df.iterrows()]

Which is actually the fastest, if you really want strings, it is pretty trivial to format the output however you want.

like image 104
Padraic Cunningham Avatar answered Sep 21 '22 22:09

Padraic Cunningham


Let's assume

N = 3

First of all I will create matrix of input fields and for each field remember what was original option for this cell:

matrix = [[(j, 'option' + str(i)) for j in df['option' + str(i)]] for i in range(1,6)]

The result of this line will be:

[
 [(1, 'option1'), (5, 'option1'), (3, 'option1'), (7, 'option1'), (9, 'option1'), (3, 'option1')],
 [(8, 'option2'), (4, 'option2'), (5, 'option2'), (6, 'option2'), (9, 'option2'), (2, 'option2')],
 [(9, 'option3'), (9, 'option3'), (1, 'option3'), (3, 'option3'), (9, 'option3'), (5, 'option3')],
 [(3, 'option4'), (8, 'option4'), (3, 'option4'), (5, 'option4'), (7, 'option4'), (0, 'option4')],
 [(2, 'option5'), (3, 'option5'), (4, 'option5'), (9, 'option5'), (4, 'option5'), (2, 'option5')]
]

Then we can easly transform matrix using zip function, sort result rows by first element of tuple and take N first items:

transformed = [sorted(l, key=lambda x: x[0], reverse=True)[:N] for l in zip(*matrix)]

List transformed will look like:

[
 [(9, 'option3'), (8, 'option2'), (3, 'option4')],
 [(9, 'option3'), (8, 'option4'), (5, 'option1')],
 [(5, 'option2'), (4, 'option5'), (3, 'option1')],
 [(9, 'option5'), (7, 'option1'), (6, 'option2')],
 [(9, 'option1'), (9, 'option2'), (9, 'option3')],
 [(5, 'option3'), (3, 'option1'), (2, 'option2')]
]

The last step will be joining column index and result tuple by:

for id, top in zip(df['index'], transformed):
    for option in top:
        print id + ',' + option[1]
    print ''
like image 30
kubked Avatar answered Sep 18 '22 22:09

kubked


dfc = df.copy()
result = {}

#First, I would effectively transpose this

for key in dfc:
    if key != 'index':
        for i in xrange(0,len(dfc['index'])):
            if dfc['index'][i] not in result:
                result[dfc['index'][i]] = []
            result[dfc['index'][i]] += [(key,dfc[key][i])]


def get_topn(result,n):
    #Use this to get the top value for each option
    return [x[0] for x in sorted(result,key=lambda x:-x[1])[0:min(len(result),n)]]


#Lastly, print the output in your desired format.
n = 3
keys = sorted([k for k in result])
for key in keys:
      for option in get_topn(result[key],n):
         print str(key) + ',' + str(option)
      print
like image 32
Adam W. Cooper Avatar answered Sep 18 '22 22:09

Adam W. Cooper