given a data frame with one descriptive column and X numeric columns, for each row I'd like to identify the top N columns with the higher values and save it as rows on a new dataframe.
For example, consider the following data frame:
df = pd.DataFrame()
df['index'] = ['A', 'B', 'C', 'D','E', 'F']
df['option1'] = [1,5,3,7,9,3]
df['option2'] = [8,4,5,6,9,2]
df['option3'] = [9,9,1,3,9,5]
df['option4'] = [3,8,3,5,7,0]
df['option5'] = [2,3,4,9,4,2]
I'd like to output (lets say N is 3, so I want the top 3):
A,option3
A,option2
A,option4
B,option3
B,option4
B,option1
C,option2
C,option5
C,option4 (or option1 - ties arent really a problem)
D,option5
D,option1
D,option2
and so on....
any idea how that can be easily achieved? Thanks
As you can see based on Table 1, our example data is a data frame containing 15 rows and two columns. The variable group contains three different group indicators and the variable value contains the corresponding values.
If you like to get top values for all columns you can use the next syntax: This will return the top values per column as a new DataFrame: It's possible to use methods nsmallest and nlargest with datetime. You need to be sure that the target column is from type: datetime64 [ns, UTC]
Here we will use Groupby () function of pandas to group the columns. So we can do it as follows: Firstly, we created a pandas dataframe: Now, we will get topmost N values of each group of the ‘Variables’ column. Here reset_index () is used to provide a new index according to the grouping of data.
You can use functions: nsmallest - return the first n rows ordered by columns in ascending order nlargest - return the first n rows ordered by columns in descending order to get the top N highest or lowest values in Pandas DataFrame.
If you just want pairings:
from operator import itemgetter as it
from itertools import repeat
n = 3
# sort_values = order pandas < 0.17
new_d = (zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems())))
for _, row in df.iterrows())
for row in new_d:
print(list(row))
Output:
[('B', 'option3'), ('B', 'option4'), ('B', 'option1')]
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')]
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')]
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')]
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]
Which also maintains the order.
If you want a list of lists:
from operator import itemgetter as it
from itertools import repeat
n = 3
new_d = [list(zip(repeat(row["index"]), map(it(0),(row[1:].sort_values(ascending=0)[:n].iteritems()))))
for _, row in df.iterrows()]
Output:
[[('A', 'option3'), ('A', 'option2'), ('A', 'option4')],
[('B', 'option3'), ('B', 'option4'), ('B', 'option1')],
[('C', 'option2'), ('C', 'option5'), ('C', 'option1')],
[('D', 'option5'), ('D', 'option1'), ('D', 'option2')],
[('E', 'option1'), ('E', 'option2'), ('E', 'option3')],
[('F', 'option3'), ('F', 'option1'), ('F', 'option2')]]
Or using pythons sorted:
new_d = [list(zip(repeat(row["index"]), map(it(0), sorted(row[1:].iteritems(), key=it(1) ,reverse=1)[:n])))
for _, row in df.iterrows()]
Which is actually the fastest, if you really want strings, it is pretty trivial to format the output however you want.
Let's assume
N = 3
First of all I will create matrix of input fields and for each field remember what was original option for this cell:
matrix = [[(j, 'option' + str(i)) for j in df['option' + str(i)]] for i in range(1,6)]
The result of this line will be:
[
[(1, 'option1'), (5, 'option1'), (3, 'option1'), (7, 'option1'), (9, 'option1'), (3, 'option1')],
[(8, 'option2'), (4, 'option2'), (5, 'option2'), (6, 'option2'), (9, 'option2'), (2, 'option2')],
[(9, 'option3'), (9, 'option3'), (1, 'option3'), (3, 'option3'), (9, 'option3'), (5, 'option3')],
[(3, 'option4'), (8, 'option4'), (3, 'option4'), (5, 'option4'), (7, 'option4'), (0, 'option4')],
[(2, 'option5'), (3, 'option5'), (4, 'option5'), (9, 'option5'), (4, 'option5'), (2, 'option5')]
]
Then we can easly transform matrix using zip function, sort result rows by first element of tuple and take N first items:
transformed = [sorted(l, key=lambda x: x[0], reverse=True)[:N] for l in zip(*matrix)]
List transformed will look like:
[
[(9, 'option3'), (8, 'option2'), (3, 'option4')],
[(9, 'option3'), (8, 'option4'), (5, 'option1')],
[(5, 'option2'), (4, 'option5'), (3, 'option1')],
[(9, 'option5'), (7, 'option1'), (6, 'option2')],
[(9, 'option1'), (9, 'option2'), (9, 'option3')],
[(5, 'option3'), (3, 'option1'), (2, 'option2')]
]
The last step will be joining column index and result tuple by:
for id, top in zip(df['index'], transformed):
for option in top:
print id + ',' + option[1]
print ''
dfc = df.copy()
result = {}
#First, I would effectively transpose this
for key in dfc:
if key != 'index':
for i in xrange(0,len(dfc['index'])):
if dfc['index'][i] not in result:
result[dfc['index'][i]] = []
result[dfc['index'][i]] += [(key,dfc[key][i])]
def get_topn(result,n):
#Use this to get the top value for each option
return [x[0] for x in sorted(result,key=lambda x:-x[1])[0:min(len(result),n)]]
#Lastly, print the output in your desired format.
n = 3
keys = sorted([k for k in result])
for key in keys:
for option in get_topn(result[key],n):
print str(key) + ',' + str(option)
print
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With