pandas sort a column by values in another column

I have a dataset that I want to sort and then rank.

Suppose it has two columns: year, and value, which is the column I want to sort.

import pandas as pd
data = {'year': pd.Series([2006, 2006, 2007, 2007]), 
        'value': pd.Series([5, 10, 4, 1])}
df = pd.DataFrame(data)

I want to sort the column 'value' within each year, largest first, and then rank it. What I would like to have is:

data2 = {'year': pd.Series([2006, 2006, 2007, 2007]),
         'value': pd.Series([10, 5, 4, 1]),
         'rank': pd.Series([1, 2, 1, 2])}
df2 = pd.DataFrame(data2)

>>> df2
   rank  value  year
0     1     10  2006
1     2      5  2006
2     1      4  2007
3     2      1  2007

John Shin asked Dec 18 '15 01:12


People also ask

How do I sort a column based on another column in pandas?

To sort a DataFrame by the values in a single column, use .sort_values(). By default it returns a new DataFrame sorted in ascending order and does not modify the original.

Can you sort a DataFrame with respect to multiple columns?

You can sort a pandas DataFrame by one or more columns with the sort_values() method, in ascending or descending order.
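
As a minimal sketch of a multi-column sort, using the same data as the question above: passing a list to `ascending` sets the direction per column. The name `sorted_df` is just for illustration.

```python
import pandas as pd

# Same data as in the question above.
df = pd.DataFrame({'year': [2006, 2006, 2007, 2007],
                   'value': [5, 10, 4, 1]})

# Sort by year ascending, then by value descending within each year.
# sort_values returns a new DataFrame; df itself is left unchanged.
sorted_df = df.sort_values(['year', 'value'], ascending=[True, False])

print(sorted_df)
```

The original index labels travel with the rows; call `reset_index(drop=True)` afterwards if you want them renumbered from 0.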

How do I sort a column in pandas series?

By default, the Series sort_values() method sorts in ascending order. You can also pass ascending=True to make this explicit. Any NaN values in the Series are placed at the end.
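
A small sketch of the NaN behaviour, on made-up values: NaN goes last in either sort direction unless you pass `na_position='first'`.

```python
import numpy as np
import pandas as pd

# Illustrative values only; not from the question.
s = pd.Series([3.0, np.nan, 1.0, 2.0])

# Ascending is the default; the NaN entry is placed last.
ascending = s.sort_values()

# Descending order; the NaN entry is still placed last.
descending = s.sort_values(ascending=False)

# na_position='first' moves NaN entries to the front instead.
nan_first = s.sort_values(na_position='first')

print(ascending, descending, nan_first, sep='\n\n')
```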

How do I compare two columns in pandas?

With NumPy's where() function, you supply a condition that compares the columns. For example, if 'column1' is less than both 'column2' and 'column3', the result takes the value of 'column1'; where the condition fails, the result is NaN. The results are stored in a new column of the DataFrame.
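
One way the comparison described above might look. The data and the new column name `smallest` are hypothetical, chosen only to show both branches of the condition.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({'column1': [1, 5, 2],
                   'column2': [3, 4, 9],
                   'column3': [2, 6, 1]})

# True where column1 is less than both of the other columns.
is_smallest = (df['column1'] < df['column2']) & (df['column1'] < df['column3'])

# Keep column1 where the condition holds, NaN elsewhere.
df['smallest'] = np.where(is_smallest, df['column1'], np.nan)

print(df)
```

Because NaN is a float, the new column comes out as float even though the inputs are integers.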


1 Answer

You can use groupby and then rank (with ascending=False so the largest value in each year gets rank 1). You don't need to sort the group keys (sort=False): the result is aligned to the DataFrame's index, so each rank lands on the right row either way, and skipping the sort is slightly faster.

df['yearly_rank'] = df.groupby('year', sort=False)['value'].rank(ascending=False)

>>> df.sort_values(['year', 'yearly_rank'])
   value  year  yearly_rank
1     10  2006            1
0      5  2006            2
2      4  2007            1
3      1  2007            2
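
If you want the exact frame from the question (integer ranks, rows renumbered from 0), something along these lines should work. Note that rank() returns floats by default, and its default method='average' can produce fractional ranks on ties. Using method='first' here is an assumption about how ties should be broken, not something the question specifies; 'min' or 'dense' may suit you better.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({'year': [2006, 2006, 2007, 2007],
                   'value': [5, 10, 4, 1]})

# Rank within each year, largest value first.
# method='first' (an assumption) breaks ties by order of appearance,
# so every rank is a whole number and the cast to int is safe.
df['rank'] = (df.groupby('year', sort=False)['value']
                .rank(method='first', ascending=False)
                .astype(int))

# Order rows by year, then rank, and renumber the index from 0.
result = df.sort_values(['year', 'rank']).reset_index(drop=True)

print(result)
```

The astype(int) cast will fail if 'value' contains NaN, since those rows get a NaN rank; drop or fill them first in that case.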

Alexander answered Nov 15 '22 00:11