Pandas - Alternative to rank() function that gives unique ordinal ranks for a column

Tags:

At this moment I am writing a Python script that aggregates data from multiple Excel sheets. The module I choose to use is Pandas, because of its speed and ease of use with Excel files. The question is only related to the use of Pandas and me trying to create a additional column that contains unique, integer-only, ordinal ranks within a group.

My Python and Pandas knowledge is limited as I am just a beginner.

The Goal

I am trying to achieve the following data structure. Where the top 10 adwords ads are ranked vertically on the basis of their position in Google. In order to do this I need to create a column in the original data (see Table 2 & 3) with a integer-only ranking that contains no duplicate values.

Table 1: Data structure I am trying to achieve

    device  , weeks   , rank_1   , rank_2   , rank_3   , rank_4   , rank_5
    mobile  , wk 1    , string   , string   , string   , string   , string 
    mobile  , wk 2    , string   , string   , string   , string   , string 
    computer, wk 1    , string   , string   , string   , string   , string
    computer, wk 2    , string   , string   , string   , string   , string

The Problem

The exact problem I run into is not being able to efficiently rank the rows with pandas. I have tried a number of things, but I cannot seem to get it ranked in this way.

Table 2: Data structure I have

    weeks    device   , website  , ranking  , adtext
    wk 1     mobile   , url1     , *2.1     , string
    wk 1     mobile   , url2     , *2.1     , string
    wk 1     mobile   , url3     , 1.0      , string
    wk 1     mobile   , url4     , 2.9      , string
    wk 1     desktop  , *url5    , 2.1      , string
    wk 1     desktop  , url2     , *1.5     , string
    wk 1     desktop  , url3     , *1.5     , string
    wk 1     desktop  , url4     , 2.9      , string
    wk 2     mobile   , url1     , 2.0      , string
    wk 2     mobile   , *url6    , 2.1      , string
    wk 2     mobile   , url3     , 1.0      , string
    wk 2     mobile   , url4     , 2.9      , string
    wk 2     desktop  , *url5    , 2.1      , string
    wk 2     desktop  , url2     , *2.9     , string
    wk 2     desktop  , url3     , 1.0      , string
    wk 2     desktop  , url4     , *2.9     , string

Table 3: The table I cannot seem to create

    weeks    device   , website  , ranking  , adtext  , ranking
    wk 1     mobile   , url1     , *2.1     , string  , 2
    wk 1     mobile   , url2     , *2.1     , string  , 3
    wk 1     mobile   , url3     , 1.0      , string  , 1
    wk 1     mobile   , url4     , 2.9      , string  , 4
    wk 1     desktop  , *url5    , 2.1      , string  , 3
    wk 1     desktop  , url2     , *1.5     , string  , 1
    wk 1     desktop  , url3     , *1.5     , string  , 2
    wk 1     desktop  , url4     , 2.9      , string  , 4
    wk 2     mobile   , url1     , 2.0      , string  , 2
    wk 2     mobile   , *url6    , 2.1      , string  , 3
    wk 2     mobile   , url3     , 1.0      , string  , 1
    wk 2     mobile   , url4     , 2.9      , string  , 4
    wk 2     desktop  , *url5    , 2.1      , string  , 2
    wk 2     desktop  , url2     , *2.9     , string  , 3
    wk 2     desktop  , url3     , 1.0      , string  , 1
    wk 2     desktop  , url4     , *2.9     , string  , 4

The standard .rank(ascending=True), gives averages on duplicate values. But since I use these ranks to organize them vertically this does not work out.

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( ascending=True)

The .rank(method="dense", ascending=True) maintains duplicate values and also does not solve my problem

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="dense", ascending=True)

The .rank(method="first", ascending=True) throws a ValueError

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="first", ascending=True)

ADDENDUM: If I would find a way to add the rankings in a column, I would then use pivot to transpose the table in the following way.

df = pd.pivot_table(df, index = ['device', 'weeks'], columns='website', values='adtext', aggfunc=lambda x: ' '.join(x))

My question to you

I was hoping any of you could help me find a solution for this problem. This could either an efficient ranking script or something else to help me reach the final data structure.

Thank you!

Sebastiaan

EDIT: Unfortunately, I think I was not clear in my original post. I am looking for a ordinal ranking that only gives integers and has no duplicate values. This means that when there is a duplicate value it will randomly give one a higher ranking than the other.

So what I would like to do is generate a ranking that labels each row with an ordinal value per group. The groups are based on the week number and device. The reason I want to create a new column with this ranking is so that I can make top 10s per week and device.

Also Steven G asked me for an example to play around with. I have provided that here.

Example data can be pasted directly into python

! IMPORTANT: The names are different in this sample. The dataframe is called placeholder, the column names are as follows: 'week', 'website', 'share', 'rank_google', 'device'.

data = {u'week': [u'WK 1', u'WK 2', u'WK 3', u'WK 4', u'WK 2', u'WK 2', u'WK 1',
u'WK 3', u'WK 4', u'WK 3', u'WK 3', u'WK 4', u'WK 2', u'WK 4', u'WK 1', u'WK 1',
u'WK3', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 4', u'WK 4',
u'WK 4', u'WK 4', u'WK 2', u'WK 3', u'WK 4', u'WK 3', u'WK 4', u'WK 3', u'WK 2',
u'WK 2', u'WK 4', u'WK 1', u'WK 1', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 3',
u'WK 1', u'WK 4', u'WK 1', u'WK 4', u'WK 2', u'WK 2', u'WK 2', u'WK 4', u'WK 4',
u'WK 4', u'WK 1', u'WK 3', u'WK 4', u'WK 4', u'WK 1', u'WK 4', u'WK 3', u'WK 2',
u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 1'],
u'website': [u'site1.nl', u'website2.de', u'site1.nl', u'site1.nl', u'anothersite.com',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at',
u'anothersite.com', u'site2.co.uk', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk',
u'sitename2.com', u'sitename2.com', u'sitename2.com', u'url3.fi', u'sitename.co.uk',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'ulr2.se',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'sitename2.com',
u'sitename2.com', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk', u'sitename2.com',
u'facebook.com', u'alsoasite.com', u'ello.com', u'instagram.com', u'alsoasite.com', u'facebook.com',
u'facebook.com', u'singleboersen-vergleich.at', u'facebook.com', u'anothername.com', u'twitter.com',
u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'facebook.com', u'alsoasite.com',
u'alsoasite.com'],
'adtext': [u'site1.nl 3,9 | < 10\xa0%', u'website2.de 1,4 | < 10\xa0%', u'site1.nl 4,3 | < 10\xa0%',
u'site1.nl 3,8 | < 10\xa0%', u'anothersite.com 2,5 | 12,36 %', u'url2.at 1,3 | 78,68 %', u'url2.at 1,2 | 92,58 %',
u'url2.at 1,1 | 85,47 %', u'url2.at 1,2 | 79,56 %', u'anothersite.com 2,8 | < 10\xa0%', u'url2.at 1,2 | 80,48 %',
u'url2.at 1,2 | 85,63 %', u'url2.at 1,1 | 88,36 %', u'url2.at 1,3 | 87,90 %', u'url2.at 1,1 | 83,70 %',
u'anothersite.com 3,1 | < 10\xa0%', u'url2.at 1,2 | 91,00 %', u'url2.at 1,1 | 92,11 %', u'url2.at 1,2 | 81,28 %'
, u'url2.at 1,1 | 86,49 %', u'anothersite.com 2,7 | < 10\xa0%', u'url2.at 1,2 | 83,96 %', u'url2.at 1,2 | 75,48 %'
, u'anothersite.com 3,0 | < 10\xa0%', u'site2.co.uk 3,1 | 16,24 %', u'sitename2.com 2,3 | 34,85 %',
u'sitename.co.uk 3,5 | < 10\xa0%', u'sitename.co.uk 3,6 | < 10\xa0%', u'sitename2.com 2,1 | < 10\xa0%',
u'sitename2.com 2,2 | 13,55 %', u'sitename2.com 2,1 | 47,91 %', u'url3.fi 3,4 | < 10\xa0%',
u'sitename.co.uk 3,1 | 14,15 %', u'sitename2.com 2,4 | 28,77 %', u'sitename.co.uk 3,1 | 22,55 %',
u'sitename2.com 2,1 | 17,03 %', u'sitename2.com 2,1 | 24,46 %', u'ulr2.se 2,7 | < 10\xa0%',
u'sitename2.com 2,0 | 49,12 %', u'sitename.co.uk 3,0 | < 10\xa0%', u'sitename2.com 2,1 | 40,00 %',
u'sitename2.com 2,1 | < 10\xa0%', u'sitename2.com 2,2 | 30,29 %', u'sitename2.com 2,0 |47,48 %',
u'sitename2.com 2,1 | 32,17 %', u'sitename.co.uk 3,2 | < 10\xa0%', u'sitename.co.uk 3,1 | 12,77 %',
u'sitename2.com 2,6 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'alsoasite.com 2,3 | < 10\xa0%',
u'ello.com 1,8 | < 10\xa0%',u'instagram.com 5,0 | < 10\xa0%', u'alsoasite.com 2,2 | < 10\xa0%',
u'facebook.com 3,0 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'singleboersen-vergleich.at 2,6 | < 10\xa0%',
u'facebook.com 3,4 | < 10\xa0%', u'anothername.com 1,9 | <10\xa0%', u'twitter.com 4,4 | < 10\xa0%',
u'alsoasite.com 1,1 | 12,35 %', u'alsoasite.com 1,1 | 11,22 %', u'alsoasite.com 2,0 | < 10\xa0%',
u'alsoasite.com 1,1| 10,86 %', u'facebook.com 3,4 | < 10\xa0%', u'alsoasite.com 1,1 | 10,82 %',
u'alsoasite.com 1,1 | < 10\xa0%'],
u'share': [u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,36 %', u'78,68 %',
u'92,58 %', u'85,47 %', u'79,56 %', u'< 10\xa0%', u'80,48 %', u'85,63 %', u'88,36 %',
u'87,90 %', u'83,70 %', u'< 10\xa0%', u'91,00 %', u'92,11 %', u'81,28 %', u'86,49 %',
u'< 10\xa0%', u'83,96 %', u'75,48 %', u'< 10\xa0%', u'16,24 %', u'34,85 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'13,55 %', u'47,91 %', u'< 10\xa0%', u'14,15 %', u'28,77 %',
u'22,55 %', u'17,03 %', u'24,46 %', u'< 10\xa0%', u'49,12 %', u'< 10\xa0%', u'40,00 %',
u'< 10\xa0%', u'30,29 %', u'47,48 %', u'32,17 %', u'< 10\xa0%', u'12,77 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,35 %', u'11,22 %', u'< 10\xa0%',
u'10,86 %', u'< 10\xa0%', u'10,82 %', u'< 10\xa0%'],
u'rank_google': [u'3,9', u'1,4', u'4,3', u'3,8', u'2,5', u'1,3', u'1,2', u'1,1', u'1,2', u'2,8',
u'1,2', u'1,2', u'1,1', u'1,3', u'1,1', u'3,1', u'1,2', u'1,1', u'1,2', u'1,1', u'2,7', u'1,2',
u'1,2', u'3,0', u'3,1', u'2,3', u'3,5', u'3,6', u'2,1', u'2,2', u'2,1', u'3,4', u'3,1', u'2,4',
u'3,1', u'2,1', u'2,1', u'2,7', u'2,0', u'3,0', u'2,1', u'2,1', u'2,2', u'2,0', u'2,1', u'3,2',
u'3,1', u'2,6', u'3,2', u'2,3', u'1,8', u'5,0', u'2,2', u'3,0', u'3,2', u'2,6', u'3,4', u'1,9',
u'4,4', u'1,1', u'1,1', u'2,0', u'1,1', u'3,4', u'1,1', u'1,1'],
u'device': [u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Tablet', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Computer', u'Tablet', u'Tablet',
u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Tablet', u'Mobile',
u'Tablet', u'Mobile', u'Computer', u'Computer', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Tablet',
u'Mobile', u'Mobile', u'Mobile', u'Tablet', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet',
u'Tablet', u'Tablet', u'Mobile', u'Computer', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Tablet',
u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Mobile']}

placeholder = pd.DataFrame(data)

Error I receive when I use the rank() function with method='first'

C:\Users\username\code\report-creator>python recomp-report-04.py
Traceback (most recent call last):
  File "recomp-report-04.py", line 71, in <module>
    placeholder['ranking'] = placeholder.groupby(['week', 'device'])['rank_googl
e'].rank(method='first').astype(int)
  File "<string>", line 35, in rank
  File "C:\Users\sthuis\AppData\Local\Continuum\Anaconda2\lib\site-packages\pand
as\core\groupby.py", line 561, in wrapper
    raise ValueError
ValueError

My solution

Effectively, the answer is given by @Nickil Maveli. A huge thank you! Nevertheless, I thought it might be smart to outline how I finally incorporated the solution.

Rank(method='first') is a good way to get an ordinal ranking. But since I was working with numbers that were formatted in the European way, pandas interpreted them as strings and could not rank them this way. I came to this conclusion by the reaction of Nickil Maveli and trying to rank each group individually. I did that through the following code.

for name, group in df.sort_values(by='rank_google').groupby(['weeks', 'device']):
    df['new_rank'] = group['ranking'].rank(method='first').astype(int)

This gave me the following error:

ValueError: first not supported for non-numeric data

So this helped me realize that I should convert the column to floats. This is how I did it.

# Converting the ranking column to a float
df['ranking'] = df['ranking'].apply(lambda x: float(unicode(x.replace(',','.'))))

# Creating a new column with a rank
df['new_rank'] = df.groupby(['weeks', 'device'])['ranking'].rank(method='first').astype(int)

# Dropping all ranks after the 10
df = df.sort_values('new_rank').groupby(['weeks', 'device']).head(n=10)

# Pivotting the column
df = pd.pivot_table(df, index = ['device', 'weeks'], columns='new_rank', values='adtext', aggfunc=lambda x: ' '.join(x))

# Naming the columns with 'top' + number
df.columns = ['top ' + str(i) for i in list(df.columns.values)]

So this worked for me. Thank you guys!

323

asked Sep 26 '16 15:09

Fluous

1 Answers

I think the way you were trying to use the method=first to rank them after sorting were causing problems.

You could simply use the rank method with first arg on the grouped object itself giving you the desired unique ranks per group.

df['new_rank'] = df.groupby(['weeks','device'])['ranking'].rank(method='first').astype(int)
print (df['new_rank'])

0     2
1     3
2     1
3     4
4     3
5     1
6     2
7     4
8     2
9     3
10    1
11    4
12    2
13    3
14    1
15    4
Name: new_rank, dtype: int32

Perform pivot operation:

df = df.pivot_table(index=['weeks', 'device'], columns=['new_rank'],
                    values=['adtext'], aggfunc=lambda x: ' '.join(x))

Choose the second level of the multiindex columns which pertain to the rank numbers:

df.columns = ['rank_' + str(i) for i in df.columns.get_level_values(1)]
df

Data:(to replicate)

df = pd.DataFrame({'weeks': ['wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1',
                             'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2'],
                  'device': ['mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop',
                             'mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop'],
                  'website': ['url1', 'url2', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4',
                             'url1', 'url16', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4'],
                  'ranking': [2.1, 2.1, 1.0, 2.9, 2.1, 1.5, 1.5, 2.9, 
                              2.0, 2.1, 1.0, 2.9, 2.1, 2.9, 1.0, 2.9],
                  'adtext': ['string', 'string', 'string', 'string', 'string', 'string', 'string', 'string',
                             'string', 'string', 'string', 'string', 'string', 'string', 'string', 'string']})

Note: method=first assigns ranks in the order they appear in the array/series.

172

answered Oct 20 '22 13:10

Nickil Maveli

Related questions
                            
                                drop rows with errors for pandas data coercion
                            
                                OpenCV-Python: How to detect a hotspot in thermal image?
                            
                                Py2exe error: [Errno 2] No such file or directory
                            
                                Initial node's ids when creating graph from edge list
                            
                                Private Variables and Class-local References
                            
                                Use base class's property/attribute as a table column?
                            
                                Exception when training data in Predictionio
                            
                                Counting how many times a row occurs in a matrix (numpy)
                            
                                Pandas cannot load data, csv encoding mystery
                            
                                Converting dictionary with known indices to a multidimensional array
                            
                                How to run a process with timeout and still get stdout at runtime
                            
                                Detection of irregular shapes using houghcircle function opencv python
                            
                                netCDF4 import not being found by Python
                            
                                Split string by the first occurance from a set of delimiters with Python and regex
                            
                                Python's mock throwing AttributeError: 'module' object has no attribute 'patch'
                            
                                IndexError: fail to coerce slice entry of type tensorvariable to integer
                            
                                Why does getaddrinfo not return all IP Addresses?
                            
                                update dataframe with series
                            
                                Install python package zipline on cloud 9 environment Support workspace python
                            
                                Is subclassing from object the same as defining type as metaclass?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Pandas - Alternative to rank() function that gives unique ordinal ranks for a column

Tags:

python

pandas

ordinal

ranking

rank

Fluous

People also ask

1 Answers

Nickil Maveli

Recent Activity

Donate For Us