Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas - Alternative to rank() function that gives unique ordinal ranks for a column

At this moment I am writing a Python script that aggregates data from multiple Excel sheets. The module I choose to use is Pandas, because of its speed and ease of use with Excel files. The question is only related to the use of Pandas and me trying to create a additional column that contains unique, integer-only, ordinal ranks within a group.

My Python and Pandas knowledge is limited as I am just a beginner.

The Goal

I am trying to achieve the following data structure. Where the top 10 adwords ads are ranked vertically on the basis of their position in Google. In order to do this I need to create a column in the original data (see Table 2 & 3) with a integer-only ranking that contains no duplicate values.

Table 1: Data structure I am trying to achieve

    device  , weeks   , rank_1   , rank_2   , rank_3   , rank_4   , rank_5
    mobile  , wk 1    , string   , string   , string   , string   , string 
    mobile  , wk 2    , string   , string   , string   , string   , string 
    computer, wk 1    , string   , string   , string   , string   , string
    computer, wk 2    , string   , string   , string   , string   , string

The Problem

The exact problem I run into is not being able to efficiently rank the rows with pandas. I have tried a number of things, but I cannot seem to get it ranked in this way.

Table 2: Data structure I have

    weeks    device   , website  , ranking  , adtext
    wk 1     mobile   , url1     , *2.1     , string
    wk 1     mobile   , url2     , *2.1     , string
    wk 1     mobile   , url3     , 1.0      , string
    wk 1     mobile   , url4     , 2.9      , string
    wk 1     desktop  , *url5    , 2.1      , string
    wk 1     desktop  , url2     , *1.5     , string
    wk 1     desktop  , url3     , *1.5     , string
    wk 1     desktop  , url4     , 2.9      , string
    wk 2     mobile   , url1     , 2.0      , string
    wk 2     mobile   , *url6    , 2.1      , string
    wk 2     mobile   , url3     , 1.0      , string
    wk 2     mobile   , url4     , 2.9      , string
    wk 2     desktop  , *url5    , 2.1      , string
    wk 2     desktop  , url2     , *2.9     , string
    wk 2     desktop  , url3     , 1.0      , string
    wk 2     desktop  , url4     , *2.9     , string

Table 3: The table I cannot seem to create

    weeks    device   , website  , ranking  , adtext  , ranking
    wk 1     mobile   , url1     , *2.1     , string  , 2
    wk 1     mobile   , url2     , *2.1     , string  , 3
    wk 1     mobile   , url3     , 1.0      , string  , 1
    wk 1     mobile   , url4     , 2.9      , string  , 4
    wk 1     desktop  , *url5    , 2.1      , string  , 3
    wk 1     desktop  , url2     , *1.5     , string  , 1
    wk 1     desktop  , url3     , *1.5     , string  , 2
    wk 1     desktop  , url4     , 2.9      , string  , 4
    wk 2     mobile   , url1     , 2.0      , string  , 2
    wk 2     mobile   , *url6    , 2.1      , string  , 3
    wk 2     mobile   , url3     , 1.0      , string  , 1
    wk 2     mobile   , url4     , 2.9      , string  , 4
    wk 2     desktop  , *url5    , 2.1      , string  , 2
    wk 2     desktop  , url2     , *2.9     , string  , 3
    wk 2     desktop  , url3     , 1.0      , string  , 1
    wk 2     desktop  , url4     , *2.9     , string  , 4

The standard .rank(ascending=True), gives averages on duplicate values. But since I use these ranks to organize them vertically this does not work out.

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( ascending=True)

The .rank(method="dense", ascending=True) maintains duplicate values and also does not solve my problem

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="dense", ascending=True)

The .rank(method="first", ascending=True) throws a ValueError

df = df.sort_values(['device', 'weeks', 'ranking'], ascending=[True, True, True])

df['newrank'] = df.groupby(['device', 'week'])['ranking'].rank( method="first", ascending=True)

ADDENDUM: If I would find a way to add the rankings in a column, I would then use pivot to transpose the table in the following way.

df = pd.pivot_table(df, index = ['device', 'weeks'], columns='website', values='adtext', aggfunc=lambda x: ' '.join(x))

My question to you

I was hoping any of you could help me find a solution for this problem. This could either an efficient ranking script or something else to help me reach the final data structure.

Thank you!

Sebastiaan


EDIT: Unfortunately, I think I was not clear in my original post. I am looking for a ordinal ranking that only gives integers and has no duplicate values. This means that when there is a duplicate value it will randomly give one a higher ranking than the other.

So what I would like to do is generate a ranking that labels each row with an ordinal value per group. The groups are based on the week number and device. The reason I want to create a new column with this ranking is so that I can make top 10s per week and device.

Also Steven G asked me for an example to play around with. I have provided that here.

Example data can be pasted directly into python

! IMPORTANT: The names are different in this sample. The dataframe is called placeholder, the column names are as follows: 'week', 'website', 'share', 'rank_google', 'device'.

data = {u'week': [u'WK 1', u'WK 2', u'WK 3', u'WK 4', u'WK 2', u'WK 2', u'WK 1',
u'WK 3', u'WK 4', u'WK 3', u'WK 3', u'WK 4', u'WK 2', u'WK 4', u'WK 1', u'WK 1',
u'WK3', u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 4', u'WK 4',
u'WK 4', u'WK 4', u'WK 2', u'WK 3', u'WK 4', u'WK 3', u'WK 4', u'WK 3', u'WK 2',
u'WK 2', u'WK 4', u'WK 1', u'WK 1', u'WK 4', u'WK 4', u'WK 2', u'WK 1', u'WK 3',
u'WK 1', u'WK 4', u'WK 1', u'WK 4', u'WK 2', u'WK 2', u'WK 2', u'WK 4', u'WK 4',
u'WK 4', u'WK 1', u'WK 3', u'WK 4', u'WK 4', u'WK 1', u'WK 4', u'WK 3', u'WK 2',
u'WK 4', u'WK 4', u'WK 4', u'WK 4', u'WK 1'],
u'website': [u'site1.nl', u'website2.de', u'site1.nl', u'site1.nl', u'anothersite.com',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at',
u'url2.at', u'url2.at', u'url2.at', u'anothersite.com', u'url2.at', u'url2.at',
u'anothersite.com', u'site2.co.uk', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk',
u'sitename2.com', u'sitename2.com', u'sitename2.com', u'url3.fi', u'sitename.co.uk',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'ulr2.se',
u'sitename2.com', u'sitename.co.uk', u'sitename2.com', u'sitename2.com', u'sitename2.com',
u'sitename2.com', u'sitename2.com', u'sitename.co.uk', u'sitename.co.uk', u'sitename2.com',
u'facebook.com', u'alsoasite.com', u'ello.com', u'instagram.com', u'alsoasite.com', u'facebook.com',
u'facebook.com', u'singleboersen-vergleich.at', u'facebook.com', u'anothername.com', u'twitter.com',
u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'alsoasite.com', u'facebook.com', u'alsoasite.com',
u'alsoasite.com'],
'adtext': [u'site1.nl 3,9 | < 10\xa0%', u'website2.de 1,4 | < 10\xa0%', u'site1.nl 4,3 | < 10\xa0%',
u'site1.nl 3,8 | < 10\xa0%', u'anothersite.com 2,5 | 12,36 %', u'url2.at 1,3 | 78,68 %', u'url2.at 1,2 | 92,58 %',
u'url2.at 1,1 | 85,47 %', u'url2.at 1,2 | 79,56 %', u'anothersite.com 2,8 | < 10\xa0%', u'url2.at 1,2 | 80,48 %',
u'url2.at 1,2 | 85,63 %', u'url2.at 1,1 | 88,36 %', u'url2.at 1,3 | 87,90 %', u'url2.at 1,1 | 83,70 %',
u'anothersite.com 3,1 | < 10\xa0%', u'url2.at 1,2 | 91,00 %', u'url2.at 1,1 | 92,11 %', u'url2.at 1,2 | 81,28 %'
, u'url2.at 1,1 | 86,49 %', u'anothersite.com 2,7 | < 10\xa0%', u'url2.at 1,2 | 83,96 %', u'url2.at 1,2 | 75,48 %'
, u'anothersite.com 3,0 | < 10\xa0%', u'site2.co.uk 3,1 | 16,24 %', u'sitename2.com 2,3 | 34,85 %',
u'sitename.co.uk 3,5 | < 10\xa0%', u'sitename.co.uk 3,6 | < 10\xa0%', u'sitename2.com 2,1 | < 10\xa0%',
u'sitename2.com 2,2 | 13,55 %', u'sitename2.com 2,1 | 47,91 %', u'url3.fi 3,4 | < 10\xa0%',
u'sitename.co.uk 3,1 | 14,15 %', u'sitename2.com 2,4 | 28,77 %', u'sitename.co.uk 3,1 | 22,55 %',
u'sitename2.com 2,1 | 17,03 %', u'sitename2.com 2,1 | 24,46 %', u'ulr2.se 2,7 | < 10\xa0%',
u'sitename2.com 2,0 | 49,12 %', u'sitename.co.uk 3,0 | < 10\xa0%', u'sitename2.com 2,1 | 40,00 %',
u'sitename2.com 2,1 | < 10\xa0%', u'sitename2.com 2,2 | 30,29 %', u'sitename2.com 2,0 |47,48 %',
u'sitename2.com 2,1 | 32,17 %', u'sitename.co.uk 3,2 | < 10\xa0%', u'sitename.co.uk 3,1 | 12,77 %',
u'sitename2.com 2,6 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'alsoasite.com 2,3 | < 10\xa0%',
u'ello.com 1,8 | < 10\xa0%',u'instagram.com 5,0 | < 10\xa0%', u'alsoasite.com 2,2 | < 10\xa0%',
u'facebook.com 3,0 | < 10\xa0%', u'facebook.com 3,2 | < 10\xa0%', u'singleboersen-vergleich.at 2,6 | < 10\xa0%',
u'facebook.com 3,4 | < 10\xa0%', u'anothername.com 1,9 | <10\xa0%', u'twitter.com 4,4 | < 10\xa0%',
u'alsoasite.com 1,1 | 12,35 %', u'alsoasite.com 1,1 | 11,22 %', u'alsoasite.com 2,0 | < 10\xa0%',
u'alsoasite.com 1,1| 10,86 %', u'facebook.com 3,4 | < 10\xa0%', u'alsoasite.com 1,1 | 10,82 %',
u'alsoasite.com 1,1 | < 10\xa0%'],
u'share': [u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,36 %', u'78,68 %',
u'92,58 %', u'85,47 %', u'79,56 %', u'< 10\xa0%', u'80,48 %', u'85,63 %', u'88,36 %',
u'87,90 %', u'83,70 %', u'< 10\xa0%', u'91,00 %', u'92,11 %', u'81,28 %', u'86,49 %',
u'< 10\xa0%', u'83,96 %', u'75,48 %', u'< 10\xa0%', u'16,24 %', u'34,85 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'13,55 %', u'47,91 %', u'< 10\xa0%', u'14,15 %', u'28,77 %',
u'22,55 %', u'17,03 %', u'24,46 %', u'< 10\xa0%', u'49,12 %', u'< 10\xa0%', u'40,00 %',
u'< 10\xa0%', u'30,29 %', u'47,48 %', u'32,17 %', u'< 10\xa0%', u'12,77 %', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%',
u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'< 10\xa0%', u'12,35 %', u'11,22 %', u'< 10\xa0%',
u'10,86 %', u'< 10\xa0%', u'10,82 %', u'< 10\xa0%'],
u'rank_google': [u'3,9', u'1,4', u'4,3', u'3,8', u'2,5', u'1,3', u'1,2', u'1,1', u'1,2', u'2,8',
u'1,2', u'1,2', u'1,1', u'1,3', u'1,1', u'3,1', u'1,2', u'1,1', u'1,2', u'1,1', u'2,7', u'1,2',
u'1,2', u'3,0', u'3,1', u'2,3', u'3,5', u'3,6', u'2,1', u'2,2', u'2,1', u'3,4', u'3,1', u'2,4',
u'3,1', u'2,1', u'2,1', u'2,7', u'2,0', u'3,0', u'2,1', u'2,1', u'2,2', u'2,0', u'2,1', u'3,2',
u'3,1', u'2,6', u'3,2', u'2,3', u'1,8', u'5,0', u'2,2', u'3,0', u'3,2', u'2,6', u'3,4', u'1,9',
u'4,4', u'1,1', u'1,1', u'2,0', u'1,1', u'3,4', u'1,1', u'1,1'],
u'device': [u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Tablet', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Computer', u'Tablet', u'Tablet',
u'Tablet', u'Mobile', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet', u'Tablet', u'Mobile',
u'Tablet', u'Mobile', u'Computer', u'Computer', u'Tablet', u'Mobile', u'Tablet', u'Mobile', u'Tablet',
u'Mobile', u'Mobile', u'Mobile', u'Tablet', u'Computer', u'Tablet', u'Computer', u'Mobile', u'Tablet',
u'Tablet', u'Tablet', u'Mobile', u'Computer', u'Mobile', u'Computer', u'Tablet', u'Tablet', u'Tablet',
u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Tablet', u'Mobile', u'Mobile', u'Computer',
u'Mobile', u'Tablet', u'Mobile', u'Mobile']}

placeholder = pd.DataFrame(data)

Error I receive when I use the rank() function with method='first'

C:\Users\username\code\report-creator>python recomp-report-04.py
Traceback (most recent call last):
  File "recomp-report-04.py", line 71, in <module>
    placeholder['ranking'] = placeholder.groupby(['week', 'device'])['rank_googl
e'].rank(method='first').astype(int)
  File "<string>", line 35, in rank
  File "C:\Users\sthuis\AppData\Local\Continuum\Anaconda2\lib\site-packages\pand
as\core\groupby.py", line 561, in wrapper
    raise ValueError
ValueError

My solution

Effectively, the answer is given by @Nickil Maveli. A huge thank you! Nevertheless, I thought it might be smart to outline how I finally incorporated the solution.

Rank(method='first') is a good way to get an ordinal ranking. But since I was working with numbers that were formatted in the European way, pandas interpreted them as strings and could not rank them this way. I came to this conclusion by the reaction of Nickil Maveli and trying to rank each group individually. I did that through the following code.

for name, group in df.sort_values(by='rank_google').groupby(['weeks', 'device']):
    df['new_rank'] = group['ranking'].rank(method='first').astype(int)

This gave me the following error:

ValueError: first not supported for non-numeric data

So this helped me realize that I should convert the column to floats. This is how I did it.

# Converting the ranking column to a float
df['ranking'] = df['ranking'].apply(lambda x: float(unicode(x.replace(',','.'))))

# Creating a new column with a rank
df['new_rank'] = df.groupby(['weeks', 'device'])['ranking'].rank(method='first').astype(int)

# Dropping all ranks after the 10
df = df.sort_values('new_rank').groupby(['weeks', 'device']).head(n=10)

# Pivotting the column
df = pd.pivot_table(df, index = ['device', 'weeks'], columns='new_rank', values='adtext', aggfunc=lambda x: ' '.join(x))

# Naming the columns with 'top' + number
df.columns = ['top ' + str(i) for i in list(df.columns.values)]

So this worked for me. Thank you guys!

like image 323
Fluous Avatar asked Sep 26 '16 15:09

Fluous


People also ask

How does Python calculate percentile rank?

We will use the rank() function with the argument pct = True to find the percentile rank.


1 Answers

I think the way you were trying to use the method=first to rank them after sorting were causing problems.

You could simply use the rank method with first arg on the grouped object itself giving you the desired unique ranks per group.

df['new_rank'] = df.groupby(['weeks','device'])['ranking'].rank(method='first').astype(int)
print (df['new_rank'])

0     2
1     3
2     1
3     4
4     3
5     1
6     2
7     4
8     2
9     3
10    1
11    4
12    2
13    3
14    1
15    4
Name: new_rank, dtype: int32

Perform pivot operation:

df = df.pivot_table(index=['weeks', 'device'], columns=['new_rank'],
                    values=['adtext'], aggfunc=lambda x: ' '.join(x))

Choose the second level of the multiindex columns which pertain to the rank numbers:

df.columns = ['rank_' + str(i) for i in df.columns.get_level_values(1)]
df

Image_2


Data:(to replicate)

df = pd.DataFrame({'weeks': ['wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1', 'wk 1',
                             'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2', 'wk 2'],
                  'device': ['mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop',
                             'mobile', 'mobile', 'mobile', 'mobile', 'desktop', 'desktop', 'desktop', 'desktop'],
                  'website': ['url1', 'url2', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4',
                             'url1', 'url16', 'url3', 'url4', 'url5', 'url2', 'url3', 'url4'],
                  'ranking': [2.1, 2.1, 1.0, 2.9, 2.1, 1.5, 1.5, 2.9, 
                              2.0, 2.1, 1.0, 2.9, 2.1, 2.9, 1.0, 2.9],
                  'adtext': ['string', 'string', 'string', 'string', 'string', 'string', 'string', 'string',
                             'string', 'string', 'string', 'string', 'string', 'string', 'string', 'string']})

Note: method=first assigns ranks in the order they appear in the array/series.

like image 172
Nickil Maveli Avatar answered Oct 20 '22 13:10

Nickil Maveli