Rank based on several variables


Tags: r, dplyr, ranking, rank

This is a small example. In my larger dataset, I have multiple years of data, and the number of observations per group (div) is not always equal.

Example data:

set.seed(1)
df<-data.frame(
  year = 2014,
  id = sample(LETTERS[1:26], 12),
  div = rep(c("1", "2a", "2b"), each=4),
  pts = c(9,7,9,3,7,5,3,7,2,7,7,1),
  x = c(10,12,11,7,7,5,4,12,4,6,7,2)
)

#   year id div pts  x
#1  2014  G   1   9 10
#2  2014  J   1   7 12
#3  2014  N   1   9 11
#4  2014  U   1   3  7
#5  2014  E  2a   7  7
#6  2014  S  2a   5  5
#7  2014  W  2a   3  4
#8  2014  M  2a   7 12
#9  2014  L  2b   2  4
#10 2014  B  2b   7  6
#11 2014  D  2b   7  7
#12 2014  C  2b   1  2

I want to rank this data so that individuals in div 1 are ranked higher than those in div 2a/2b. Within div 1, individuals are ranked 1, 2, 3, 4 by highest 'pts', with ties broken by highest 'x'.

Individuals in div 2a and div 2b should each be ranked within their own division using the same criteria. This would look like this:

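library(dplyr)

df %>%
  group_by(div) %>%
  # sort by div first so the rows come out in the order shown below
  arrange(div, desc(pts), desc(x)) %>%
  # number the rows within each div in the sorted order
  mutate(position = row_number())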


#   year id div pts  x position
#1  2014  N   1   9 11        1
#2  2014  G   1   9 10        2
#3  2014  J   1   7 12        3
#4  2014  U   1   3  7        4
#5  2014  M  2a   7 12        1
#6  2014  E  2a   7  7        2
#7  2014  S  2a   5  5        3
#8  2014  W  2a   3  4        4
#9  2014  D  2b   7  7        1
#10 2014  B  2b   7  6        2
#11 2014  L  2b   2  4        3
#12 2014  C  2b   1  2        4

However, I want to produce a final column that is another rank. It should rank all individuals in div 1 above those in 2a/2b, while treating 2a and 2b as equal: individuals at position 1 in 2a/2b should get 5.5, those at position 2 should get 7.5, and so on. There are always equal numbers of individuals in div 2a and div 2b in every year.

It should look like this:

#   year id div pts  x position final
#1  2014  N   1   9 11        1   1.0  
#2  2014  G   1   9 10        2   2.0
#3  2014  J   1   7 12        3   3.0
#4  2014  U   1   3  7        4   4.0
#5  2014  M  2a   7 12        1   5.5
#6  2014  E  2a   7  7        2   7.5
#7  2014  S  2a   5  5        3   9.5
#8  2014  W  2a   3  4        4  11.5
#9  2014  D  2b   7  7        1   5.5
#10 2014  B  2b   7  6        2   7.5  
#11 2014  L  2b   2  4        3   9.5
#12 2014  C  2b   1  2        4  11.5
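
The arithmetic behind those values can be checked with a small base R sketch. The helper name final_rank_tied is hypothetical, and it assumes exactly two tied divisions (2a and 2b) of equal size: the two individuals at position p share the overall slots n1 + 2(p - 1) + 1 and the one after it, so each gets the mean of those two slots.

```r
# Hypothetical helper (not part of the question): expected final rank for
# position p in the two tied divisions, given n1 individuals in div 1.
final_rank_tied <- function(p, n1) {
  first_slot <- n1 + 2 * (p - 1) + 1   # first overall slot shared by the pair
  (first_slot + first_slot + 1) / 2    # mean of the two shared slots
}

final_rank_tied(1:4, n1 = 4)
# 5.5  7.5  9.5 11.5
```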

Ideally, I need a dplyr solution. It also needs to generalize to years where the number of individuals in div 1 varies and the number in div 2a/2b varies (although length(div2a) == length(div2b) always holds).


asked Feb 18 '15 16:02

jalapic

2 Answers

This is how I'd do it:

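library(data.table)
dt = as.data.table(df)

# rank within each div by descending pts, then descending x
dt[order(-pts, -x), rank.init := seq_len(.N), by = div]

# strip the letter suffix so that 2a and 2b share the same key ("2")
dt[, div.clean := sub('(\\d+).*', '\\1', div)]
setorder(dt, div.clean, rank.init)

# .I is the row number in the sorted table; rows that tie on
# (div.clean, rank.init) are adjacent, so their mean is the average rank
dt[, rank.final := mean(.I), by = .(div.clean, rank.init)]
setorder(dt, div, rank.final)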
#    year id div pts  x rank.init div.clean rank.final
# 1: 2014  N   1   9 11         1         1        1.0
# 2: 2014  G   1   9 10         2         1        2.0
# 3: 2014  J   1   7 12         3         1        3.0
# 4: 2014  U   1   3  7         4         1        4.0
# 5: 2014  M  2a   7 12         1         2        5.5
# 6: 2014  E  2a   7  7         2         2        7.5
# 7: 2014  S  2a   5  5         3         2        9.5
# 8: 2014  W  2a   3  4         4         2       11.5
# 9: 2014  D  2b   7  7         1         2        5.5
#10: 2014  B  2b   7  6         2         2        7.5
#11: 2014  L  2b   2  4         3         2        9.5
#12: 2014  C  2b   1  2         4         2       11.5
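
For readers without data.table, the mean-of-row-indices step can be illustrated in base R. This is only a sketch under the assumption that the within-division ranks are already known; the vectors below are typed in by hand from the example output rather than computed from df.

```r
# Keys taken from the example: 2a and 2b both collapse to "2"
div.clean <- rep(c("1", "2", "2"), each = 4)
rank.init <- rep(1:4, times = 3)

# Row index each observation would have after sorting by (div.clean, rank.init)
ord <- order(div.clean, rank.init)
idx <- numeric(length(ord))
idx[ord] <- seq_along(ord)

# Average the sorted row indices over rows sharing the same key
rank.final <- ave(idx, div.clean, rank.init, FUN = mean)
rank.final
# 1.0 2.0 3.0 4.0 5.5 7.5 9.5 11.5 5.5 7.5 9.5 11.5
```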


answered Nov 09 '22 15:11

eddi

@eddi's answer is already very nice. I just wanted to illustrate the same approach using the frank() function from the development version of data.table, v1.9.5, which can compute ranks on vectors, lists, data.frames or data.tables.

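library(data.table)

# from @eddi's answer: strip the letter suffix so that 2a and 2b share one key
setDT(df)[, div.clean := sub('(\\d+).*', '\\1', div)]

# rank within each div by descending pts, then descending x
df[, position := frank(.SD, -pts, -x, ties.method="first"), by=div]
# rank overall by (div.clean, position); tied rows get the average rank
df[, final := frank(.SD, div.clean, position, ties.method="average")]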

This also retains the original order, if that's of any importance.

I'll leave the conversion to dplyr to you.

answered Nov 09 '22 16:11

Arun
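
Neither answer includes the dplyr conversion, so here is one possible sketch of it. It is not from either answerer. It assumes that year should be a grouping variable for the multi-year case, reuses the div.clean idea from @eddi's answer, and types the ids in by hand from the example data so that the result does not depend on sample(). The idx column is a temporary helper introduced for illustration.

```r
library(dplyr)

df <- data.frame(
  year = 2014,
  id   = c("G", "J", "N", "U", "E", "S", "W", "M", "L", "B", "D", "C"),
  div  = rep(c("1", "2a", "2b"), each = 4),
  pts  = c(9, 7, 9, 3, 7, 5, 3, 7, 2, 7, 7, 1),
  x    = c(10, 12, 11, 7, 7, 5, 4, 12, 4, 6, 7, 2)
)

res <- df %>%
  # position: rank within each year/div by descending pts, then descending x
  group_by(year, div) %>%
  arrange(desc(pts), desc(x), .by_group = TRUE) %>%
  mutate(position = row_number()) %>%
  ungroup() %>%
  # collapse 2a and 2b to a common key
  mutate(div.clean = sub("(\\d+).*", "\\1", div)) %>%
  # idx: row index within each year after sorting by (div.clean, position)
  group_by(year) %>%
  arrange(div.clean, position, .by_group = TRUE) %>%
  mutate(idx = row_number()) %>%
  # final: mean of the row indices shared by tied rows
  group_by(year, div.clean, position) %>%
  mutate(final = mean(idx)) %>%
  ungroup() %>%
  select(-idx) %>%
  arrange(year, div, position)

res$final
# 1.0 2.0 3.0 4.0 5.5 7.5 9.5 11.5 5.5 7.5 9.5 11.5
```

Because every step is grouped by year, the div 1 count can differ from year to year. The averaging step makes no assumption about group sizes, although the 5.5, 7.5, ... pattern relies on 2a and 2b having equal sizes, as stated in the question.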
