Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Return highest value in vector and string in related column in R

Tags:

r

max

I have data on multiple projects, where each project is composed of a team of individuals. I have information on their rank and their gender, like so:

df <- read.table(header = TRUE, text = 'Project GenderA RankA GenderB RankB GenderC RankC
             100 1 3 0 1 1 2
             200 1 2 1 2 NA NA
             300 0 3 1 3 0 2
             400 0 1 NA NA NA NA
             500 1 1 0 2 1 1')

For each project, I want R to create a new variable with the highest rank on the team and that person's gender.

  Project GenderA RankA GenderB RankB GenderC RankC HighGen HighRank
  100     1       3     0       1     1       2     1       3

I can obtain the highest rank with pmax:

df<-cbind(df,
  HighRank = pmax(df$RankA, df$RankB, df$RankC))

But this works only for projects that are complete (that is, with no NAs), and is not associated with the gender column. What is the best way to do both?

like image 675
ricatom Avatar asked Jan 17 '17 20:01

ricatom


1 Answers

This is an opinionated answer, but the reason that this is hard to do in one go is that your data is not tidy. That means in this case that you want each row to have one observation. I give the 'tidy' representation and a solution using the dplyr package:

library(dplyr)

df <- read.table(header = TRUE, text = 'Project Gender Rank order
             100  1 3 A 
             100  0 1 B
             100  1 2 C
             200  1 2 A 
             200  1 2 B
             200  NA NA C
             300  0 3 A
             300  1 3 B
             300  0 2 C')

df %>% group_by(Project) %>% arrange(-Rank) %>% slice(1) %>% ungroup()

To explain that last line, you group by project, sort all scores in those projects by rank, then filter for the first, and finally, for good measure, you ungroup, meaning you 'forget' the grouping. Your output is then the highest scoring record for each project, including it's score, gender. Note that I also put in a new variable 'order', to reflect the fact that data was scored under 'rankA' or 'rankB' etcetera.

Out:

| Project| Gender| Rank|order |
|-------:|------:|----:|:-----|
|     100|      1|    3|A     |
|     300|      0|    3|A     |
|     200|      1|    2|A     |

Separating the data into tidy representation can be done as well, but you may want to look at how you've entered the data in the first place. That's it for the tidyness, but for an extensive reading you can look at http://vita.had.co.nz/papers/tidy-data.pdf.

Thanks to the comment of Gregor, here is a way to convert your dataframe into a more tidy representation (using the package tidyr)

library(tidyr)


df <- read.table(header = TRUE, text = 'Project GenderA RankA GenderB RankB GenderC RankC
             100 1 3 0 1 1 2
             200 1 2 1 2 NA NA
             300 0 3 1 3 0 2
             400 0 1 NA NA NA NA
             500 1 1 0 2 1 1')

df %>% gather(key = key, value = value, -Project) %>% separate(key, into = c("variable", "order"), sep = -2) %>% spread(key = variable, value = value)

Out:

| Project|order | Gender| Rank|
|-------:|:-----|------:|----:|
|     100|A     |      1|    3|
|     100|B     |      0|    1|
|     100|C     |      1|    2|
|     200|A     |      1|    2|
|     200|B     |      1|    2|
|     200|C     |     NA|   NA|
|     300|A     |      0|    3|
|     300|B     |      1|    3|
|     300|C     |      0|    2|
|     400|A     |      0|    1|
|     400|B     |     NA|   NA|
|     400|C     |     NA|   NA|
|     500|A     |      1|    1|
|     500|B     |      0|    2|
|     500|C     |      1|    1|

You can then feed this to the formula in the first part to get your solution.

like image 128
Gijs Avatar answered Nov 07 '22 02:11

Gijs