I have data on multiple projects, where each project is composed of a team of individuals. I have information on their rank and their gender, like so:
df <- read.table(header = TRUE, text = 'Project GenderA RankA GenderB RankB GenderC RankC
100 1 3 0 1 1 2
200 1 2 1 2 NA NA
300 0 3 1 3 0 2
400 0 1 NA NA NA NA
500 1 1 0 2 1 1')
For each project, I want R to create a new variable with the highest rank on the team and that person's gender.
Project GenderA RankA GenderB RankB GenderC RankC HighGen HighRank
100 1 3 0 1 1 2 1 3
I can obtain the highest rank with pmax:
df<-cbind(df,
HighRank = pmax(df$RankA, df$RankB, df$RankC))
But this works only for projects that are complete (that is, with no NAs), and is not associated with the gender column. What is the best way to do both?
This is an opinionated answer, but the reason that this is hard to do in one go is that your data is not tidy. That means in this case that you want each row to have one observation. I give the 'tidy' representation and a solution using the dplyr
package:
library(dplyr)
df <- read.table(header = TRUE, text = 'Project Gender Rank order
100 1 3 A
100 0 1 B
100 1 2 C
200 1 2 A
200 1 2 B
200 NA NA C
300 0 3 A
300 1 3 B
300 0 2 C')
df %>% group_by(Project) %>% arrange(-Rank) %>% slice(1) %>% ungroup()
To explain that last line, you group by project, sort all scores in those projects by rank, then filter for the first, and finally, for good measure, you ungroup, meaning you 'forget' the grouping. Your output is then the highest scoring record for each project, including it's score, gender. Note that I also put in a new variable 'order', to reflect the fact that data was scored under 'rankA' or 'rankB' etcetera.
Out:
| Project| Gender| Rank|order |
|-------:|------:|----:|:-----|
| 100| 1| 3|A |
| 300| 0| 3|A |
| 200| 1| 2|A |
Separating the data into tidy representation can be done as well, but you may want to look at how you've entered the data in the first place. That's it for the tidyness, but for an extensive reading you can look at http://vita.had.co.nz/papers/tidy-data.pdf.
Thanks to the comment of Gregor, here is a way to convert your dataframe into a more tidy representation (using the package tidyr
)
library(tidyr)
df <- read.table(header = TRUE, text = 'Project GenderA RankA GenderB RankB GenderC RankC
100 1 3 0 1 1 2
200 1 2 1 2 NA NA
300 0 3 1 3 0 2
400 0 1 NA NA NA NA
500 1 1 0 2 1 1')
df %>% gather(key = key, value = value, -Project) %>% separate(key, into = c("variable", "order"), sep = -2) %>% spread(key = variable, value = value)
Out:
| Project|order | Gender| Rank|
|-------:|:-----|------:|----:|
| 100|A | 1| 3|
| 100|B | 0| 1|
| 100|C | 1| 2|
| 200|A | 1| 2|
| 200|B | 1| 2|
| 200|C | NA| NA|
| 300|A | 0| 3|
| 300|B | 1| 3|
| 300|C | 0| 2|
| 400|A | 0| 1|
| 400|B | NA| NA|
| 400|C | NA| NA|
| 500|A | 1| 1|
| 500|B | 0| 2|
| 500|C | 1| 1|
You can then feed this to the formula in the first part to get your solution.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With