Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Max value of for each column for distinct rows of data frame in r

Let’s say there is an ordered df with an ID column, and other columns containing numeral data, ordered by the last column.

ID <- c(123, 142, 21, 562, 36, 721, 847, 321)
A <- c(96, 83, 73, 47, 88, 65, 72, 67)
B <- c(72, 69, 88, 75, 63, 89, 48, 80)
C <- c(95, 94, 94, 94, 65, 81, 75, 75)
D <- c(63, 88, 89, 88, 89, 79, 88, 79)
Rating <- c(97, 95, 92, 87, 85, 83, 79, 77)
df <- data.frame(ID, A, B, C, D, Rating)
df
#   ID  A  B  C  D Rating
#1 123 96 72 95 63     97
#2 142 83 69 94 88     95
#3  21 73 88 94 89     92
#4 562 47 75 94 88     87
#5  36 88 63 65 89     85
#6 721 65 89 81 79     83
#7 847 72 48 75 88     79
#8 321 67 80 75 79     77

The aim is to get the max value for each group/column, with its ID, and each pair needs to be from a distinct row (unique ID). For two IDs with the same value for a column, pick the one with the better Rating.

What I did is use the apply() function to get the max from each column, extract the IDs that have that value, and join them all into a data frame. Because I was still missing an ID for the 4th column, I used an anti join to take out the previous IDs and repeated the process to get this data frame:

my_max <- data.frame(apply(df, 2, max))
A2 <- df[which(df$A == my_max[2,1]),]%>% dplyr::select(ID, A)
B2 <- df[which(df$B == my_max[3,1]),]%>% dplyr::select(ID, B)
C2 <- df[which(df$C == my_max[4,1]),]%>% dplyr::select(ID, C)
D2 <- df[which(df$D == my_max[5,1]),]%>% dplyr::select(ID, D)
all <- full_join(A2, B2, by='ID') %>% full_join(C2, by='ID') %>% full_join(D2, by='ID')
all <- all[-c(4),]
df <- anti_join(df, all, by='ID')
my_max <- data.frame(apply(df, 2, max))
C2 <- df[which(df$C == my_max[4,1]),]%>% dplyr::select(ID, C)
all <- all %>% full_join(C2, by='ID')
all <- all[-c(5),-c(4)]

To finally give me:

all
#   ID  A  B  D C.y
#1 123 96 NA NA  NA
#2 721 NA 89 NA  NA
#3  21 NA NA 89  NA
#4 142 NA NA NA  94

Is there a more clean or concise/efficient way of doing this? Not necessarily the same way, perhaps just the ID and role like:

#   ID  Group
#1 123    A
#2 721    B
#3 142    C
#4 21     D
like image 455
Ibrahim B. Avatar asked Jan 01 '23 01:01

Ibrahim B.


1 Answers

I see that some solutions do not handle duplicated IDs. For instance, we have ID 123 for both groups A and C.

To get an output similar to the final results in your question, another solution that handles duplicated IDs is the following

# initialization
variables <- c("A", "B", "C", "D")
df_max <- data.frame(ID = numeric(length(variables)), Group = variables)

for(column in variables){
  temp_id <- df %>% 
    filter(!(ID %in% df_max$ID)) %>% 
    arrange(desc(!!rlang::sym(column)), desc(Rating)) %>% 
    slice(1) %>% 
    select(ID) %>%
    as.numeric(ID)
  df_max[df_max$Group == column, "ID"] <- temp_id
}

Basically, the filter step makes sure that we do not consider the already selected IDs.

Output

# > df_max
#
#    ID Group
# 1 123     A
# 2 721     B
# 3 142     C
# 4  21     D
like image 182
Ric S Avatar answered Jan 13 '23 20:01

Ric S