
How to summarize the top n values across multiple columns row wise?

Tags: r, tidyverse

In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values).

One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36).
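To make the tie case concrete, here is a minimal base R sketch using Aaron's four scores from the data in this question. The names `aaron`, `top2`, and `total` are only illustrative. Sorting in decreasing order and keeping the first two elements gives the same total whichever of the tied 36s ends up being kept:

```r
# Aaron's scores for Quiz1..Quiz4 (copied from the data in this question)
aaron <- c(Quiz1 = 31, Quiz2 = 42, Quiz3 = 36, Quiz4 = 36)

# Sort from highest to lowest and keep the first two values
top2 <- head(sort(aaron, decreasing = TRUE), 2)

# The tie for second place does not change the total
total <- sum(top2)
total
#> [1] 78
```

So a tie only matters when deciding which column to display, not for the sum itself.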

Data

df <- 
  structure(
  list(
    Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                "Timothy", "Grant", "Chloe", "Judy", "Justin"),
    ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
           98432, 19487, 94029),
    Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
    Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
    Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
    Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
  ),
  row.names = c(NA, -10L),
  class = c("tbl_df", "tbl", "data.frame")
)

I know that I can use pivot_longer to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with tidyverse, rather than having to pivot back and forth.

What I Tried

library(tidyverse)

df %>%
  pivot_longer(-c(Student, ID)) %>%
  group_by(Student, ID) %>%
  arrange(desc(value), .by_group = TRUE) %>%
  slice_head(n = 2) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  ungroup() %>%
  mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))

I also know that if I wanted to sum all the columns on each row, then I could use rowSums, as I made use of above. However, I am unsure how to do rowSums of just the top 2 values in the 4 quiz columns.

Expected Output

# A tibble: 10 × 7
   Student      ID Quiz2 Quiz3 Quiz1 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    42    36    NA    NA    78
 2 Charlotte 61755    NA    NA    41    39    80
 3 Chloe     98432    49    NA    42    NA    91
 4 Grant     34096    48    44    NA    NA    92
 5 James     87311    NA    36    NA    43    79
 6 Judy      19487    NA    42    NA    41    83
 7 Justin    94029    NA    37    NA    41    78
 8 Katie     55323    NA    32    NA    46    78
 9 Olivia    94839    NA    43    NA    40    83
10 Timothy   38209    38    38    NA    NA    76
AndrewGB asked Dec 12 '21 22:12




7 Answers

Based on this StackOverflow answer.

library(tidyverse)

df <- 
  structure(
    list(
      Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                  "Timothy", "Grant", "Chloe", "Judy", "Justin"),
      ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
             98432, 19487, 94029),
      Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
      Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
      Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
      Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
    ),
    row.names = c(NA, -10L),
    class = c("tbl_df", "tbl", "data.frame")
  )

df %>%
  rowwise() %>% 
  mutate(Quiz_Total = sum(sort(c(Quiz1,Quiz2,Quiz3,Quiz4), decreasing = TRUE)[1:2])) %>% 
  ungroup()
#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
#>  1 Aaron     30016    31    42    36    36         78
#>  2 James     87311    25    33    36    43         79
#>  3 Charlotte 61755    41    34    34    39         80
#>  4 Katie     55323    10    22    32    46         78
#>  5 Olivia    94839    35    23    43    40         83
#>  6 Timothy   38209    19    38    38    38         76
#>  7 Grant     34096    27    48    44    43         92
#>  8 Chloe     98432    42    49    42    35         91
#>  9 Judy      19487    15    23    42    41         83
#> 10 Justin    94029    20    30    37    41         78
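One caveat with `sort(...)[1:2]`: if a row has fewer than two non-missing scores, indexing past the end yields `NA`, so the sum becomes `NA`. A minimal sketch of a more defensive variant follows. The helper `top_n_sum()` and the data frame `scores` are hypothetical and made up for illustration; they are not part of the answer above.

```r
# Hypothetical helper: sum of the n largest non-missing values in x
top_n_sum <- function(x, n = 2) {
  x <- x[!is.na(x)]                          # drop missing scores explicitly
  sum(head(sort(x, decreasing = TRUE), n))   # head() copes with fewer than n values
}

# Small made-up data frame; the third student has only one recorded score
scores <- data.frame(
  Quiz1 = c(31, 25, NA),
  Quiz2 = c(42, 33, 10),
  Quiz3 = c(36, 36, NA),
  Quiz4 = c(36, 43, NA)
)

# Apply the helper to every row of the quiz columns
quiz_cols <- grep("^Quiz", names(scores))
scores$Quiz_Total <- apply(scores[quiz_cols], 1, top_n_sum, n = 2)
scores$Quiz_Total
#> [1] 78 79 10
```

In the `rowwise()` pipeline above, the same helper could presumably be called as `top_n_sum(c_across(starts_with("Quiz")))`, which also avoids listing each quiz column by hand.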
jrcalabrese answered Oct 19 '22 20:10


With base R: select just the quiz columns and treat them like a matrix. Use apply to sort each row in decreasing order and keep the first two elements, then use colSums, because apply returns one column per row of the input.

df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))

df
#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Aaron     30016    31    42    36    36    78
#>  2 James     87311    25    33    36    43    79
#>  3 Charlotte 61755    41    34    34    39    80
#>  4 Katie     55323    10    22    32    46    78
#>  5 Olivia    94839    35    23    43    40    83
#>  6 Timothy   38209    19    38    38    38    76
#>  7 Grant     34096    27    48    44    43    92
#>  8 Chloe     98432    42    49    42    35    91
#>  9 Judy      19487    15    23    42    41    83
#> 10 Justin    94029    20    30    37    41    78
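The reason for `colSums` rather than `rowSums` may not be obvious: when the function passed to `apply(..., 1, ...)` returns a vector of length 2, the results are bound together as columns. A small sketch on a made-up matrix `m` (the first three students from the question's data) shows the shape:

```r
# Small matrix: 3 students (rows) by 4 quizzes (columns)
m <- matrix(
  c(31, 42, 36, 36,
    25, 33, 36, 43,
    41, 34, 34, 39),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Aaron", "James", "Charlotte"), paste0("Quiz", 1:4))
)

# For each row, keep the two largest values; apply() binds the results as columns
top2 <- apply(m, 1, function(x) sort(x, decreasing = TRUE)[1:2])
dim(top2)
#> [1] 2 3

# Each student is now a column, so the per-student total is a column sum
totals <- colSums(top2)
totals
#>     Aaron     James Charlotte 
#>        78        79        80 
```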
tjebo answered Oct 19 '22 20:10


You do not have to do pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:

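df %>% 
  left_join(
    # Compute each student's total from the long format...
    pivot_longer(., -c(Student, ID)) %>%
      group_by(Student, ID) %>%
      summarise(Total = sum(sort(value, decreasing = TRUE)[1:2]), .groups = "drop"),
    # ...then join it back onto the original wide data
    by = c("Student", "ID")
  )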

# A tibble: 10 x 7
   Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    31    42    36    36    78
 2 James     87311    25    33    36    43    79
 3 Charlotte 61755    41    34    34    39    80
 4 Katie     55323    10    22    32    46    78
 5 Olivia    94839    35    23    43    40    83
 6 Timothy   38209    19    38    38    38    76
 7 Grant     34096    27    48    44    43    92
 8 Chloe     98432    42    49    42    35    91
 9 Judy      19487    15    23    42    41    83
10 Justin    94029    20    30    37    41    78
KU99 answered Oct 19 '22 18:10


Yet another solution, based on tidyverse:

library(tidyverse)

df %>% 
  rowwise %>% 
  mutate(Quiz = list(c_across(starts_with("Quiz")) * 
   if_else(rank(c_across(starts_with("Quiz")),ties.method="last")>=3,1,NA_real_)),
    across(matches("\\d$"), ~ NULL), total = sum(Quiz, na.rm = T)) %>%
  unnest_wider(Quiz, names_sep = "") 

#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Aaron     30016    NA    42    36    NA    78
#>  2 James     87311    NA    NA    36    43    79
#>  3 Charlotte 61755    41    NA    NA    39    80
#>  4 Katie     55323    NA    NA    32    46    78
#>  5 Olivia    94839    NA    NA    43    40    83
#>  6 Timothy   38209    NA    38    38    NA    76
#>  7 Grant     34096    NA    48    44    NA    92
#>  8 Chloe     98432    42    49    NA    NA    91
#>  9 Judy      19487    NA    NA    42    41    83
#> 10 Justin    94029    NA    NA    37    41    78
PaulS answered Oct 19 '22 20:10


As @akrun provided above, collapse is another efficient possibility. radixorder provides an integer ordering vector, and only the top 2 values in each row are kept, while the others are replaced with NA. Then, rowSums is used to get the totals for each row.

library(collapse)

ftransform(gvr(df, "Student|ID"),
           dapply(
             gvr(df, "^Quiz"),
             MARGIN = 1,
             FUN = function(x)
               replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
           )) %>%
  ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))

Output

# A tibble: 10 × 7
   Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
 * <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    NA    42    NA    36    78
 2 James     87311    NA    NA    36    43    79
 3 Charlotte 61755    41    NA    NA    39    80
 4 Katie     55323    NA    NA    32    46    78
 5 Olivia    94839    NA    NA    43    40    83
 6 Timothy   38209    NA    NA    38    38    76
 7 Grant     34096    NA    48    44    NA    92
 8 Chloe     98432    NA    49    42    NA    91
 9 Judy      19487    NA    NA    42    41    83
10 Justin    94029    NA    NA    37    41    78
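The double-ordering step can be illustrated in base R, under the assumption that `radixorder` breaks ties by position in the same way `order` does here. Ordering an ordering vector turns it into ranks (1 = smallest), and the two lowest ranks are then blanked out. The names `x`, `ranks`, `kept`, and `total` are only illustrative:

```r
# Aaron's row from the data in the question
x <- c(Quiz1 = 31, Quiz2 = 42, Quiz3 = 36, Quiz4 = 36)

# order() gives the positions that would sort x; ordering those positions
# again converts them into ranks, with ties broken by position
ranks <- order(order(x))
ranks
#> [1] 1 4 2 3

# Replace the two lowest-ranked scores with NA, keeping the top 2 of 4
kept <- replace(x, ranks %in% 1:2, NA)
kept
#> Quiz1 Quiz2 Quiz3 Quiz4 
#>    NA    42    NA    36 

total <- sum(kept, na.rm = TRUE)
total
#> [1] 78
```

Note that `%in% 1:2` drops the two lowest scores, which is the same as keeping the top two only because there are exactly four quiz columns. With a different number of columns, a condition such as `ranks <= length(x) - 2` would be needed instead.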
AndrewGB answered Oct 19 '22 18:10


(A bit messy) Base R Solution:

# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)

# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]

# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))

# Split-Apply-Combine to calculate the top 2 quizzes: res => data.frame
res <- data.frame(
  do.call(
    rbind,
    lapply(
      with(
        df,
        split(
          df,
          Idx 
        )
      ),
      function(x){
        # Extract the top 2 quiz scores: top_2_quizes => named numeric vector
        top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
        # Calculate the quiz columns not used: remainder_quiz_cols => character vector
        remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
        # Nullify the remaining quizzes: x => data.frame 
        x[, remainder_quiz_cols] <- NA_integer_
        # Build the resulting one-row data.frame with the Total column 
        transform(
          cbind(
            x[,non_quiz_colnames], 
            x[,names(top_2_quizes)],
            x[,remainder_quiz_cols]
          ),
          Total = sum(top_2_quizes)
        )[,c(non_quiz_colnames, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
      }
    )
  ),
  row.names = NULL,
  stringsAsFactors = FALSE
)
hello_friend answered Oct 19 '22 20:10


Try this base R approach, which also sets the unused scores to NA:

cbind( df[,1:2], t( sapply( seq_along( 1:nrow( df ) ), function(x){
  ord <- order( unlist( df[x,3:6] ) )[1:2]; arow <- df[x,3:6]; 
  arow[ord] <- NA; ttl <- rowSums( arow[-ord], na.rm=T );
  cbind( arow,Total=ttl ) } ) ) )

     Student    ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1      Aaron 30016    NA    42    NA    36    78
2      James 87311    NA    NA    36    43    79
3  Charlotte 61755    41    NA    NA    39    80
4      Katie 55323    NA    NA    32    46    78
5     Olivia 94839    NA    NA    43    40    83
6    Timothy 38209    NA    NA    38    38    76
7      Grant 34096    NA    48    44    NA    92
8      Chloe 98432    NA    49    42    NA    91
9       Judy 19487    NA    NA    42    41    83
10    Justin 94029    NA    NA    37    41    78
Andre Wildberg answered Oct 19 '22 19:10