In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values). One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36). Data <pre class="prettyprint"><code>df <- structure( list( Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", "Timothy", "Grant", "Chloe", "Judy", "Justin"), ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 98432, 19487, 94029), Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20), Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30), Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37), Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41) ), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame") ) </code></pre> I know that I can use <code>pivot_longer</code> to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with <code>tidyverse</code>, rather than having to pivot back and forth. What I Tried <pre class="prettyprint"><code>library(tidyverse) df %>% pivot_longer(-c(Student, ID)) %>% group_by(Student, ID) %>% arrange(desc(value), .by_group = TRUE) %>% slice_head(n = 2) %>% pivot_wider(names_from = name, values_from = value) %>% ungroup() %>% mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE)) </code></pre> I also know that if I wanted to sum all the columns on each row, then I could use <code>rowSums</code>, as I made use of above. However, I am unsure how to do <code>rowSums</code> of just the top 2 values in the 4 quiz columns. Expected Output <pre class="prettyprint"><code># A tibble: 10 × 7 Student ID Quiz2 Quiz3 Quiz1 Quiz4 Total <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Aaron 30016 42 36 NA NA 78 2 Charlotte 61755 NA NA 41 39 80 3 Chloe 98432 49 NA 42 NA 91 4 Grant 34096 48 44 NA NA 92 5 James 87311 NA 36 NA 43 79 6 Judy 19487 NA 42 NA 41 83 7 Justin 94029 NA 37 NA 41 78 8 Katie 55323 NA 32 NA 46 78 9 Olivia 94839 NA 43 NA 40 83 10 Timothy 38209 38 38 NA NA 76 </code></pre>

with base R - select just the quiz result columns and you can treat it like a matrix. apply sort in decreasing order, subsetting first two elements, and then use colSums. <pre class="prettyprint lang-r prettyprint-override"><code>df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2])) df #> # A tibble: 10 × 7 #> Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Aaron 30016 31 42 36 36 78 #> 2 James 87311 25 33 36 43 79 #> 3 Charlotte 61755 41 34 34 39 80 #> 4 Katie 55323 10 22 32 46 78 #> 5 Olivia 94839 35 23 43 40 83 #> 6 Timothy 38209 19 38 38 38 76 #> 7 Grant 34096 27 48 44 43 92 #> 8 Chloe 98432 42 49 42 35 91 #> 9 Judy 19487 15 23 42 41 83 #> 10 Justin 94029 20 30 37 41 78 </code></pre>

You do not have to do <code>pivot_wider</code>. Note that the longer format is the tidy format. Just do <code>pivot_longer</code> and <code>left_join</code>: <pre class="prettyprint"><code>df %>% left_join(pivot_longer(., -c(Student, ID)) %>% group_by(Student, ID) %>% summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop')) # A tibble: 10 x 7 Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Aaron 30016 31 42 36 36 78 2 James 87311 25 33 36 43 79 3 Charlotte 61755 41 34 34 39 80 4 Katie 55323 10 22 32 46 78 5 Olivia 94839 35 23 43 40 83 6 Timothy 38209 19 38 38 38 76 7 Grant 34096 27 48 44 43 92 8 Chloe 98432 42 49 42 35 91 9 Judy 19487 15 23 42 41 83 10 Justin 94029 20 30 37 41 78 </code></pre>

Yet another solution, based on <code>tidyverse</code>: <pre class="prettyprint lang-r prettyprint-override"><code>library(tidyverse) df %>% rowwise %>% mutate(Quiz = list(c_across(starts_with("Quiz")) * if_else(rank(c_across(starts_with("Quiz")),ties.method="last")>=3,1,NA_real_)), across(matches("\\d$"), ~ NULL), total = sum(Quiz, na.rm = T)) %>% unnest_wider(Quiz, names_sep = "") #> # A tibble: 10 × 7 #> Student ID Quiz1 Quiz2 Quiz3 Quiz4 total #> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 Aaron 30016 NA 42 36 NA 78 #> 2 James 87311 NA NA 36 43 79 #> 3 Charlotte 61755 41 NA NA 39 80 #> 4 Katie 55323 NA NA 32 46 78 #> 5 Olivia 94839 NA NA 43 40 83 #> 6 Timothy 38209 NA 38 38 NA 76 #> 7 Grant 34096 NA 48 44 NA 92 #> 8 Chloe 98432 42 49 NA NA 91 #> 9 Judy 19487 NA NA 42 41 83 #> 10 Justin 94029 NA NA 37 41 78 </code></pre>

As @akrun provided above, <code>collapse</code> is another efficient possibility. <code>radixorder</code> provides an integer ordering vector, and only the top 2 values in each row are kept, while the others are replaced with <code>NA</code>. Then, <code>rowSums</code> is used to get the totals for each row. <pre class="prettyprint"><code>library(collapse) ftransform(gvr(df, "Student|ID"), dapply( gvr(df, "^Quiz"), MARGIN = 1, FUN = function(x) replace(x, radixorder(radixorder(x)) %in% 1:2, NA) )) %>% ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE)) </code></pre> Output <pre class="prettyprint"><code># A tibble: 10 × 7 Student ID Quiz1 Quiz2 Quiz3 Quiz4 Total * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Aaron 30016 NA 42 NA 36 78 2 James 87311 NA NA 36 43 79 3 Charlotte 61755 41 NA NA 39 80 4 Katie 55323 NA NA 32 46 78 5 Olivia 94839 NA NA 43 40 83 6 Timothy 38209 NA NA 38 38 76 7 Grant 34096 NA 48 44 NA 92 8 Chloe 98432 NA 49 42 NA 91 9 Judy 19487 NA NA 42 41 83 10 Justin 94029 NA NA 37 41 78 </code></pre>

(A bit messy) Base R Solution: <pre class="prettyprint"><code># Store the names of quiz columns as a vector: quiz_colnames => character vector quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE) # Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)] # Store an Idx based on the ID: Idx => integer vector: Idx <- with(df, as.integer(factor(ID, levels = unique(ID)))) # Split-Apply-Combine to calculate the top 2 quizes: res => data.frame res <- data.frame( do.call( rbind, lapply( with( df, split( df, Idx ) ), function(x){ # Extract the top 2 quiz vectors: top_2_quizes => named integer vector top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2) # Calculate the quiz columns not used: remainder_quiz_cols => character vector remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))] # Nullify the remaining quizes: x => data.frame x[, remainder_quiz_cols] <- NA_integer_ # Calculate the resulting data.frame: data.frame => env transform( cbind( x[,non_quiz_names], x[,names(top_2_quizes)], x[,remainder_quiz_cols] ), Total = sum(top_2_quizes) )[,c(non_quiz_names, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")] } ) ), row.names = NULL, stringsAsFactors = FALSE ) </code></pre>

How to summarize the top n values across multiple columns row wise?

Tags:

r

tidyverse

In my dataframe, I have multiple columns with student grades. I want to sum the "Quiz" columns (e.g., Quiz1, Quiz2). However, I only want to sum the top 2 values, and ignore the others. I want to create a new column with the total (i.e., the sum of the top 2 values).

One issue is that some students have grades that tie for the top 2 grades in a given row. For example, Aaron has a high score of 42, but then there are two scores that tie for the second highest (i.e., 36).

Data

Click to copy

df <- 
  structure(
  list(
    Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                "Timothy", "Grant", "Chloe", "Judy", "Justin"),
    ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
           98432, 19487, 94029),
    Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
    Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
    Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
    Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
  ),
  row.names = c(NA, -10L),
  class = c("tbl_df", "tbl", "data.frame")
)

I know that I can use pivot_longer to do this, which allows me to arrange by group, then take the top 2 values for each student. This works fine, but I would like a more efficient way with tidyverse, rather than having to pivot back and forth.

What I Tried

Click to copy

library(tidyverse)

df %>%
  pivot_longer(-c(Student, ID)) %>%
  group_by(Student, ID) %>%
  arrange(desc(value), .by_group = TRUE) %>%
  slice_head(n = 2) %>%
  pivot_wider(names_from = name, values_from = value) %>%
  ungroup() %>%
  mutate(Total = rowSums(select(., starts_with("Quiz")), na.rm = TRUE))

I also know that if I wanted to sum all the columns on each row, then I could use rowSums, as I made use of above. However, I am unsure how to do rowSums of just the top 2 values in the 4 quiz columns.

Expected Output

Click to copy

# A tibble: 10 × 7
   Student      ID Quiz2 Quiz3 Quiz1 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    42    36    NA    NA    78
 2 Charlotte 61755    NA    NA    41    39    80
 3 Chloe     98432    49    NA    42    NA    91
 4 Grant     34096    48    44    NA    NA    92
 5 James     87311    NA    36    NA    43    79
 6 Judy      19487    NA    42    NA    41    83
 7 Justin    94029    NA    37    NA    41    78
 8 Katie     55323    NA    32    NA    46    78
 9 Olivia    94839    NA    43    NA    40    83
10 Timothy   38209    38    38    NA    NA    76

336

asked Dec 12 '21 22:12

AndrewGB

Video Answer

7 Answers

Based on this StackOverflow answer.

Click to copy

library(tidyverse)

df <- 
  structure(
    list(
      Student = c("Aaron", "James", "Charlotte", "Katie", "Olivia", 
                  "Timothy", "Grant", "Chloe", "Judy", "Justin"),
      ID = c(30016, 87311, 61755, 55323, 94839, 38209, 34096, 
             98432, 19487, 94029),
      Quiz1 = c(31, 25, 41, 10, 35, 19, 27, 42, 15, 20),
      Quiz2 = c(42, 33, 34, 22, 23, 38, 48, 49, 23, 30),
      Quiz3 = c(36, 36, 34, 32, 43, 38, 44, 42, 42, 37),
      Quiz4 = c(36, 43, 39, 46, 40, 38, 43, 35, 41, 41)
    ),
    row.names = c(NA, -10L),
    class = c("tbl_df", "tbl", "data.frame")
  )

df %>%
  rowwise() %>% 
  mutate(Quiz_Total = sum(sort(c(Quiz1,Quiz2,Quiz3,Quiz4), decreasing = TRUE)[1:2])) %>% 
  ungroup()
#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Quiz_Total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
#>  1 Aaron     30016    31    42    36    36         78
#>  2 James     87311    25    33    36    43         79
#>  3 Charlotte 61755    41    34    34    39         80
#>  4 Katie     55323    10    22    32    46         78
#>  5 Olivia    94839    35    23    43    40         83
#>  6 Timothy   38209    19    38    38    38         76
#>  7 Grant     34096    27    48    44    43         92
#>  8 Chloe     98432    42    49    42    35         91
#>  9 Judy      19487    15    23    42    41         83
#> 10 Justin    94029    20    30    37    41         78

189

answered Oct 19 '22 20:10

jrcalabrese

with base R - select just the quiz result columns and you can treat it like a matrix. apply sort in decreasing order, subsetting first two elements, and then use colSums.

Click to copy

df$Total <- colSums(apply(df[grepl("Quiz", names(df))], 1, function(x) sort(x, decreasing = TRUE)[1:2]))

df
#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Aaron     30016    31    42    36    36    78
#>  2 James     87311    25    33    36    43    79
#>  3 Charlotte 61755    41    34    34    39    80
#>  4 Katie     55323    10    22    32    46    78
#>  5 Olivia    94839    35    23    43    40    83
#>  6 Timothy   38209    19    38    38    38    76
#>  7 Grant     34096    27    48    44    43    92
#>  8 Chloe     98432    42    49    42    35    91
#>  9 Judy      19487    15    23    42    41    83
#> 10 Justin    94029    20    30    37    41    78

answered Oct 19 '22 20:10

tjebo

You do not have to do pivot_wider. Note that the longer format is the tidy format. Just do pivot_longer and left_join:

Click to copy

df %>% 
  left_join(pivot_longer(., -c(Student, ID)) %>%
  group_by(Student, ID) %>%
  summarise(Total = sum(sort(value, TRUE)[1:2]), .groups = 'drop'))

# A tibble: 10 x 7
   Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
   <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    31    42    36    36    78
 2 James     87311    25    33    36    43    79
 3 Charlotte 61755    41    34    34    39    80
 4 Katie     55323    10    22    32    46    78
 5 Olivia    94839    35    23    43    40    83
 6 Timothy   38209    19    38    38    38    76
 7 Grant     34096    27    48    44    43    92
 8 Chloe     98432    42    49    42    35    91
 9 Judy      19487    15    23    42    41    83
10 Justin    94029    20    30    37    41    78

answered Oct 19 '22 18:10

KU99

Yet another solution, based on tidyverse:

Click to copy

library(tidyverse)

df %>% 
  rowwise %>% 
  mutate(Quiz = list(c_across(starts_with("Quiz")) * 
   if_else(rank(c_across(starts_with("Quiz")),ties.method="last")>=3,1,NA_real_)),
    across(matches("\\d$"), ~ NULL), total = sum(Quiz, na.rm = T)) %>%
  unnest_wider(Quiz, names_sep = "") 

#> # A tibble: 10 × 7
#>    Student      ID Quiz1 Quiz2 Quiz3 Quiz4 total
#>    <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Aaron     30016    NA    42    36    NA    78
#>  2 James     87311    NA    NA    36    43    79
#>  3 Charlotte 61755    41    NA    NA    39    80
#>  4 Katie     55323    NA    NA    32    46    78
#>  5 Olivia    94839    NA    NA    43    40    83
#>  6 Timothy   38209    NA    38    38    NA    76
#>  7 Grant     34096    NA    48    44    NA    92
#>  8 Chloe     98432    42    49    NA    NA    91
#>  9 Judy      19487    NA    NA    42    41    83
#> 10 Justin    94029    NA    NA    37    41    78

answered Oct 19 '22 20:10

PaulS

As @akrun provided above, collapse is another efficient possibility. radixorder provides an integer ordering vector, and only the top 2 values in each row are kept, while the others are replaced with NA. Then, rowSums is used to get the totals for each row.

Click to copy

library(collapse)

ftransform(gvr(df, "Student|ID"),
           dapply(
             gvr(df, "^Quiz"),
             MARGIN = 1,
             FUN = function(x)
               replace(x, radixorder(radixorder(x)) %in% 1:2, NA)
           )) %>%
  ftransform(Total = rowSums(gvr(., "^Quiz"), na.rm = TRUE))

Output

Click to copy

# A tibble: 10 × 7
   Student      ID Quiz1 Quiz2 Quiz3 Quiz4 Total
 * <chr>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Aaron     30016    NA    42    NA    36    78
 2 James     87311    NA    NA    36    43    79
 3 Charlotte 61755    41    NA    NA    39    80
 4 Katie     55323    NA    NA    32    46    78
 5 Olivia    94839    NA    NA    43    40    83
 6 Timothy   38209    NA    NA    38    38    76
 7 Grant     34096    NA    48    44    NA    92
 8 Chloe     98432    NA    49    42    NA    91
 9 Judy      19487    NA    NA    42    41    83
10 Justin    94029    NA    NA    37    41    78

answered Oct 19 '22 18:10

AndrewGB

(A bit messy) Base R Solution:

Click to copy

# Store the names of quiz columns as a vector: quiz_colnames => character vector
quiz_colnames <- grep("Quiz\\d+", names(df), value = TRUE)

# Store the names of the non-quiz columns as a vector: non_quiz_colnames => character vector
non_quiz_colnames <- names(df)[!(names(df) %in% quiz_colnames)]

# Store an Idx based on the ID: Idx => integer vector:
Idx <- with(df, as.integer(factor(ID, levels = unique(ID))))

# Split-Apply-Combine to calculate the top 2 quizes: res => data.frame
res <- data.frame(
  do.call(
    rbind,
    lapply(
      with(
        df,
        split(
          df,
          Idx 
        )
      ),
      function(x){
        # Extract the top 2 quiz vectors: top_2_quizes => named integer vector
        top_2_quizes <- head(sort(unlist(x[,quiz_colnames]), decreasing = TRUE), 2)
        # Calculate the quiz columns not used: remainder_quiz_cols => character vector
        remainder_quiz_cols <- quiz_colnames[!(quiz_colnames %in% names(top_2_quizes))]
        # Nullify the remaining quizes: x => data.frame 
        x[, remainder_quiz_cols] <- NA_integer_
        # Calculate the resulting data.frame: data.frame => env 
        transform(
          cbind(
            x[,non_quiz_names], 
            x[,names(top_2_quizes)],
            x[,remainder_quiz_cols]
          ),
          Total = sum(top_2_quizes)
        )[,c(non_quiz_names, "Quiz2", "Quiz3", "Quiz1", "Quiz4", "Total")]
      }
    )
  ),
  row.names = NULL,
  stringsAsFactors = FALSE
)

answered Oct 19 '22 20:10

hello_friend

Try this base R to also get the NAs

Click to copy

cbind( df[,1:2], t( sapply( seq_along( 1:nrow( df ) ), function(x){
  ord <- order( unlist( df[x,3:6] ) )[1:2]; arow <- df[x,3:6]; 
  arow[ord] <- NA; ttl <- rowSums( arow[-ord], na.rm=T );
  cbind( arow,Total=ttl ) } ) ) )

     Student    ID Quiz1 Quiz2 Quiz3 Quiz4 Total
1      Aaron 30016    NA    42    NA    36    78
2      James 87311    NA    NA    36    43    79
3  Charlotte 61755    41    NA    NA    39    80
4      Katie 55323    NA    NA    32    46    78
5     Olivia 94839    NA    NA    43    40    83
6    Timothy 38209    NA    NA    38    38    76
7      Grant 34096    NA    48    44    NA    92
8      Chloe 98432    NA    49    42    NA    91
9       Judy 19487    NA    NA    42    41    83
10    Justin 94029    NA    NA    37    41    78

answered Oct 19 '22 19:10

Andre Wildberg

Related questions
                            
                                How do I set width of y-axis labels in ggplot2
                            
                                type/origin of R's 'as' function
                            
                                R package build failing on Unix machines due to missing GSL - GNU Scientific Library
                            
                                Carry / use value from previous group
                            
                                How to reverse a sentence in R?
                            
                                Embed a tweet in a blogdown post
                            
                                Double precision (64-bit) representation of numeric value in R (sign, exponent, significand)
                            
                                using purrr to map dplyr::select
                            
                                Months to integer R
                            
                                ggplot2: Creating themed title, subtitle with cowplot
                            
                                identify consecutively overlapping segments in R
                            
                                String as formula
                            
                                Setting up docker image with R and SQL server drivers
                            
                                Error code 100 fitting exp distribution using fitdist in r
                            
                                Error when trying to write DataFrame to feather. Does feather support list columns?
                            
                                R regex - extract words beginning with @ symbol
                            
                                Filtering a vector on condition
                            
                                A Regex to remove digits except for words starting with #
                            
                                How to pipe SQL into R's dplyr?
                            
                                Triple exclamation marks on R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to summarize the top n values across multiple columns row wise?

Tags:

r

tidyverse

AndrewGB

People also ask

Video Answer

7 Answers

jrcalabrese

tjebo

KU99

PaulS

AndrewGB

hello_friend

Andre Wildberg

Recent Activity

Donate For Us