Sum columns based on a string match in other columns

Question

I have a data fame with a structure similar to this:

df <- data.frame("id1"=c("A", "A", NA, "B", "A", "A"),
                 "val1"=c(1, 2, NA, 2, 3, 1),
                 "id2"=c("A", "B", "B", "B", "B", "A"),
                 "val2"=c(3, 2, 1, 3, 3, 2),
                 "id3" = c("C", "B", "C", "C", "D", "A"),
                 "val3" = c(2, 1, 2, 2, 1, 1))

> df
   id1 val1 id2 val2 id3 val3
1    A    1   A    3   C    2
2    A    2   B    2   B    1
3 <NA>   NA   B    1   C    2
4    B    2   B    3   C    2
5    A    3   B    3   D    1
6    A    1   A    2   A    1

What I would like to do is, per row, "merge" those id columns where there is a string match, and sum the corresponding val (or value) columns. I hope the request is clear, but in the end I would like an output that looks like:

df_new <- data.frame("id1"=c("A", "A", "B", "B", "A", "A"),
                     "val1"=c(4, 2, 1, 5, 3, 4),
                     "id2"=c("C", "B", "C", "C", "B", NA),
                     "val2"=c(2, 3, 2, 2, 3, NA),
                     "id3"=c(NA, NA, NA, NA, "D", NA),
                     "val3"=c(NA, NA,NA, NA, 1, NA))

> df_new
  id1 val1  id2 val2  id3 val3
1   A    4    C    2 <NA>   NA
2   A    2    B    3 <NA>   NA
3   B    1    C    2 <NA>   NA
4   B    5    C    2 <NA>   NA
5   A    3    B    3    D    1
6   A    4 <NA>   NA <NA>   NA

I tried doing this with a series of ifelse() statements but in the end got very confused. There must be a cleaner way to accomplish this. Thank you in advance for any help!

Maël · Accepted Answer

Here's a pipeline in tidyverse:

library(dplyr)
library(tidyr)

df |>
  mutate(rown = row_number()) |> 
  pivot_longer(-rown, names_pattern = "(id|val)\d+", 
               names_to = ".value", values_drop_na = TRUE) |> 
  summarise(val = sum(val), .by = c(rown, id)) |> 
  mutate(number = row_number(), .by = rown) |> 
  pivot_wider(names_from = "number", values_from = c("id", "val"),
              names_vary = "slowest", names_sep = "") |>
  select(-rown)

#   id1 val1  id2 val2  id3 val3
# 1   A    4    C    2 <NA>   NA
# 2   A    2    B    3 <NA>   NA
# 3   B    1    C    2 <NA>   NA
# 4   B    5    C    2 <NA>   NA
# 5   A    3    B    3    D    1
# 6   A    4 <NA>   NA <NA>   NA

Explanation: The idea is to convert to long to get a tidy format, which makes it easier to manipulate data in general, and especially in the tidyverse. This is why there is pivot_longer, and then pivot_wider (to get to the original wide format). In between, I create a sum by rown and id, and then create a number variable based on the row_number with rown groups, which will be useful to sort correctly in pivot_wider (with names_vary = "slowest").

Sum columns based on a string match in other columns

Tags:

r

dplyr

tidyr

tlmoore

1 Answers

Maël

Recent Activity

Donate For Us

Sum columns based on a string match in other columns

Tags:

r

dplyr

tidyr

tlmoore

1 Answers

Maël

Related questions

Recent Activity

Donate For Us