Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sum columns based on a string match in other columns

Tags:

r

dplyr

tidyr

I have a data fame with a structure similar to this:

df <- data.frame("id1"=c("A", "A", NA, "B", "A", "A"),
                 "val1"=c(1, 2, NA, 2, 3, 1),
                 "id2"=c("A", "B", "B", "B", "B", "A"),
                 "val2"=c(3, 2, 1, 3, 3, 2),
                 "id3" = c("C", "B", "C", "C", "D", "A"),
                 "val3" = c(2, 1, 2, 2, 1, 1))

> df
   id1 val1 id2 val2 id3 val3
1    A    1   A    3   C    2
2    A    2   B    2   B    1
3 <NA>   NA   B    1   C    2
4    B    2   B    3   C    2
5    A    3   B    3   D    1
6    A    1   A    2   A    1

What I would like to do is, per row, "merge" those id columns where there is a string match, and sum the corresponding val (or value) columns. I hope the request is clear, but in the end I would like an output that looks like:

df_new <- data.frame("id1"=c("A", "A", "B", "B", "A", "A"),
                     "val1"=c(4, 2, 1, 5, 3, 4),
                     "id2"=c("C", "B", "C", "C", "B", NA),
                     "val2"=c(2, 3, 2, 2, 3, NA),
                     "id3"=c(NA, NA, NA, NA, "D", NA),
                     "val3"=c(NA, NA,NA, NA, 1, NA))

> df_new
  id1 val1  id2 val2  id3 val3
1   A    4    C    2 <NA>   NA
2   A    2    B    3 <NA>   NA
3   B    1    C    2 <NA>   NA
4   B    5    C    2 <NA>   NA
5   A    3    B    3    D    1
6   A    4 <NA>   NA <NA>   NA

I tried doing this with a series of ifelse() statements but in the end got very confused. There must be a cleaner way to accomplish this. Thank you in advance for any help!

like image 710
tlmoore Avatar asked Oct 17 '25 21:10

tlmoore


1 Answers

Here's a pipeline in tidyverse:

library(dplyr)
library(tidyr)

df |>
  mutate(rown = row_number()) |> 
  pivot_longer(-rown, names_pattern = "(id|val)\\d+", 
               names_to = ".value", values_drop_na = TRUE) |> 
  summarise(val = sum(val), .by = c(rown, id)) |> 
  mutate(number = row_number(), .by = rown) |> 
  pivot_wider(names_from = "number", values_from = c("id", "val"),
              names_vary = "slowest", names_sep = "") |>
  select(-rown)

#   id1 val1  id2 val2  id3 val3
# 1   A    4    C    2 <NA>   NA
# 2   A    2    B    3 <NA>   NA
# 3   B    1    C    2 <NA>   NA
# 4   B    5    C    2 <NA>   NA
# 5   A    3    B    3    D    1
# 6   A    4 <NA>   NA <NA>   NA

Explanation: The idea is to convert to long to get a tidy format, which makes it easier to manipulate data in general, and especially in the tidyverse. This is why there is pivot_longer, and then pivot_wider (to get to the original wide format). In between, I create a sum by rown and id, and then create a number variable based on the row_number with rown groups, which will be useful to sort correctly in pivot_wider (with names_vary = "slowest").

like image 73
Maël Avatar answered Oct 19 '25 13:10

Maël



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!