I want to use dplyr::coalesce
to find the first non-missing value between pairs of variables in a dataframe containing multiple pairs of variable. The goal is to create a new dataframe with now only one copy for each pair of variable (a coalesce variable without NA values).
Here is an example:
df <- data.frame(
A_1=c(NA, NA, 3, 4, 5),
A_2=c(1, 2, NA, NA, NA),
B_1=c(NA, NA, 13, 14, 15),
B_2=c(11, 12, NA, NA, NA))
Expected output:
A B
1 11
2 12
3 13
4 14
5 15
I am guessing a mix of dplyr::coalesce
with maybe dplyr::mutate_at
based on regular expression could be use but I am not sure how to do it. Is there a way to complete this task with the tidyverse grammar?
Thanks!
EDIT: thanks everyone for your answers! However, I should have included the naming convention for my variables to facilitate the transfer of your answers to my actual problem.. I am sorry about that. My variables are geochemistry variables named in two parts (name of chemical element plus name of core).
Example: Al_TAC4.25.275
where Al
is the element and TAC4.25.275
is the core. I want to coalesce the data from 3 different cores (second part of name) for each element (first part of name). I have 25 pairs of element to coalesce.
You can use the following methods to coalesce the values from multiple columns of a pandas DataFrame into one column: The following code shows how to coalesce the values in the points, assists, and rebounds columns into one column, using the first non-null value across the three columns as the coalesced value:
Our example data is a numeric vector with two NA values. Now, we can use the coalesce function to replace these NA values by a different value. In this example, we are replacing the NA values with the number 999: As you can see based on the output of the RStudio console, our updated vector contains the value 999 at each NA position.
Here’s the logic that was used to decide which value to place in the coalesce column: If the value in the assists column is non-null then use that value. Otherwise, if the value in the rebounds column is non-null then use that value. Otherwise, if the value in the points column is non-null then use that value.
As you can see, the coalesce command replaced each missing value of x with the corresponding value in y. Note that the last vector element is still NA, since both input vectors contain an NA value at this position. Do you need more info on the dplyr package?
Here is another more concise solution, comparing to my other one. I think the use of cur_data()
function is very helpful but you could also use across(everything())
in its place:
library(dplyr)
library(purrr)
unique(sub("(\\D)_\\d+", "\\1", names(df))) %>%
map_dfc(~ df %>%
select(starts_with(.x)) %>%
summarise(!!.x := do.call(coalesce, cur_data())))
A B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
Here is another solution for as many pairs as possible. Just note that I used bang bang
operator (!!!
) in order to collapse elements of a data frame into standalone single arguments so that I could apply coalesce
on them:
library(dplyr)
library(rlang)
as.data.frame(do.call(cbind, lapply(split.default(df, sub("(\\D)_\\d+", "\\1", names(df))), function(x) {
coalesce(!!!x)
})))
A B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
You could use transmute, e.g.
library(dplyr)
df <- data.frame(
A_1 = c(NA, NA, 3, 4, 5),
A_2 = c(1, 2, NA, NA, NA),
B_1 = c(NA, NA, 13, 14, 15),
B_2 = c(11, 12, NA, NA, NA)
)
df %>%
transmute(A = coalesce(A_1, A_2),
B = coalesce(B_1, B_2))
#> A B
#> 1 1 11
#> 2 2 12
#> 3 3 13
#> 4 4 14
#> 5 5 15
Created on 2021-12-22 by the reprex package (v2.0.1)
Another option, if you have lots of "A_*" and "B_*" columns (source: Romain François, user: @Romain Francois):
library(dplyr)
df <- data.frame(
A_1 = c(NA, NA, 3, 4, 5),
A_2 = c(1, 2, NA, NA, NA),
B_1 = c(NA, NA, 13, 14, 15),
B_2 = c(11, 12, NA, NA, NA)
)
coacross <- function(...) {
coalesce(!!!across(...))
}
df %>%
transmute(A = coacross(starts_with("A_")),
B = coacross(starts_with("B_")))
#> A B
#> 1 1 11
#> 2 2 12
#> 3 3 13
#> 4 4 14
#> 5 5 15
Created on 2021-12-22 by the reprex package (v2.0.1)
Based on your updated question, you don't have lots of "A_*" or "B_*" columns, but instead lots of "*_1", "*_2", and "*_3" columns. I think this is the most straightforward solution for your use-case:
library(dplyr)
df <- data.frame(Al_TAC4.25.275 = c(1, 1, 1, NA, NA, NA),
Al_TAC4.25.276 = c(NA, NA, 2, 2, 2, NA),
Al_TAC4.25.277 = c(NA, NA, 3, NA, NA, 3),
Au_TAC4.25.275 = c(1, 1, 1, NA, NA, NA),
Au_TAC4.25.276 = c(NA, NA, 2, 2, 2, NA),
Au_TAC4.25.277 = c(NA, NA, 3, NA, NA, NA),
Ar_TAC4.25.275 = c(1, 1, 1, NA, NA, 1),
Ar_TAC4.25.276 = c(NA, NA, 2, 2, 2, 2),
Ar_TAC4.25.277 = c(NA, NA, 3, NA, NA, 3))
df
#> Al_TAC4.25.275 Al_TAC4.25.276 Al_TAC4.25.277 Au_TAC4.25.275 Au_TAC4.25.276
#> 1 1 NA NA 1 NA
#> 2 1 NA NA 1 NA
#> 3 1 2 3 1 2
#> 4 NA 2 NA NA 2
#> 5 NA 2 NA NA 2
#> 6 NA NA 3 NA NA
#> Au_TAC4.25.277 Ar_TAC4.25.275 Ar_TAC4.25.276 Ar_TAC4.25.277
#> 1 NA 1 NA NA
#> 2 NA 1 NA NA
#> 3 3 1 2 3
#> 4 NA NA 2 NA
#> 5 NA NA 2 NA
#> 6 NA 1 2 3
names(df) %>%
split(str_extract(., '[:alpha:]+')) %>%
map_dfc(~ coalesce(!!!df[.x][c(1,2,3)]))
#> # A tibble: 6 × 3
#> Al Ar Au
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 1 1
#> 3 1 1 1
#> 4 2 2 2
#> 5 2 2 2
#> 6 3 1 NA
# change the order of the list to change the 'priority'
names(df) %>%
split(str_extract(., '[:alpha:]+')) %>%
map_dfc(~ coalesce(!!!df[.x][c(3,2,1)]))
#> # A tibble: 6 × 3
#> Al Ar Au
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 1 1
#> 3 3 3 3
#> 4 2 2 2
#> 5 2 2 2
#> 6 3 3 NA
names(df) %>%
split(str_extract(., '[:alpha:]+')) %>%
map_dfc(~ coalesce(!!!df[.x][c(2,1,3)]))
#> # A tibble: 6 × 3
#> Al Ar Au
#> <dbl> <dbl> <dbl>
#> 1 1 1 1
#> 2 1 1 1
#> 3 2 2 2
#> 4 2 2 2
#> 5 2 2 2
#> 6 3 2 NA
Created on 2021-12-22 by the reprex package (v2.0.1)
A base R option
list2DF(
lapply(
split.default(df, gsub("_.*", "", names(df))),
rowSums,
na.rm = TRUE
)
)
gives
A B
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With