I have a huge dataset and wish to replace values in certain columns (VAR1, VAR2) with NA if they do not start with AA or DD.
Data:
DF<-tibble::tribble(
~ID, ~VAR1, ~VAR1DATE, ~VAR2, ~VAR2DATE,
1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)
Desired output:
A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
Is there an elegant and smart way to this? mutate_all?
replace_with_na_all() Replaces NA for all variables. replace_with_na_at() Replaces NA on a subset of variables specified with character quotes (e.g., c(“var1”, “var2”)). replace_with_na_if() Replaces NA based on applying an operation on the subset of variables for which a predicate function (is.
How do I replace NA values on a numeric column with 0 (zero) in an R DataFrame (data. frame)? You can replace NA values with zero(0) on numeric columns of R data frame by using is.na() , replace() , imputeTS::replace() , dplyr::coalesce() , dplyr::mutate_at() , dplyr::mutate_if() , and tidyr::replace_na() functions.
To remove observations with missing values in at least one column, you can use the na. omit() function. The na. omit() function in the R language inspects all columns from a data frame and drops rows that have NA's in one or more columns.
Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var , cov , or cor , that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.
We may do this in two steps - loop across the columns that have 'VAR' followed by digits (\\d+) in column names, replace the values where the first two characters are not AA or DD to NA, then replace the corresponding DATE column to NA based on the NA in the 'VAR1', 'VAR2' columns
library(dplyr)
library(stringr)
DF %>%
mutate(across(matches("^VAR\\d+$"),
~ replace(., !substr(., 1, 2) %in% c("AA", "DD"), NA)),
across(ends_with("DATE"),
~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))
-output
# A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 <NA> <NA>
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 <NA> <NA> DDCC 2001-01-03
4 4 DDAA 2001-01-04 <NA> <NA>
5 5 <NA> <NA> <NA> <NA>
Use mutate/across with the assignment function is.na<-.
DF %>%
mutate(across(starts_with("VAR"), \(x) `is.na<-`(x, !grepl("^AA|^DD", x))))
## A tibble: 5 x 5
# ID VAR1 VAR1DATE VAR2 VAR2DATE
# <int> <chr> <chr> <chr> <chr>
#1 1 AABB NA NA NA
#2 2 AACC NA AACC NA
#3 3 NA NA DDCC NA
#4 4 DDAA NA NA NA
#5 5 NA NA NA NA
Or simpler:
DF %>%
mutate(across(starts_with("VAR"), ~`is.na<-`(., !grepl("^AA|^DD", .))))
Here is a tidyverse solution. Using across with str_replace_all and appending two ifelse statements.
library(dplyr)
library(stringr)
DF %>%
mutate(across(c(VAR1, VAR2), ~str_replace_all(., "^[^AA|DD]", NA_character_))) %>%
mutate(VAR1DATE = ifelse(is.na(VAR1), NA_character_, VAR1DATE),
VAR2DATE = ifelse(is.na(VAR2), NA_character_, VAR2DATE))
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
Another possibility, using tidyr::unite and tiydr::separate:
library(tidyverse)
DF<-tibble::tribble(
~ID, ~VAR1, ~VAR1DATE, ~VAR2, ~VAR2DATE,
1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)
DF %>%
unite(VAR1,VAR1,VAR1DATE) %>% unite(VAR2,VAR2,VAR2DATE) %>%
mutate(across(starts_with("VAR"),~if_else(str_detect(.x, "^AA|^DD"), .x, NA_character_))) %>%
separate(VAR1,into = c("VAR1", "VAR1DATE"), sep = "_") %>%
separate(VAR2,into = c("VAR2", "VAR2DATE"), sep = "_")
#> # A tibble: 5 × 5
#> ID VAR1 VAR1DATE VAR2 VAR2DATE
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 AABB 2001-01-01 <NA> <NA>
#> 2 2 AACC 2001-01-02 AACC 2001-01-02
#> 3 3 <NA> <NA> DDCC 2001-01-03
#> 4 4 DDAA 2001-01-04 <NA> <NA>
#> 5 5 <NA> <NA> <NA> <NA>
Here is another tidyverse solution with using str_detect to determine where to convert to NA for the date columns. Then, we can use the same type of function on VAR1 and VAR2.
library(tidyverse)
DF %>%
rowwise %>%
mutate(
VAR1DATE = ifelse(str_detect(VAR1, '^BB|^CC') == TRUE, NA, VAR1DATE),
VAR2DATE = ifelse(str_detect(VAR2, '^BB|^CC') == TRUE, NA, VAR2DATE)
) %>%
mutate(across(c(VAR1, VAR2), function(x)
ifelse(str_detect(x, '^BB|^CC') == TRUE, NA, x)))
Output
# A tibble: 5 × 5
# Rowwise:
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With