I have a huge dataset and wish to replace values in certain columns (VAR1, VAR2) with NA if they do not start with AA or DD.
Data:
DF<-tibble::tribble(
~ID, ~VAR1, ~VAR1DATE, ~VAR2, ~VAR2DATE,
1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)
Desired output:
A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
Is there an elegant and smart way to this? mutate_all
?
replace_with_na_all() Replaces NA for all variables. replace_with_na_at() Replaces NA on a subset of variables specified with character quotes (e.g., c(“var1”, “var2”)). replace_with_na_if() Replaces NA based on applying an operation on the subset of variables for which a predicate function (is.
How do I replace NA values on a numeric column with 0 (zero) in an R DataFrame (data. frame)? You can replace NA values with zero(0) on numeric columns of R data frame by using is.na() , replace() , imputeTS::replace() , dplyr::coalesce() , dplyr::mutate_at() , dplyr::mutate_if() , and tidyr::replace_na() functions.
To remove observations with missing values in at least one column, you can use the na. omit() function. The na. omit() function in the R language inspects all columns from a data frame and drops rows that have NA's in one or more columns.
Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var , cov , or cor , that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.
We may do this in two steps - loop across
the columns that have 'VAR' followed by digits (\\d+
) in column names, replace
the values where the first two characters are not AA
or DD
to NA
, then replace
the corresponding DATE
column to NA
based on the NA
in the 'VAR1', 'VAR2' columns
library(dplyr)
library(stringr)
DF %>%
mutate(across(matches("^VAR\\d+$"),
~ replace(., !substr(., 1, 2) %in% c("AA", "DD"), NA)),
across(ends_with("DATE"),
~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))
-output
# A tibble: 5 × 5
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 <NA> <NA>
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 <NA> <NA> DDCC 2001-01-03
4 4 DDAA 2001-01-04 <NA> <NA>
5 5 <NA> <NA> <NA> <NA>
Use mutate/across
with the assignment function is.na<-
.
DF %>%
mutate(across(starts_with("VAR"), \(x) `is.na<-`(x, !grepl("^AA|^DD", x))))
## A tibble: 5 x 5
# ID VAR1 VAR1DATE VAR2 VAR2DATE
# <int> <chr> <chr> <chr> <chr>
#1 1 AABB NA NA NA
#2 2 AACC NA AACC NA
#3 3 NA NA DDCC NA
#4 4 DDAA NA NA NA
#5 5 NA NA NA NA
Or simpler:
DF %>%
mutate(across(starts_with("VAR"), ~`is.na<-`(., !grepl("^AA|^DD", .))))
Here is a tidyverse solution. Using across
with str_replace_all
and appending two ifelse
statements.
library(dplyr)
library(stringr)
DF %>%
mutate(across(c(VAR1, VAR2), ~str_replace_all(., "^[^AA|DD]", NA_character_))) %>%
mutate(VAR1DATE = ifelse(is.na(VAR1), NA_character_, VAR1DATE),
VAR2DATE = ifelse(is.na(VAR2), NA_character_, VAR2DATE))
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
Another possibility, using tidyr::unite
and tiydr::separate
:
library(tidyverse)
DF<-tibble::tribble(
~ID, ~VAR1, ~VAR1DATE, ~VAR2, ~VAR2DATE,
1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)
DF %>%
unite(VAR1,VAR1,VAR1DATE) %>% unite(VAR2,VAR2,VAR2DATE) %>%
mutate(across(starts_with("VAR"),~if_else(str_detect(.x, "^AA|^DD"), .x, NA_character_))) %>%
separate(VAR1,into = c("VAR1", "VAR1DATE"), sep = "_") %>%
separate(VAR2,into = c("VAR2", "VAR2DATE"), sep = "_")
#> # A tibble: 5 × 5
#> ID VAR1 VAR1DATE VAR2 VAR2DATE
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 AABB 2001-01-01 <NA> <NA>
#> 2 2 AACC 2001-01-02 AACC 2001-01-02
#> 3 3 <NA> <NA> DDCC 2001-01-03
#> 4 4 DDAA 2001-01-04 <NA> <NA>
#> 5 5 <NA> <NA> <NA> <NA>
Here is another tidyverse
solution with using str_detect
to determine where to convert to NA
for the date columns. Then, we can use the same type of function on VAR1
and VAR2
.
library(tidyverse)
DF %>%
rowwise %>%
mutate(
VAR1DATE = ifelse(str_detect(VAR1, '^BB|^CC') == TRUE, NA, VAR1DATE),
VAR2DATE = ifelse(str_detect(VAR2, '^BB|^CC') == TRUE, NA, VAR2DATE)
) %>%
mutate(across(c(VAR1, VAR2), function(x)
ifelse(str_detect(x, '^BB|^CC') == TRUE, NA, x)))
Output
# A tibble: 5 × 5
# Rowwise:
ID VAR1 VAR1DATE VAR2 VAR2DATE
<int> <chr> <chr> <chr> <chr>
1 1 AABB 2001-01-01 NA NA
2 2 AACC 2001-01-02 AACC 2001-01-02
3 3 NA NA DDCC 2001-01-03
4 4 DDAA 2001-01-04 NA NA
5 5 NA NA NA NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With