Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

replace values with NA in several columns

Tags:

r

I have a huge dataset and wish to replace values in certain columns (VAR1, VAR2) with NA if they do not start with AA or DD.

Data:

DF<-tibble::tribble(
  ~ID,  ~VAR1,    ~VAR1DATE,  ~VAR2,    ~VAR2DATE,
   1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
   2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
   3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
   4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
   5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
  )

Desired output:

A tibble: 5 × 5
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA  

Is there an elegant and smart way to this? mutate_all?

like image 525
hkolvs Avatar asked Dec 27 '21 18:12

hkolvs


People also ask

How do you replace specific values with Na?

replace_with_na_all() Replaces NA for all variables. replace_with_na_at() Replaces NA on a subset of variables specified with character quotes (e.g., c(“var1”, “var2”)). replace_with_na_if() Replaces NA based on applying an operation on the subset of variables for which a predicate function (is.

How do I replace Na with 0 in a specific column in R?

How do I replace NA values on a numeric column with 0 (zero) in an R DataFrame (data. frame)? You can replace NA values with zero(0) on numeric columns of R data frame by using is.na() , replace() , imputeTS::replace() , dplyr::coalesce() , dplyr::mutate_at() , dplyr::mutate_if() , and tidyr::replace_na() functions.

How do I remove Nas from multiple columns in R?

To remove observations with missing values in at least one column, you can use the na. omit() function. The na. omit() function in the R language inspects all columns from a data frame and drops rows that have NA's in one or more columns.

How do you replace zero with Na?

Now you could replace zeroes by NULL in a data frame in the sense of completely removing all the rows containing at least one zero. When using, e.g., var , cov , or cor , that is actually equivalent to first replacing zeroes with NA and setting the value of use as "complete.


Video Answer


5 Answers

We may do this in two steps - loop across the columns that have 'VAR' followed by digits (\\d+) in column names, replace the values where the first two characters are not AA or DD to NA, then replace the corresponding DATE column to NA based on the NA in the 'VAR1', 'VAR2' columns

library(dplyr)
library(stringr)
DF %>%
    mutate(across(matches("^VAR\\d+$"),
        ~ replace(., !substr(., 1, 2)  %in% c("AA", "DD"), NA)), 
      across(ends_with("DATE"), 
     ~ replace(., is.na(get(str_remove(cur_column(), "DATE"))), NA)))

-output

# A tibble: 5 × 5
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 <NA>  <NA>      
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 <NA>  <NA>       DDCC  2001-01-03
4     4 DDAA  2001-01-04 <NA>  <NA>      
5     5 <NA>  <NA>       <NA>  <NA>      
like image 189
akrun Avatar answered Oct 21 '22 23:10

akrun


Use mutate/across with the assignment function is.na<-.

DF %>%
  mutate(across(starts_with("VAR"), \(x) `is.na<-`(x, !grepl("^AA|^DD", x))))
## A tibble: 5 x 5
#     ID VAR1  VAR1DATE VAR2  VAR2DATE
#  <int> <chr> <chr>    <chr> <chr>   
#1     1 AABB  NA       NA    NA      
#2     2 AACC  NA       AACC  NA      
#3     3 NA    NA       DDCC  NA      
#4     4 DDAA  NA       NA    NA      
#5     5 NA    NA       NA    NA      

Or simpler:

DF %>%
  mutate(across(starts_with("VAR"), ~`is.na<-`(., !grepl("^AA|^DD", .))))
like image 32
Rui Barradas Avatar answered Oct 22 '22 01:10

Rui Barradas


Here is a tidyverse solution. Using across with str_replace_all and appending two ifelse statements.

library(dplyr)
library(stringr)

DF %>% 
  mutate(across(c(VAR1, VAR2), ~str_replace_all(., "^[^AA|DD]", NA_character_))) %>% 
  mutate(VAR1DATE = ifelse(is.na(VAR1), NA_character_, VAR1DATE),
         VAR2DATE = ifelse(is.na(VAR2), NA_character_, VAR2DATE))
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA 
like image 41
TarJae Avatar answered Oct 22 '22 00:10

TarJae


Another possibility, using tidyr::unite and tiydr::separate:

library(tidyverse)

DF<-tibble::tribble(
  ~ID,  ~VAR1,    ~VAR1DATE,  ~VAR2,    ~VAR2DATE,
  1L, "AABB", "2001-01-01", "BBAA", "2001-01-01",
  2L, "AACC", "2001-01-02", "AACC", "2001-01-02",
  3L, "CCDD", "2001-01-03", "DDCC", "2001-01-03",
  4L, "DDAA", "2001-01-04", "CCBB", "2001-01-04",
  5L, "CCBB", "2001-01-05", "CCBB", "2001-01-05"
)

DF %>% 
  unite(VAR1,VAR1,VAR1DATE) %>% unite(VAR2,VAR2,VAR2DATE) %>% 
  mutate(across(starts_with("VAR"),~if_else(str_detect(.x, "^AA|^DD"), .x, NA_character_))) %>% 
  separate(VAR1,into = c("VAR1", "VAR1DATE"), sep = "_") %>% 
  separate(VAR2,into = c("VAR2", "VAR2DATE"), sep = "_")

#> # A tibble: 5 × 5
#>      ID VAR1  VAR1DATE   VAR2  VAR2DATE  
#>   <int> <chr> <chr>      <chr> <chr>     
#> 1     1 AABB  2001-01-01 <NA>  <NA>      
#> 2     2 AACC  2001-01-02 AACC  2001-01-02
#> 3     3 <NA>  <NA>       DDCC  2001-01-03
#> 4     4 DDAA  2001-01-04 <NA>  <NA>      
#> 5     5 <NA>  <NA>       <NA>  <NA>
like image 24
PaulS Avatar answered Oct 21 '22 23:10

PaulS


Here is another tidyverse solution with using str_detect to determine where to convert to NA for the date columns. Then, we can use the same type of function on VAR1 and VAR2.

library(tidyverse)

DF %>%
  rowwise %>%
  mutate(
    VAR1DATE = ifelse(str_detect(VAR1, '^BB|^CC') == TRUE, NA, VAR1DATE),
    VAR2DATE = ifelse(str_detect(VAR2, '^BB|^CC') == TRUE, NA, VAR2DATE)
  ) %>%
  mutate(across(c(VAR1, VAR2), function(x)
    ifelse(str_detect(x, '^BB|^CC') == TRUE, NA, x)))

Output

# A tibble: 5 × 5
# Rowwise: 
     ID VAR1  VAR1DATE   VAR2  VAR2DATE  
  <int> <chr> <chr>      <chr> <chr>     
1     1 AABB  2001-01-01 NA    NA        
2     2 AACC  2001-01-02 AACC  2001-01-02
3     3 NA    NA         DDCC  2001-01-03
4     4 DDAA  2001-01-04 NA    NA        
5     5 NA    NA         NA    NA      
like image 1
AndrewGB Avatar answered Oct 21 '22 23:10

AndrewGB