Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I remove NAs with the tidyr::unite function?

Tags:

r

tidyr

After combining several columns with tidyr::unite(), NAs from missing data remain in my character vector, which I do not want.

I have a series of medical diagnoses per row (1 per column) and would like to benchmark searching for a series of codes via. %in% and grepl().

There is an open issue on Github on this problem, is there any movement - or work arounds? I would like to keep the vector comma-separated.

Here is a representative example:

library(dplyr)
library(tidyr)

df <- data_frame(a = paste0("A.", rep(1, 3)), b = " ", c = c("C.1", "C.3", " "), d = "D.4", e = "E.5")

cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
tidyr::unite(df, new, cols, sep = ",")

Current output:

# # A tibble: 3 x 3
#   a     new        e    
#   <chr> <chr>      <chr>
# 1 A.1   NA,C.1,D.4 E.5  
# 2 A.1   NA,C.3,D.4 E.5  
# 3 A.1   NA,NA,D.4  E.5 

Desired output:

# # A tibble: 3 x 3
#   a     new        e    
#   <chr> <chr>      <chr>
# 1 A.1   C.1,D.4    E.5  
# 2 A.1   C.3,D.4    E.5  
# 3 A.1   D.4        E.5 
like image 802
Paul Avatar asked Oct 09 '18 02:10

Paul


2 Answers

In the new tidyr , you can now use na.rm parameter to remove NA values.

library(tidyr)
library(dplyr)

df %>% unite(new, cols, sep = ",", na.rm = TRUE)

#   a     new     e    
#  <chr> <chr>   <chr>
#1 A.1   C.1,D.4 E.5  
#2 A.1   C.3,D.4 E.5  
#3 A.1   D.4     E.5  

However, NAs would not be removed if have columns are factors. We need to change them to character before using unite.

df %>% 
  mutate_all(as.character) %>%
  unite(new, cols, sep = ",", na.rm = TRUE)

You could also use base R apply method for the same.

apply(df[cols], 1, function(x) toString(na.omit(x)))
#[1] "C.1, D.4" "C.3, D.4" "D.4" 

data

df <- data_frame(
a = c("A.1", "A.1", "A.1"),
b = c(NA_character_, NA_character_, NA_character_),
c = c("C.1", "C.3", NA),
d = c("D.4", "D.4", "D.4"),
e = c("E.5", "E.5", "E.5")
)

cols <- letters[2:4]
like image 70
Ronak Shah Avatar answered Sep 19 '22 16:09

Ronak Shah


You could use regex to remove the NAs after they are created:

library(dplyr)
library(tidyr)

df <- data_frame(a = paste0("A.", rep(1, 3)), 
                 b = " ", 
                 c = c("C.1", "C.3", " "), 
                 d = "D.4", e = "E.5")

cols <- letters[2:4]
df[, cols] <- gsub(" ", NA_character_, as.matrix(df[, cols]))
tidyr::unite(df, new, cols, sep = ",") %>% 
     dplyr::mutate(new = stringr::str_replace_all(new, 'NA,?', ''))  # New line

Output:

# A tibble: 3 x 3
  a     new     e    
  <chr> <chr>   <chr>
1 A.1   C.1,D.4 E.5  
2 A.1   C.3,D.4 E.5  
3 A.1   D.4     E.5  
like image 30
CT Hall Avatar answered Sep 21 '22 16:09

CT Hall