Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to replace NA in multiple columns with value from corresponding columns

Tags:

r

na

dplyr

I'm trying to clean up a data frame and I would like to replace NAs in one column, with a corresponding value from another column. I would also like to do this for multiple columns at once.

Example Data Frame.

set.seed(123) 

dates <- seq(as.Date("2016-01-01"), by = "day", length = 10)  
names <- rep(c("John Doe", "Jane Smith"), each = 5)  
var1_group <- runif(10)  
var2_group <- runif(10)  
var1_person <- runif(10)  
var2_person <- runif(10)  

myDF <- data.frame(names, var1_group, var2_group, var1_person, var2_person)  
myDF <- cbind(dates, myDF)  

After some manipulations using dplyr...

myDF <- myDF %>% mutate_each(funs(lag), contains("group"))  
myDF <- myDF %>% group_by(names) %>% mutate_each(funs(lag), contains("person"))  

I get a bunch of NAs...

        dates      names var1_group var2_group var1_person var2_person  
1  2016-01-01   John Doe         NA         NA          NA          NA  
2  2016-01-02   John Doe  0.2875775 0.95683335   0.8895393   0.9630242  
3  2016-01-03   John Doe  0.7883051 0.45333416   0.6928034   0.9022990  
4  2016-01-04   John Doe  0.4089769 0.67757064   0.6405068   0.6907053  
5  2016-01-05   John Doe  0.8830174 0.57263340   0.9942698   0.7954674  
6  2016-01-06 Jane Smith  0.9404673 0.10292468          NA          NA  
7  2016-01-07 Jane Smith  0.0455565 0.89982497   0.7085305   0.4777960  
8  2016-01-08 Jane Smith  0.5281055 0.24608773   0.5440660   0.7584595  
9  2016-01-09 Jane Smith  0.8924190 0.04205953   0.5941420   0.2164079  
10 2016-01-10 Jane Smith  0.5514350 0.32792072   0.2891597   0.3181810  

What I would like to do now, is replace the NAs from *_person columns with the corresponding value from the *_group column. (See row 6)

        dates      names var1_group var2_group var1_person var2_person  
1  2016-01-01   John Doe         NA         NA          NA          NA  
2  2016-01-02   John Doe  0.2875775 0.95683335   0.8895393   0.9630242  
3  2016-01-03   John Doe  0.7883051 0.45333416   0.6928034   0.9022990  
4  2016-01-04   John Doe  0.4089769 0.67757064   0.6405068   0.6907053  
5  2016-01-05   John Doe  0.8830174 0.57263340   0.9942698   0.7954674  
6  2016-01-06 Jane Smith  0.9404673 0.10292468   0.9404673   0.1029246     
7  2016-01-07 Jane Smith  0.0455565 0.89982497   0.7085305   0.4777960  
8  2016-01-08 Jane Smith  0.5281055 0.24608773   0.5440660   0.7584595  
9  2016-01-09 Jane Smith  0.8924190 0.04205953   0.5941420   0.2164079  
10 2016-01-10 Jane Smith  0.5514350 0.32792072   0.2891597   0.3181810 

This works for one column...

myDF$var1_person <- ifelse(is.na(myDF$var1_person), myDF$var1_group, myDF$var1_person)  

But I'd like to do it for all the columns at once. In my actual data frame, each group is around 20 columns. I've tried a bunch of other stuff, but I don't want to clutter this post with my nonsense.

*Bonus points if you can get the code to match for n variables based on column prefix.

var1_group > var1_person  
var2_group > var2_person
...
varn_group > varn_person   
like image 760
JPete Avatar asked Sep 13 '16 23:09

JPete


3 Answers

Here is one option using set from data.table which does the replacement in place

library(data.table)
#convert the data.frame to data.table
setDT(myDF)
#get the column name of 'group' and 'person' columns
nm1 <-  grep("group", names(myDF), value = TRUE)
nm2 <-  grep("person", names(myDF), value = TRUE)
#loop through the sequence of 'nm1'
for(j in seq_along(nm1)){
#set the elements in the row that are NA for each 'period' column
#with the corresponding row from 'group' column specified in the "value"
    set(myDF, i = which(is.na(myDF[[nm2[j]]])), j = nm2[j],
                    value = myDF[[nm1[j]]][is.na(myDF[[nm2[j]]])])
}

 myDF
 #        dates      names var1_group var2_group var1_person var2_person
 #1: 2016-01-01   John Doe         NA         NA          NA          NA
 #2: 2016-01-02   John Doe  0.2875775 0.95683335   0.8895393   0.9630242
 #3: 2016-01-03   John Doe  0.7883051 0.45333416   0.6928034   0.9022990
 #4: 2016-01-04   John Doe  0.4089769 0.67757064   0.6405068   0.6907053
 #5: 2016-01-05   John Doe  0.8830174 0.57263340   0.9942698   0.7954674
 #6: 2016-01-06 Jane Smith  0.9404673 0.10292468   0.9404673   0.1029247
 #7: 2016-01-07 Jane Smith  0.0455565 0.89982497   0.7085305   0.4777960
 #8: 2016-01-08 Jane Smith  0.5281055 0.24608773   0.5440660   0.7584595
 #9: 2016-01-09 Jane Smith  0.8924190 0.04205953   0.5941420   0.2164079
 #10:2016-01-10 Jane Smith  0.5514350 0.32792072   0.2891597   0.3181810
like image 95
akrun Avatar answered Nov 07 '22 18:11

akrun


Here's a "tidyverse" approach. Note that as @Gregor commented, it helps to tidy your data. The following handles this for you and also returns a somewhat tidy data frame. I'll leave it to you to get back into the original format if necessary.

Note that I've used the mutate_cond() function that can be found here.

library(tidyverse)
library(stringr)

myDF %>%
  gather(key = col, value = val, -dates, -names) %>% 
  mutate(col = str_replace(col, "var", "")) %>% 
  separate(col, into = c("var", "group")) %>%
  spread(key = group, value = val) %>% 
  mutate_cond(is.na(person), person = group)

#> Source: local data frame [20 x 5]
#> Groups: names [2]
#> 
#>         dates      names   var      group    person
#> *      <date>     <fctr> <chr>      <dbl>     <dbl>
#> 1  2016-01-01   John Doe     1         NA        NA
#> 2  2016-01-01   John Doe     2         NA        NA
#> 3  2016-01-02   John Doe     1 0.28757752 0.8895393
#> 4  2016-01-02   John Doe     2 0.95683335 0.9630242
#> 5  2016-01-03   John Doe     1 0.78830514 0.6928034
#> 6  2016-01-03   John Doe     2 0.45333416 0.9022990
#> 7  2016-01-04   John Doe     1 0.40897692 0.6405068
#> 8  2016-01-04   John Doe     2 0.67757064 0.6907053
#> 9  2016-01-05   John Doe     1 0.88301740 0.9942698
#> 10 2016-01-05   John Doe     2 0.57263340 0.7954674
#> 11 2016-01-06 Jane Smith     1 0.94046728 0.9404673
#> 12 2016-01-06 Jane Smith     2 0.10292468 0.1029247
#> 13 2016-01-07 Jane Smith     1 0.04555650 0.7085305
#> 14 2016-01-07 Jane Smith     2 0.89982497 0.4777960
#> 15 2016-01-08 Jane Smith     1 0.52810549 0.5440660
#> 16 2016-01-08 Jane Smith     2 0.24608773 0.7584595
#> 17 2016-01-09 Jane Smith     1 0.89241904 0.5941420
#> 18 2016-01-09 Jane Smith     2 0.04205953 0.2164079
#> 19 2016-01-10 Jane Smith     1 0.55143501 0.2891597
#> 20 2016-01-10 Jane Smith     2 0.32792072 0.3181810

Everything but the last line is about tidying the data. The last line (mutate_cond()) handles the substitution of NA values. If your columns are all named this way, then this should extend to any n.

like image 21
Simon Jackson Avatar answered Nov 07 '22 16:11

Simon Jackson


Here's an idea:

tag <- c("person", "group")

# Create a list of 2 elements, persons and groups.
lst <- lapply(tag, function(x) { myDF[grepl(x, colnames(myDF))] })

# Extract everything before the underscore "_" in the column names
ext <- lapply(seq_along(lst), function(x) { 
  stringi::stri_extract(colnames(lst[[x]]), regex = "^[^_]+(?=_)") })

# Find the common elements between the two
int <- intersect(ext[[1]], ext[[2]]) 

# Create a new list with only the matching subset 
match_list <- lapply(lst, function(x) { select(x, matches(paste(int, collapse = "|"))) })

# Replace all NA values in 'person' by the corresponding values in 'group'
res <- mapply(function(x, y) { replace(x, is.na(x), y[is.na(x)]) }, 
              match_list[[1]], match_list[[2]])

# Assign the result back to the original data.frame
myDF[, colnames(res)] <- res

This should ignore non-matching person/group pairs and only replace on matching vars

like image 3
Steven Beaupré Avatar answered Nov 07 '22 18:11

Steven Beaupré