I have a data frame which is arranged by descending order of date.
ps1 = data.frame(
  userID = c(21, 21, 21, 22, 22, 22, 23, 23, 23),
  color  = c(NA, 'blue', 'red', 'blue', NA, NA, 'red', NA, 'gold'),
  age    = c('3yrs', '2yrs', NA, NA, '3yrs', NA, NA, '4yrs', NA),
  gender = c('F', NA, 'M', NA, NA, 'F', 'F', NA, 'F')
)
I wish to impute (replace) NA values with the previous value, grouped by userID. If the first row of a userID group is NA, it should be replaced with the next available value in that group.
I am trying to use the dplyr and zoo packages, something like this, but it's not working:
cleanedFUG <- filteredUserGroup %>%
  group_by(UserID) %>%
  mutate(Age1 = na.locf(Age),
         Color1 = na.locf(Color),
         Gender1 = na.locf(Gender))
I need result df like this:
  userID color  age gender
1     21  blue 3yrs      F
2     21  blue 2yrs      F
3     21   red 2yrs      M
4     22  blue 3yrs      F
5     22  blue 3yrs      F
6     22  blue 3yrs      F
7     23   red 4yrs      F
8     23   red 4yrs      F
9     23  gold 4yrs      F
You can replace NA values with zero (0) in numeric columns of an R data frame by using the is.na(), replace(), imputeTS::na_replace(), dplyr::coalesce(), dplyr::mutate_at(), dplyr::mutate_if(), and tidyr::replace_na() functions.
So, how do you replace missing values with basic R code? To replace the missing values, you first identify the NA's with the is.na() function and the $-operator. Then, you use the min() function to replace the NA's with the lowest value.
In pandas (Python), the fillna() function is used to fill NA/NaN values using the specified method.
require(tidyverse) # fill is part of tidyr

ps1 %>%
  group_by(userID) %>%
  fill(color, age, gender) %>% # default direction is down
  fill(color, age, gender, .direction = "up")
Which gives you:
Source: local data frame [9 x 4]
Groups: userID [3]

  userID  color    age gender
   <dbl> <fctr> <fctr> <fctr>
1     21   blue   3yrs      F
2     21   blue   2yrs      F
3     21    red   2yrs      M
4     22   blue   3yrs      F
5     22   blue   3yrs      F
6     22   blue   3yrs      F
7     23    red   4yrs      F
8     23    red   4yrs      F
9     23   gold   4yrs      F
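The two fill() passes can probably be collapsed into one. The sketch below assumes tidyr 1.0.0 or later, where fill() accepts .direction = "downup" (fill down first, then up); the ps1 data is repeated so the snippet runs on its own, and the name filled is only illustrative.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
ps1 <- data.frame(
  userID = c(21, 21, 21, 22, 22, 22, 23, 23, 23),
  color  = c(NA, "blue", "red", "blue", NA, NA, "red", NA, "gold"),
  age    = c("3yrs", "2yrs", NA, NA, "3yrs", NA, NA, "4yrs", NA),
  gender = c("F", NA, "M", NA, NA, "F", "F", NA, "F")
)

# "downup" carries the previous value forward within each userID,
# then fills any remaining leading NA from the next value
filled <- ps1 %>%
  group_by(userID) %>%
  fill(color, age, gender, .direction = "downup") %>%
  ungroup()

print(filled)
```

Calling ungroup() at the end is optional; it just avoids carrying the grouping into later steps.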
Using zoo::na.locf directly on the whole data.frame would fill the NAs regardless of the userID groups. Unfortunately, dplyr's grouping has no effect when na.locf is applied to the whole data frame, which is why I went with a split:
library(dplyr); library(zoo)

ps1 %>%
  split(ps1$userID) %>%
  lapply(function(x) {na.locf(na.locf(x), fromLast = TRUE)}) %>%
  do.call(rbind, .)

####      userID color  age gender
#### 21.1     21  blue 3yrs      F
#### 21.2     21  blue 2yrs      F
#### 21.3     21   red 2yrs      M
#### 22.4     22  blue 3yrs      F
#### 22.5     22  blue 3yrs      F
#### 22.6     22  blue 3yrs      F
#### 23.7     23   red 4yrs      F
#### 23.8     23   red 4yrs      F
#### 23.9     23  gold 4yrs      F
It first splits the data into 3 data.frames. The anonymous function in lapply then applies a first pass of imputation (downwards) followed by a second pass (upwards), and finally rbind brings the data.frames back together. This gives the expected output.
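As a possible alternative to splitting, na.locf can be applied column by column inside a grouped mutate(). By default na.locf uses na.rm = TRUE and drops leading NAs, so the result can be shorter than the group, which mutate() rejects; passing na.rm = FALSE keeps the length unchanged. The sketch below assumes dplyr 1.0.0 or later for across(); the name filled_locf is only illustrative.

```r
library(dplyr)
library(zoo)

# Same example data as in the question
ps1 <- data.frame(
  userID = c(21, 21, 21, 22, 22, 22, 23, 23, 23),
  color  = c(NA, "blue", "red", "blue", NA, NA, "red", NA, "gold"),
  age    = c("3yrs", "2yrs", NA, NA, "3yrs", NA, NA, "4yrs", NA),
  gender = c("F", NA, "M", NA, NA, "F", "F", NA, "F")
)

# Inner call: carry the last observation forward, keeping leading NAs.
# Outer call: fill what is left from the next observation (fromLast = TRUE).
filled_locf <- ps1 %>%
  group_by(userID) %>%
  mutate(across(
    c(color, age, gender),
    ~ na.locf(na.locf(.x, na.rm = FALSE), fromLast = TRUE, na.rm = FALSE)
  )) %>%
  ungroup()

print(filled_locf)
```

Because each column is filled inside its own group, values never leak from one userID to the next.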