Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

imputing data with median by date in R [duplicate]

I need to replace the missing values in the field "steps" by the median of "steps" calculated over that particular day (group by "date") with NA values removed. I have already referred this thread but my NA values aren't replaced. Can somebody help me find out where am I going wrong? I would prefer using base package/data table/plyr. The dataset looks approx. like this:-

        steps      date interval
    1:    NA 2012-10-01        0
    2:    NA 2012-10-01        5
    3:    NA 2012-10-01       10
    4:    NA 2012-10-01       15
    5:    NA 2012-10-01       20
   ---                          
17564:    NA 2012-11-30     2335
17565:    NA 2012-11-30     2340
17566:    NA 2012-11-30     2345
17567:    NA 2012-11-30     2350
17568:    NA 2012-11-30     2355

The structure and summary of the dataset(activity) are as shown below

 #str(activity)  
 Classes ‘data.table’ and 'data.frame': 17568 obs. of  3 variables:
     $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
     $ date    : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ...
     $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

#summary(activity)
         steps             date               interval     
     Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
     1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
     Median :  0.00   Median :2012-10-31   Median :1177.5  
     Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
     3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
     Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
     NA's   :2304     

Things I have tried:

Datatable method:

activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date]
summary(activityrepNA)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304 

Using ave

activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)]
> summary(activity)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304

Attempt at ddply

cleandatapls<-ddply(activity, 
+       .(as.character(date)), 
+       transform, 
+       steps=ifelse(is.na(steps), median(steps, na.rm=TRUE), steps))
> summary(cleandatapls)
 as.character(date)     steps             date               interval     
 Length:17568       Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 Class :character   1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Mode  :character   Median :  0.00   Median :2012-10-31   Median :1177.5  
                    Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
                    3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
                    Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
                    NA's   :2304   

Aggregate for calculating median

whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE)
> summary(whynoclean)
    Group.1               steps        date               interval   
 Min.   :2012-10-01   Min.   :0   Min.   :2012-10-01   Min.   :1178  
 1st Qu.:2012-10-16   1st Qu.:0   1st Qu.:2012-10-16   1st Qu.:1178  
 Median :2012-10-31   Median :0   Median :2012-10-31   Median :1178  
 Mean   :2012-10-31   Mean   :0   Mean   :2012-10-31   Mean   :1178  
 3rd Qu.:2012-11-15   3rd Qu.:0   3rd Qu.:2012-11-15   3rd Qu.:1178  
 Max.   :2012-11-30   Max.   :0   Max.   :2012-11-30   Max.   :1178  
                      NA's   :8                     

EDIT output for the code using mutate

activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))
Source: local data table [17,568 x 3]

   steps       date interval
1     NA 2012-10-01        0
2     NA 2012-10-01        5
3     NA 2012-10-01       10
4     NA 2012-10-01       15
5     NA 2012-10-01       20
6     NA 2012-10-01       25
7     NA 2012-10-01       30
8     NA 2012-10-01       35
9     NA 2012-10-01       40
10    NA 2012-10-01       45
..   ...        ...      ... 

UPDATE:

Steven Beaupre helped me realize that my approach for the imputing was flawed as there were specific dates having only NA values that were causing the problem as the median of NA's is NA. Used another suggested approach.

like image 289
Meeshu Avatar asked May 16 '15 15:05

Meeshu


People also ask

How do you replace missing values with median in R?

So, how do you replace missing values in R with the median? Start the mutate() function. Specify the column in which you want to replace the missing values. Use the ifelse() function to identify NA's, and once found, replace them with the median.

Should you impute with mean or median?

Imputing with the median is more robust than imputing with the mean, because it mitigates the effect of outliers.

What is generally the best method for imputing missing values of a numerical feature?

#3 — Mean/Median/Mode Imputation Imputing missing values with statistical averages is probably the most common technique, at least among beginners. You can impute missing values with the mean if the variable is normally distributed, and the median if the distribution is skewed.


1 Answers

Try:

library(dplyr)
df %>% 
  group_by(date) %>% 
  mutate(steps = ifelse(is.na(steps), median(steps, na.rm = T), steps))

If for a given date, all steps are NAs, you could replace them with 0:

df %>% 
  group_by(date) %>% 
  mutate(steps = ifelse(all(is.na(steps)), 0,
                        ifelse(is.na(steps), median(steps, na.rm = T), steps)))
like image 155
Steven Beaupré Avatar answered Nov 07 '22 08:11

Steven Beaupré