Function to impute missing value [duplicate]

Tags: r, missing-data

I have a dataframe that looks like this:

set.seed(300)
df <- data.frame(site = sort(rep(paste0("site", 1:5), 5)),
                 value = sample(c(1:5, NA), replace = TRUE, 25))

df 

    site value
1  site1    NA
2  site1     5
3  site1     5
4  site1     5
5  site1     5
6  site2     1
7  site2     5
8  site2     3
9  site2     3
10 site2    NA
11 site3    NA
12 site3     2
13 site3     5
14 site3     4
15 site3     4
16 site4    NA
17 site4    NA
18 site4     4
19 site4     4
20 site4     4
21 site5    NA
22 site5     3
23 site5     3
24 site5     1
25 site5     1    

As you can see, there are several missing values in the value column. I need to replace each missing value with the mean for its site: if value is missing for a row measured at site1, I need to impute the mean of the non-missing values for site1. However, the dataframe is constantly being added to and re-imported into R; the next time I import it, it will likely have grown to something like 50 rows, with many more missing values in value. I need a function that automatically detects which site each missing value was measured at and imputes the mean for that site. Could anybody help me with this?

luciano asked Dec 08 '22 11:12


2 Answers

Using impute() from package Hmisc and ddply from package plyr:

library(plyr)
library(Hmisc)

# Within each site, fill NA entries of value with that site's mean
df2 <- ddply(df, "site", mutate, imputed.value = impute(value, mean))
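
Since the question asks for a reusable function, the same idea can be sketched in base R without the plyr and Hmisc dependencies. This is not the ddply()/impute() approach above but a swapped-in alternative built on ave(); the function name impute_group_mean and its arguments group_col and value_col are hypothetical, chosen here for illustration.

```r
# Hypothetical helper (name and arguments are assumptions, not from the answer):
# replace each NA in `value_col` with the mean of the non-missing values
# that share the same `group_col` entry.
impute_group_mean <- function(data, group_col = "site", value_col = "value") {
  # Work in double precision so that means are not truncated to integers
  values <- as.numeric(data[[value_col]])
  # ave() returns a vector as long as `values`, holding each row's group mean
  group_means <- ave(values, data[[group_col]],
                     FUN = function(x) mean(x, na.rm = TRUE))
  missing <- is.na(values)
  values[missing] <- group_means[missing]
  data[[value_col]] <- values
  data
}

# Small made-up data set for demonstration
example <- data.frame(site  = c("a", "a", "a", "b", "b"),
                      value = c(NA, 2, 4, 1, NA))
filled <- impute_group_mean(example)
```

Because the function only looks at the columns it is given, it can be re-run on each fresh import regardless of how many rows or sites have been added. A site whose values are all missing has no mean to impute, so its entries come back as NaN.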

nacnudus answered Jan 06 '23 09:01


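First, get the distinct sites. (In R 4.0 and later, data.frame() no longer converts strings to factors by default, so levels(df$site) on its own would return NULL; wrapping the column in factor() works either way.)

sites <- levels(factor(df$site))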

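You can then compute the mean of value for each site, ignoring missing values:

nlevels <- length(sites)
meanlist <- numeric(nlevels)
# seq_len() is safe even if there are no sites, unlike 1:nlevels
for (i in seq_len(nlevels))
    meanlist[i] <- mean(df[df[, 1] == sites[i], 2], na.rm = TRUE)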

Then you can fill in each of the NA values. There's probably a faster way, but as long as your data set isn't huge, you can do it with a for loop.

# Replace each missing value with the mean of the site on that row
for (i in seq_len(nrow(df)))
    if (is.na(df[i, 2]))
        df[i, 2] <- meanlist[which(sites == df[i, 1])]
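
As for a faster way, one possible loop-free version is sketched below. It assumes the same site and value column names as the question; the small data frame is made up for the demonstration, and site_means and na_rows are names introduced here. tapply() builds a vector of means named by site, and the missing rows are then filled by looking up each row's site name in it.

```r
# Made-up example data; in practice df would be the imported data frame
df <- data.frame(site  = c("site1", "site1", "site1", "site2", "site2"),
                 value = c(NA, 5, 3, 2, NA))

# Per-site means, named by site, ignoring missing values
site_means <- tapply(df$value, df$site, mean, na.rm = TRUE)

# Logical index of the rows that need filling
na_rows <- is.na(df$value)

# Look up the mean for each missing row by its site name
df$value[na_rows] <- site_means[as.character(df$site[na_rows])]
```

With this data, the missing site1 entry becomes 4 (the mean of 5 and 3) and the missing site2 entry becomes 2.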

Hope this helps.

Max Candocia answered Jan 06 '23 08:01