I have a dataframe that looks like this:
set.seed(300)
df <- data.frame(site = sort(rep(paste0("site", 1:5), 5)),
                 value = sample(c(1:5, NA), replace = TRUE, 25))
df
site value
1 site1 NA
2 site1 5
3 site1 5
4 site1 5
5 site1 5
6 site2 1
7 site2 5
8 site2 3
9 site2 3
10 site2 NA
11 site3 NA
12 site3 2
13 site3 5
14 site3 4
15 site3 4
16 site4 NA
17 site4 NA
18 site4 4
19 site4 4
20 site4 4
21 site5 NA
22 site5 3
23 site5 3
24 site5 1
25 site5 1
As you can see, there are several missing values in the value column. I need to replace each missing value in the value column with the mean for its site. So if there is a missing value for value measured at site1, I need to impute the mean value for site1. However, the dataframe is constantly being added to and re-imported into R. The next time I import it, it will likely have grown to something like 50 rows, with many more missing values in value. I need a function that automatically detects which site a missing value in value was measured at and imputes the mean for that particular site. Could anybody help me with this?
Using impute() from package Hmisc and ddply() from package plyr:
library(plyr)
library(Hmisc)
# Split df by site, then fill each site's NAs with that site's mean
df2 <- ddply(df, "site", mutate, imputed.value = impute(value, mean))
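If you would rather not depend on plyr and Hmisc, here is a minimal base R sketch of the same idea wrapped in a function, since the question asks for something reusable. Note that this swaps in ave() for the ddply/impute combination above, and that impute_site_mean is a made-up name for illustration, not a function from any package. It assumes the columns are called site and value, as in the question.

```r
# Hypothetical helper (name chosen for illustration only).
# Assumes df has a grouping column `site` and a numeric column `value`.
impute_site_mean <- function(df) {
  # ave() applies FUN within each site and returns the results
  # in the original row order.
  df$imputed.value <- ave(df$value, df$site, FUN = function(x) {
    x[is.na(x)] <- mean(x, na.rm = TRUE)  # fill NAs with the group mean
    x
  })
  df
}

# Small made-up example, not the data from the question
example <- data.frame(site  = c("site1", "site1", "site1", "site2", "site2"),
                      value = c(NA, 4, 6, 2, NA))
result <- impute_site_mean(example)
result
```

Because the function only looks at whatever rows it is given, it can be re-run each time the dataframe is re-imported. One caveat: if every value at a site is missing, the group mean is NaN, so those rows end up as NaN rather than a number.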
First, you can get the distinct sites.
# unique() works for both factor and character columns; levels() returns NULL for character
sites <- unique(as.character(df$site))
You can then get the mean for each site.
n_sites <- length(sites)
meanlist <- numeric(n_sites)
# seq_len() is safe even when there are zero sites, unlike 1:n
for (i in seq_len(n_sites))
  meanlist[i] <- mean(df[df[, 1] == sites[i], 2], na.rm = TRUE)
Then you can fill in each of the NA values. There's probably a faster way, but as long as your data set isn't huge, you can do it with for loops.
for (i in seq_len(nrow(df)))
  if (is.na(df[i, 2]))
    df[i, 2] <- meanlist[which(sites == df[i, 1])]
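For what it's worth, the two loops above can probably be collapsed into a vectorized lookup. This is only a sketch on a small made-up dataframe (example and site_means are names I picked for illustration), and it assumes the columns are named site and value:

```r
# Small made-up example, not the data from the question
example <- data.frame(site  = c("site1", "site1", "site1", "site2", "site2"),
                      value = c(NA, 4, 6, 2, NA))

# One mean per site, returned as a vector named by site
site_means <- tapply(example$value, example$site, mean, na.rm = TRUE)

# Look up the mean for the site of each missing row and fill it in
missing <- is.na(example$value)
example$value[missing] <- site_means[as.character(example$site[missing])]
example
```

This follows the same two steps as the loops (compute the per-site means, then fill each NA by matching on site), but lets tapply() and indexing by name do the work.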
Hope this helps.