How to replace NA with mean by group / subset?

I have a dataframe with the lengths and widths of various arthropods from the guts of salamanders. Because some guts had thousands of certain prey items, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey. I want to keep the dataframe and just add imputed columns (length2, width2). The main reason is that each row also has columns with data on the date and location the salamander was collected. I could fill in the NA with a random selection of the measured individuals but for the sake of argument let's assume I just want to replace each NA with the mean.

For example, imagine I have a dataframe that looks something like:

id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA

In reality I have more columns, about 25 different taxa, and ~30,000 prey items in total. It seems like the plyr package might be ideal for this but I just can't figure out how to do it. I'm not very R or programming savvy but I'm trying to learn.

Not that I know what I'm doing but I'll try to create a small dataset to play with if it helps.

exampleDF <- data.frame(
  id = 1:100,
  taxa = c(rep("collembola", 50), rep("mite", 25), rep("ant", 25)),
  length = c(rnorm(40, 1, 0.5), rep(NA, 10), rnorm(20, 0.8, 0.1), rep(NA, 5),
             rnorm(20, 2.5, 0.5), rep(NA, 5)),
  width = c(rnorm(40, 0.5, 0.25), rep(NA, 10), rnorm(20, 0.3, 0.01), rep(NA, 5),
            rnorm(20, 1, 0.1), rep(NA, 5))
)

Here are a few things I've tried (that haven't worked):

# mean imputation to recode NA in length and width with means
# (could do random imputation but unnecessary here)
mean.imp <- function(x) {
  missing <- is.na(x)
  n.missing <- sum(missing)
  x.obs <- x[!missing]
  imputed <- x
  imputed[missing] <- mean(x.obs)
  return(imputed)
}

mean.imp(exampleDF[exampleDF$taxa == "collembola", "length"])

n.taxa <- length(unique(exampleDF$taxa))
for (i in 1:n.taxa) {
  mean.imp(exampleDF[exampleDF$taxa == unique(exampleDF$taxa)[i], "length"])
} # no way to get back into dataframe in proper places, try plyr?

Another attempt:

imp.mean <- function(x) {
  a <- mean(x, na.rm = TRUE)
  return (ifelse (is.na(x) == TRUE , a, x)) 
 } # tried but not sure how to use this in ddply

Diet2 <- ddply(exampleDF, .(taxa), transform, length2 = function(x) {
  a <- mean(exampleDF$length, na.rm = TRUE)
  return (ifelse (is.na(exampleDF$length) == TRUE , a, exampleDF$length)) 
  })

Any suggestions?

asked Feb 17 '12 04:02

djhocking

1 Answer

Not my own technique; I saw it on the boards a while back:

dat <- read.table(text = "id    taxa        length  width
101   collembola  2.1     0.9
102   mite        0.9     0.7
103   mite        1.1     0.8
104   collembola  NA      NA
105   collembola  1.5     0.5
106   mite        NA      NA", header=TRUE)


library(plyr)
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dat2 <- ddply(dat, ~ taxa, transform, length = impute.mean(length),
     width = impute.mean(width))

dat2[order(dat2$id), ] #plyr orders by group so we have to reorder

Edit: A non-plyr approach with a for loop:

for (i in which(sapply(dat, is.numeric))) {
    for (j in which(is.na(dat[, i]))) {
        dat[j, i] <- mean(dat[dat[, "taxa"] == dat[j, "taxa"], i],  na.rm = TRUE)
    }
}
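
The same per-group fill can probably also be done without an explicit loop. Below is a minimal base R sketch using ave(), which is not part of the original answer. It writes the imputed values to new columns (length2, width2), as the question asks, and keeps the original row order. The name dat_demo is a hypothetical stand-in so that dat is left untouched, and impute.mean is repeated only to make the block self-contained.

```r
# Same toy data as above, under a hypothetical name so `dat` is not modified
dat_demo <- read.table(text = "id taxa length width
101 collembola 2.1 0.9
102 mite 0.9 0.7
103 mite 1.1 0.8
104 collembola NA NA
105 collembola 1.5 0.5
106 mite NA NA", header = TRUE)

# Same helper as in the plyr example
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies FUN within each taxa group and returns the result
# in the original row order, so no reordering is needed afterwards
dat_demo$length2 <- ave(dat_demo$length, dat_demo$taxa, FUN = impute.mean)
dat_demo$width2  <- ave(dat_demo$width,  dat_demo$taxa, FUN = impute.mean)

dat_demo
```

Because the original length and width columns keep their NA values, it stays possible to tell measured individuals from imputed ones later.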

Edit: many moons later, here are data.table and dplyr approaches:

data.table

library(data.table)
setDT(dat)

dat[, length := impute.mean(length), by = taxa][,
    width := impute.mean(width), by = taxa]

dplyr

library(dplyr)

dat %>%
    group_by(taxa) %>%
    mutate(
        length = impute.mean(length),
        width = impute.mean(width)  
    )
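
One edge case worth checking with any of the approaches above: if a taxon has no measured individuals at all, mean(x, na.rm = TRUE) returns NaN, so impute.mean fills that group with NaN rather than leaving it NA. The sketch below illustrates this in base R. impute.mean.safe is a hypothetical variant, not part of the original answer, that simply leaves such groups untouched.

```r
# Same helper as in the answer above
impute.mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# A group in which nothing was measured
all_missing <- c(NA_real_, NA_real_)

# The mean of zero observed values is NaN, so every entry becomes NaN
impute.mean(all_missing)

# Hypothetical variant: return the input unchanged when there is nothing to average
impute.mean.safe <- function(x) {
  if (all(is.na(x))) return(x)
  replace(x, is.na(x), mean(x, na.rm = TRUE))
}

impute.mean.safe(all_missing)   # stays NA
impute.mean.safe(c(1, NA, 3))   # the NA is filled with the group mean
```

Whether NaN or NA is the better placeholder depends on the downstream analysis; the variant only makes the choice explicit.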

answered Oct 13 '22 04:10

Tyler Rinker
