Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

aggregate methods treat missing values (NA) differently

Tags:

r

aggregate

na

Here's a simple data frame with a missing value:

M = data.frame( Name = c('name', 'name'), Col1 = c(NA, 1) , Col2 = c(1, 1)) #   Name Col1 Col2 # 1 name   NA    1 # 2 name    1    1 

When I use aggregate to sum variables by group ('Name') using the formula method:

aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)

the result is:

# RowName Col1 Col2 #    name    1    1 

So the entire first row, which have an NA, is ignored. But if use the "non-formula" specification:

aggregate(M[, 2:3], by = list(M$Name), FUN = sum, na.rm = TRUE)

the result is:

# Group.1 Col1 Col2 #    name    1    2 

Here only the (1,1) entry is ignored.

This caused a major debugging headache in one of my codes, since I thought these two calls were equivalent. Is there a good reason why the formula entry method is treated differently?

Thanks.

like image 935
Ryan Walker Avatar asked May 30 '13 19:05

Ryan Walker


People also ask

Why we replace the missing values by Na?

Using replace_with_na() works well when we know the exact value to be replaced, and for which variables we want to replace, providing there are not many variables.

How do you deal with NA values in data?

When dealing with missing data, data scientists can use two primary methods to solve the error: imputation or the removal of data. The imputation method develops reasonable guesses for missing data. It's most useful when the percentage of missing data is low.

Which functions can be used to test a data set for missing or NA values?

nan() Function for Finding Missing values: A logical vector is returned by this function that indicates all the NaN values present. It returns a Boolean value.


2 Answers

Good question, but in my opinion, this shouldn't have caused a major debugging headache because it is documented quite clearly in multiple places in the manual page for aggregate.

First, in the usage section:

## S3 method for class 'formula' aggregate(formula, data, FUN, ...,           subset, na.action = na.omit) 

Later, in the description:

na.action: a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.


I can't answer why the formula mode was written differently---that's something the function authors would have to answer---but using the above information, you can probably use the following:

aggregate(.~Name, M, FUN=sum, na.rm=TRUE, na.action=NULL) #   Name Col1 Col2 # 1 name    1    2 
like image 124
A5C1D2H2I1M1N2O1R2T1 Avatar answered Oct 11 '22 17:10

A5C1D2H2I1M1N2O1R2T1


If you want the formula version to be equivalent try this:

M = data.frame( Name = rep('name',5), Col1 = c(NA,rep(1,4)) , Col2 = rep(1,5)) aggregate(. ~ Name, M, function(x) sum(x, na.rm=TRUE), na.action = na.pass) 
like image 23
Rorschach Avatar answered Oct 11 '22 15:10

Rorschach