Here's a simple data frame with a missing value:
M = data.frame( Name = c('name', 'name'), Col1 = c(NA, 1) , Col2 = c(1, 1)) # Name Col1 Col2 # 1 name NA 1 # 2 name 1 1
When I use aggregate
to sum
variables by group ('Name') using the formula
method:
aggregate(. ~ Name, M, FUN = sum, na.rm = TRUE)
the result is:
# RowName Col1 Col2 # name 1 1
So the entire first row, which have an NA
, is ignored. But if use the "non-formula
" specification:
aggregate(M[, 2:3], by = list(M$Name), FUN = sum, na.rm = TRUE)
the result is:
# Group.1 Col1 Col2 # name 1 2
Here only the (1,1) entry is ignored.
This caused a major debugging headache in one of my codes, since I thought these two calls were equivalent. Is there a good reason why the formula
entry method is treated differently?
Thanks.
Using replace_with_na() works well when we know the exact value to be replaced, and for which variables we want to replace, providing there are not many variables.
When dealing with missing data, data scientists can use two primary methods to solve the error: imputation or the removal of data. The imputation method develops reasonable guesses for missing data. It's most useful when the percentage of missing data is low.
nan() Function for Finding Missing values: A logical vector is returned by this function that indicates all the NaN values present. It returns a Boolean value.
Good question, but in my opinion, this shouldn't have caused a major debugging headache because it is documented quite clearly in multiple places in the manual page for aggregate
.
First, in the usage section:
## S3 method for class 'formula' aggregate(formula, data, FUN, ..., subset, na.action = na.omit)
Later, in the description:
na.action
: a function which indicates what should happen when the data contain NA values. The default is to ignore missing values in the given variables.
I can't answer why the formula mode was written differently---that's something the function authors would have to answer---but using the above information, you can probably use the following:
aggregate(.~Name, M, FUN=sum, na.rm=TRUE, na.action=NULL) # Name Col1 Col2 # 1 name 1 2
If you want the formula version to be equivalent try this:
M = data.frame( Name = rep('name',5), Col1 = c(NA,rep(1,4)) , Col2 = rep(1,5)) aggregate(. ~ Name, M, function(x) sum(x, na.rm=TRUE), na.action = na.pass)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With