
Dplyr produces NaN while base R produces NA

Tags: r, nan, na, dplyr

Consider the following toy data and computations:

library(dplyr)

df <-  tibble(x = 1)

stats::sd(df$x)

dplyr::summarise(df, sd_x = sd(x))

The first calculation returns NA, whereas the second, where the same calculation is done inside the dplyr function summarise, returns NaN. I would expect both calculations to give the same result. Why do they differ?

ricke asked Dec 14 '17 13:12




1 Answer

summarise is calling a different function. I'm not sure which function it is, but it is not stats::sd. Calling stats::sd explicitly gives NA:

dplyr::summarise(df, sd_x = stats::sd(x))
# A tibble: 1 x 1
   sd_x
  <dbl>
1    NA

debugonce(sd) # debug to see when sd is called

sd is not called here (the debugger is never entered):

dplyr::summarise(df, sd_x = sd(x))
# A tibble: 1 x 1
   sd_x
  <dbl>
1   NaN

But it is called here:

dplyr::summarise(df, sd_x = stats::sd(x))
debugging in: stats::sd(1)
debug: sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
    na.rm = na.rm))
...
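
As a complementary check, the following minimal sketch looks at whether sd has been masked at the R level by an attached package. It uses only base R and utils calls; the variable names are illustrative. Run it in the same session after library(dplyr). If only package:stats is listed, then the difference most likely arises inside summarise itself rather than from ordinary masking.

```r
# Where on the search path is a function called `sd` found?
sd_locations <- utils::find("sd", mode = "function")
sd_locations

# Which namespace does the `sd` visible at top level belong to?
sd_namespace <- environmentName(environment(sd))
sd_namespace
```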

Update

It appears that sd inside summarise is not evaluated by R's stats::sd at all but by dplyr's own compiled code, as hinted at by this header file: https://github.com/tidyverse/dplyr/blob/master/inst/include/dplyr/Result/Sd.h

dplyr seems to reimplement a number of functions in this way. Given that var gives the same result in both cases, I think the sd behaviour is a bug.
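
To illustrate how a reimplementation could end up with NaN where base R gives NA, here is a minimal sketch. naive_sd is a hypothetical helper written for this example and is an assumption, not dplyr's actual code: it computes the sample standard deviation straight from the definition, so a single observation leads to 0 / 0. Whether dplyr's compiled sd reaches NaN by this route has not been confirmed here.

```r
# Hypothetical helper, NOT dplyr's actual code: a textbook sample SD
# computed directly from the definition, with no special case for n = 1.
naive_sd <- function(x) {
  n <- length(x)
  sqrt(sum((x - mean(x))^2) / (n - 1))  # n = 1 gives 0 / 0
}

x <- 1
base_result  <- stats::sd(x)  # base R returns a missing value (NA) for n = 1
naive_result <- naive_sd(x)   # 0 / 0 is NaN, and sqrt(NaN) is NaN

# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN
is.na(base_result)
is.nan(base_result)
is.na(naive_result)
is.nan(naive_result)
```

Since is.na() is TRUE for both values, code that only checks for missingness treats the two results alike; is.nan() is what tells them apart.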

James answered Nov 07 '22 11:11