
R max function returns pseudo values when used within 'dplyr'

Tags: r, max, dplyr

I used R's max function together with summarise from the dplyr package and made a typo in the na.rm argument: I wrote ns.rm = T by mistake. The script ran without any warning and returned wrong values. When I make the same substitution (ns.rm in place of na.rm) on a plain vector outside dplyr, max returns the correct value; if the vector contains NA, it returns NA, again with no warning about an unknown argument.

Here is an example:

if(!require('magrittr')) install.packages('magrittr')
if(!require('dplyr'))    install.packages('dplyr')

tab <- data.frame("grp1" = sort(rep(1:4, 5)), 
                  "grp2" = rep(c(1:2), 10),
                  "val" = rnorm(20))


tab

   grp1 grp2         val
1     1    1  0.03536351
2     1    2  1.04237251
3     1    1  0.82735937
4     1    2  0.29040424
5     1    1  0.30194926
6     2    2 -0.96649026
7     2    1 -0.97388257
8     2    2 -0.13111541
9     2    1 -0.48337864
10    2    2 -0.73471857
11    3    1 -0.88536656
12    3    2 -1.30442575
13    3    1  1.18816751
14    3    2 -0.90334058
15    3    1 -0.53102641
16    4    2 -0.69266762
17    4    1 -0.64776312
18    4    2  0.01354644
19    4    1  0.78058285
20    4    2 -0.06647959
### Using max function within dplyr
## Right way

tab %>% 
  group_by(grp1, grp2) %>% 
  summarise("max_val" = max(val, na.rm = T))

# A tibble: 8 x 3
# Groups:   grp1 [4]
   grp1  grp2 max_val
  <int> <int>   <dbl>
1     1     1  0.827
2     1     2  1.04
3     2     1 -0.483
4     2     2 -0.131
5     3     1  1.19
6     3     2 -0.903
7     4     1  0.781
8     4     2  0.0135
## with a typo in na.rm argument    

tab %>% 
   group_by(grp1, grp2) %>% 
   summarise("max_val" = max(val, ns.rm = T))



# A tibble: 8 x 3
# Groups:   grp1 [4]
   grp1  grp2 max_val
  <int> <int>   <dbl>
1     1     1    1   
2     1     2    1.04
3     2     1    1   
4     2     2    1   
5     3     1    1.19
6     3     2    1   
7     4     1    1   
8     4     2    1  
### Using max function on a vector 

max(c(1, 2, 3), ns.rm = T)
[1] 3
max(c(1, 2, 3), na.rm = T)
[1] 3
max(c(1, 2, 3, NA), ns.rm = T)
[1] NA
max(c(1, 2, 3, NA), na.rm = T)
[1] 3

Does anybody know whether ns.rm is a legitimate argument of any R function? If not, why is there no warning that an unrecognized argument was passed?

Tal Kozlovski asked Jan 01 '20 09:01



1 Answer

No, ns.rm is not a legitimate argument. max is defined as max(..., na.rm = FALSE), so any named argument other than na.rm is absorbed into ... and treated as one more value to compare. ns.rm = T is therefore just another input to max, with T (TRUE) coerced to 1.

max(c(1, 2, 3), ns.rm = T)
#[1] 3

is actually interpreted as

max(c(1, 2, 3), 1)
#[1] 3

and

max(c(0.1, 0.2, 0.33), ns.rm = T)
#[1] 1

is interpreted as

max(c(0.1, 0.2, 0.33), 1)
#[1] 1

and

max(c(1, 2, 3, NA), ns.rm = T)
#[1] NA

is actually

max(c(1, 2, 3, NA), 1)
#[1] NA
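
One way to check this yourself is to look at the signature of max and at how ... captures named arguments. The sketch below assumes nothing beyond base R; show_dots is a hypothetical helper written only for illustration, with the same argument layout as max.

```r
# max() takes its values through `...`; only `na.rm` is matched by name
args(max)

# Hypothetical helper with the same argument layout as max()
show_dots <- function(..., na.rm = FALSE) {
  list(dots = list(...), na.rm = na.rm)
}

# The misspelled argument lands in `...`, and na.rm keeps its default
res <- show_dots(c(0.1, 0.2, 0.33), ns.rm = TRUE)
str(res)

# max() ignores the name; TRUE is coerced to 1 and compared with the rest
identical(max(c(0.1, 0.2, 0.33), ns.rm = TRUE),
          max(c(0.1, 0.2, 0.33), 1))
```

Because the typo is a valid way to pass an extra value, R has no reason to warn about it.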

The same thing happens with the data frame:

set.seed(123)
tab <- data.frame(grp1 = sort(rep(1:4, 5)), 
                  grp2 = rep(c(1:2), 10),
                  val = rnorm(20))

With the correct argument, na.rm = T, we get:

library(dplyr)
tab %>%  group_by(grp1, grp2) %>%  summarise(max_val = max(val, na.rm = T))

#   grp1  grp2 max_val
#  <int> <int>   <dbl>
#1     1     1  1.56  
#2     1     2  0.0705
#3     2     1  0.461 
#4     2     2  1.72  
#5     3     1  1.22  
#6     3     2  0.360 
#7     4     1  0.701 
#8     4     2  1.79  

Now if we use ns.rm = T

tab %>%  group_by(grp1, grp2) %>% summarise(max_val = max(val, ns.rm = T))

#   grp1  grp2 max_val
#  <int> <int>   <dbl>
#1     1     1    1.56
#2     1     2    1   
#3     2     1    1   
#4     2     2    1.72
#5     3     1    1.22
#6     3     2    1   
#7     4     1    1   
#8     4     2    1.79

Notice that every group whose max_val was less than 1 now shows 1 when ns.rm is used, because T is passed to max as an extra value equal to 1.

Also note that this is not limited to ns.rm; any argument name that max does not recognize behaves the same way.

max(c(0.1, 0.2, 0.33), a = T)
#[1] 1
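
If you want such typos to fail loudly, one option is a small wrapper that rejects any named argument in ... before calling max. This is only a sketch: strict_max is a hypothetical name, not part of base R or dplyr, and it assumes you never intend to pass named values to max.

```r
# Hypothetical wrapper: stops when `...` contains a named argument,
# which for max() almost always means a misspelled na.rm
strict_max <- function(..., na.rm = FALSE) {
  nms <- names(list(...))   # NULL when nothing in `...` is named
  bad <- nms[nzchar(nms)]   # keep only the non-empty names
  if (length(bad) > 0) {
    stop("unexpected named argument(s): ", paste(bad, collapse = ", "))
  }
  max(..., na.rm = na.rm)
}

# Correct spelling works as usual
ok <- strict_max(c(1, 2, 3, NA), na.rm = TRUE)

# The typo now raises an error instead of silently adding 1 to the input
err <- try(strict_max(c(0.1, 0.2, 0.33), ns.rm = TRUE), silent = TRUE)
```

The same wrapper can be used inside summarise in place of max.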
Ronak Shah answered Nov 17 '22 21:11