Why is dplyr removing values not met by condition?

Tags: r, dplyr

I'm using dplyr to replace value with NA when a condition is met, but it also puts NA in rows where the condition is not met.

dput:

df <- structure(list(id = c("USC00231275", "USC00231275", "USC00231275", 
"USC00231275", "USC00231275", "USC00231275", "USC00231275", "USC00231275", 
"USC00231275", "USC00231275"), element = c("TMAX", "TMIN", "TMAX", 
"TMIN", "TMAX", "TMIN", "TMAX", "TMIN", "TMAX", "TMIN"), year = c(1937, 
1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937, 1937), month = c(5, 
5, 5, 5, 5, 5, 5, 5, 5, 5), day = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 
5), date = structure(c(-11933, -11933, -11932, -11932, -11931, 
-11931, -11930, -11930, -11929, -11929), class = "Date"), value = c(0, 
53.96, 68, 44.96, 62.06, 53.96, 73.04, 53.96, 69.08, 50)), .Names = c("id", 
"element", "year", "month", "day", "date", "value"), row.names = c(NA, 
10L), class = "data.frame")

data.frame (note: the condition is only met on rows 1 and 2)

            id element year month day       date value
1  USC00231275    TMAX 1937     5   1 1937-05-01  0.00
2  USC00231275    TMIN 1937     5   1 1937-05-01 53.96
3  USC00231275    TMAX 1937     5   2 1937-05-02 68.00
4  USC00231275    TMIN 1937     5   2 1937-05-02 44.96
5  USC00231275    TMAX 1937     5   3 1937-05-03 62.06
6  USC00231275    TMIN 1937     5   3 1937-05-03 53.96
7  USC00231275    TMAX 1937     5   4 1937-05-04 73.04
8  USC00231275    TMIN 1937     5   4 1937-05-04 53.96
9  USC00231275    TMAX 1937     5   5 1937-05-05 69.08
10 USC00231275    TMIN 1937     5   5 1937-05-05 50.00

dplyr

df %>%
  group_by(date) %>%
  mutate(
    value = if(value[element == 'TMIN'] >= value[element == 'TMAX'])
      as.numeric(NA) else value
  )

            id element  year month   day       date value
         (chr)   (chr) (dbl) (dbl) (dbl)     (date) (dbl)
1  USC00231275    TMAX  1937     5     1 1937-05-01    NA
2  USC00231275    TMIN  1937     5     1 1937-05-01    NA
3  USC00231275    TMAX  1937     5     2 1937-05-02 68.00
4  USC00231275    TMIN  1937     5     2 1937-05-02 44.96
5  USC00231275    TMAX  1937     5     3 1937-05-03    NA
6  USC00231275    TMIN  1937     5     3 1937-05-03    NA
7  USC00231275    TMAX  1937     5     4 1937-05-04 73.04
8  USC00231275    TMIN  1937     5     4 1937-05-04 53.96
9  USC00231275    TMAX  1937     5     5 1937-05-05 69.08
10 USC00231275    TMIN  1937     5     5 1937-05-05 50.00

Notice that the only rows that should change are 1 and 2, but dplyr also changed rows 5 and 6 even though the condition was not met there.

Vedda asked Dec 27 '15 23:12


1 Answer

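The code below should do what you are trying to do. It sets value to NA on the TMIN row of any date where TMIN >= TMAX and stores the result in new_value: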

library(dplyr)
df %>%
  group_by(date) %>%
  mutate(new_value = ifelse(
    (value[element == 'TMIN'] >= value[element == 'TMAX']) & element == 'TMIN',
    NA, value
  )) %>%
  ungroup()

As for whether this is a bug, I don't think it is. Looking at just the one date where TMIN >= TMAX, you get the following:

df %>%
  filter(date == '1937-05-01') %>%
  mutate(res = (value[element == 'TMIN'] >= value[element == 'TMAX'])) %>%
  mutate(new_value = ifelse( (res & element=='TMIN'), NA, value))

           id element year month day       date value  res new_value
1 USC00231275    TMAX 1937     5   1 1937-05-01  0.00 TRUE         0
2 USC00231275    TMIN 1937     5   1 1937-05-01 53.96 TRUE        NA
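
To check the same condition for every date at once, a summary per group can help. The following is only a sketch: it rebuilds a reduced copy of the sample data (just the element, date and value columns), assumes each date has exactly one TMIN and one TMAX row, and uses the illustrative names per_date, tmin, tmax and tmin_ge_tmax, none of which appear in the original post.

```r
library(dplyr)

# Reduced copy of the sample data from the question (assumed layout:
# one TMAX row followed by one TMIN row per date).
df <- data.frame(
  element = rep(c("TMAX", "TMIN"), times = 5),
  date    = rep(as.Date("1937-05-01") + 0:4, each = 2),
  value   = c(0, 53.96, 68, 44.96, 62.06, 53.96, 73.04, 53.96, 69.08, 50)
)

# One row per date: the two temperatures and the result of the comparison.
per_date <- df %>%
  group_by(date) %>%
  summarise(
    tmin = value[element == "TMIN"],
    tmax = value[element == "TMAX"],
    tmin_ge_tmax = tmin >= tmax
  )

print(per_date)
```

With this data the comparison is TRUE only for 1937-05-01, which matches the note in the question that the condition is met on rows 1 and 2 only.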

The expression value[element == 'TMIN'] >= value[element == 'TMAX'] returns a single logical value per group, which is recycled to every row of that group. For this date it is TRUE on both rows, as the res column shows. The code below breaks this down a bit to hopefully clarify.

### Just looking at one date
> df2 <- df %>% filter(date == '1937-05-01')
> df2
           id element year month day       date value
1 USC00231275    TMAX 1937     5   1 1937-05-01  0.00
2 USC00231275    TMIN 1937     5   1 1937-05-01 53.96

### These are the two values being compared (TMIN first, then TMAX).
### Their comparison gives a single TRUE/FALSE that is recycled to every
### row in the group, so it is all TRUE or all FALSE within a group.
> c(df2$value[df2$element == 'TMIN'], df2$value[df2$element == 'TMAX'])
[1] 53.96  0.00

Since there is only one comparison for the entire group, every row in the group sees the same result: all TRUE or all FALSE.

The code that gives the correct result works around this by combining the group-level comparison with the row-level test element == 'TMIN', so only the TMIN row is replaced.
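
The recycling can be seen without dplyr at all. This is a minimal base R sketch for the single date 1937-05-01; the variable names cond, mask and new_value are illustrative only.

```r
# The two rows for 1937-05-01, as plain vectors.
value   <- c(0, 53.96)
element <- c("TMAX", "TMIN")

# Comparing one TMIN value with one TMAX value gives a length-1 logical.
cond <- value[element == "TMIN"] >= value[element == "TMAX"]
print(length(cond))

# `&` recycles the length-1 cond against the length-2 row-level test,
# so the mask is TRUE only where the row itself is a TMIN row.
mask <- cond & element == "TMIN"
print(mask)

# ifelse() works element by element, so only the masked row becomes NA.
new_value <- ifelse(mask, NA, value)
print(new_value)
```

By contrast, if() looks at a single TRUE/FALSE and returns one branch for the whole group, which is why the version in the question cannot treat the TMIN and TMAX rows differently.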

One possible final solution could be:

df %>%
  group_by(date) %>%
  mutate(value = ifelse(
    (value[element == 'TMIN'] >= value[element == 'TMAX']) & element == 'TMIN',
    NA, value
  )) %>%
  ungroup()
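
The question says that both rows for 1937-05-01 should change, whereas the ifelse() call above replaces only the TMIN row. If both rows of a failing date should become NA, one option is to drop the element == 'TMIN' test and let the group-level result apply to every row. This is a sketch and not part of the original answer: it uses a reduced two-date copy of the data, assumes exactly one TMIN and one TMAX row per date, and introduces the helper column tmin_ge_tmax and the result name res purely for illustration.

```r
library(dplyr)

# Reduced two-date copy of the sample data from the question.
df <- data.frame(
  element = c("TMAX", "TMIN", "TMAX", "TMIN"),
  date    = as.Date(c("1937-05-01", "1937-05-01", "1937-05-02", "1937-05-02")),
  value   = c(0, 53.96, 68, 44.96)
)

res <- df %>%
  group_by(date) %>%
  mutate(
    # Length-1 result per date, recycled by mutate() to every row of the group.
    tmin_ge_tmax = value[element == "TMIN"] >= value[element == "TMAX"],
    # if_else() is vectorised and type-strict, hence NA_real_ rather than NA.
    value = if_else(tmin_ge_tmax, NA_real_, value)
  ) %>%
  ungroup()

print(res)
```

Here both rows for 1937-05-01 end up NA and the rows for 1937-05-02 keep their values.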
steveb answered Nov 05 '22 13:11