When should I use "which" for subsetting?

Tags:

r

Here is a toy example.

library(dplyr)

iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length)])

# A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4  
2 versicolor   3.2
3 virginica    3.8

It gives the same output when using which().

iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[which(Sepal.Length == max(Sepal.Length))])
# summarise(max = Sepal.Width[which.max(Sepal.Length)])

# A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4  
2 versicolor   3.2
3 virginica    3.8

help(which) says:

Give the TRUE indices of a logical object, allowing for array indices.

== seems to do the same job: it returns a logical vector of TRUE and FALSE that can be used for indexing directly.

So when is which() useful for subsetting?


asked Aug 19 '18 03:08

3 Answers

When "==" can return NA. Compare (1:2)[which(c(TRUE, NA))] with (1:2)[c(TRUE, NA)].

If the NA is not removed, indexing by NA gives NA (see ?Extract). However, the removal cannot be done with na.omit: that shortens the logical vector, so the remaining TRUEs may no longer line up with the positions they came from. A safe way is to replace NA by FALSE and then index. But why not just use which()?
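To make the options above concrete, here is a minimal base R sketch. The vectors x and keep are made-up illustration data, not something from the question.

```r
# Hypothetical toy data: the only element we actually want is 20
x    <- c(10, 20, 30)
keep <- c(NA, TRUE, FALSE)

# Indexing with the raw logical vector: the NA index produces an NA element
na_idx <- x[keep]

# which() returns only the positions of TRUE, so the NA is dropped
which_idx <- x[which(keep)]

# na.omit() shortens the index to c(TRUE, FALSE); it is then recycled
# against x, so the TRUE no longer points at the second element
omit_idx <- x[as.vector(na.omit(keep))]

# Replacing NA by FALSE keeps the index aligned with x
false_idx <- x[replace(keep, is.na(keep), FALSE)]

print(na_idx)
print(which_idx)
print(omit_idx)
print(false_idx)
```

Only the which() version and the NA-replaced-by-FALSE version pick out the intended element; the na.omit() version silently returns the wrong ones.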


answered Nov 15 '22 02:11

Zheyuan Li

Since this question is specifically about subsetting, I thought I would illustrate some of the performance benefits of using which() over a logical subset brought up in the linked question.

When you want to extract the entire subset, there is not much difference in processing speed, but which() allocates less memory. However, if you only want part of the subset (e.g. to showcase some strange findings), which() has a significant speed and memory advantage: you can subset the result of which() first and then subset the data frame once, instead of subsetting the data frame twice.

Here are the benchmarks:

df <- ggplot2::diamonds; dim(df)
#> [1] 53940    10
mu <- mean(df$price)

bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#>       n
#> 1 19657
#> 2    10
#> # A tibble: 6 x 11
#>   expression     n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 logical    19657    1.5ms   1.81ms   1.71ms   3.39ms      553.     5.5MB
#> 2 which_1    19657   1.41ms   1.61ms   1.56ms   2.41ms      620.    2.89MB
#> 3 which_2    19657 826.56us 934.72us 910.88us   1.41ms     1070.    1.76MB
#> 4 logical       10 893.12us   1.06ms   1.02ms   1.93ms      941.    4.21MB
#> 5 which_1       10  814.4us 944.81us 908.16us   1.78ms     1058.    1.69MB
#> 6 which_2       10 230.72us 264.45us 249.28us   1.08ms     3781.  498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>

Created on 2018-08-19 by the reprex package (v0.2.0).
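If you want to convince yourself that the faster form returns the same rows, here is a small sketch. It assumes the built-in mtcars data as a stand-in for diamonds, and the names cond, i, twice and once are only for illustration.

```r
# Assumption: mtcars stands in for ggplot2::diamonds so no packages are needed
cond <- mtcars$mpg > mean(mtcars$mpg)
i    <- seq_len(3)  # we only want the first three matching rows

# Builds the full subset first, then trims it (two data frame subsets)
twice <- mtcars[cond, ][i, ]

# Trims the integer index first, then subsets the data frame once
once <- mtcars[which(cond)[i], ]

same_rows <- isTRUE(all.equal(twice, once))
print(same_rows)
```

The second form only ever touches a short integer vector before the single data frame subset, which is where the savings in the benchmark come from.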

answered Nov 15 '22 02:11

Mikko Marttila

which() removes the NA elements. If we need the same behavior as which() when there are NAs, use another condition along with ==:

iris %>% 
  group_by(Species) %>% 
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length, na.rm = TRUE) & 
                                   !is.na(Sepal.Length)])
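Since iris has no missing values, the effect is easier to see on a small made-up example. The vectors len and wid below are hypothetical and just mimic one group with an NA in it.

```r
# Hypothetical group with one missing length
len <- c(5.1, NA, 5.8, 5.8)
wid <- c(3.5, 3.0, 4.0, 2.7)

top <- max(len, na.rm = TRUE)

# Plain ==: the comparison with NA is NA, which leaks into the result
plain <- wid[len == top]

# == combined with !is.na(): NA & FALSE is FALSE, so the NA is excluded
guarded <- wid[len == top & !is.na(len)]

# which(): same rows as the guarded version
via_which <- wid[which(len == top)]

print(plain)
print(guarded)
print(via_which)
```

Both the guarded comparison and which() return every tied maximum here, unlike which.max(), which returns only the first.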

answered Nov 15 '22 02:11

akrun

Question asked by Wooheon
