It is a toy example.
library(dplyr)

iris %>%
  group_by(Species) %>%
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length)])
# A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4
2 versicolor   3.2
3 virginica    3.8
It gives the same output when using which():
iris %>%
  group_by(Species) %>%
  summarise(max = Sepal.Width[which(Sepal.Length == max(Sepal.Length))])
  # summarise(max = Sepal.Width[which.max(Sepal.Length)])
# A tibble: 3 x 2
  Species      max
  <fct>      <dbl>
1 setosa       4
2 versicolor   3.2
3 virginica    3.8
help(which) says:

Give the TRUE indices of a logical object, allowing for array indices.

== seems to do the same thing: it shows TRUE and FALSE. So when is which() useful for subsetting?
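For context, here is a minimal sketch on a made-up vector (x is hypothetical, not taken from the question) showing what each form returns. == yields a logical vector, which() turns that into integer positions, and which.max() returns only the first position of the maximum, so it can differ from the other two when there are ties.

```r
x <- c(3, 7, 7, 1)

is_max <- x == max(x)    # logical vector, one value per element of x
pos    <- which(is_max)  # integer positions of the TRUE elements
first  <- which.max(x)   # position of the first maximum only

x[is_max]  # both tied maxima
x[pos]     # the same elements, selected by position
x[first]   # a single element, even when there are ties
```

Without NA values, the logical index and the which() index select the same elements; the difference only shows up once NA enters the comparison.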
It is useful when "==" ends up with NA. Try (1:2)[which(c(TRUE, NA))] vs. (1:2)[c(TRUE, NA)].
If NA is not removed, indexing by NA gives NA (see ?Extract). However, this removal cannot be done with na.omit, because dropping elements shifts the remaining ones, so the positions of TRUE may end up wrong. A safe way is to replace NA by FALSE and then do the indexing. But why not just use which?
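The points above can be sketched on a small made-up example (x and cond are hypothetical names, not from the original post). It compares indexing with the raw logical vector, with the NA dropped via na.omit, with NA replaced by FALSE, and with which().

```r
x    <- c(10, 20, 30, 40)
cond <- c(NA, TRUE, FALSE, TRUE)

# Indexing by a logical NA produces an NA in the result.
raw <- x[cond]

# Dropping the NA shortens the index to length 3, so every remaining
# TRUE/FALSE shifts one position to the left (and the index is recycled).
# as.vector() only strips the attributes that na.omit() attaches.
shifted <- x[as.vector(na.omit(cond))]

# Replacing NA by FALSE keeps every position where it belongs.
cond_safe <- cond
cond_safe[is.na(cond_safe)] <- FALSE
safe <- x[cond_safe]

# which() ignores NA and returns only the positions of TRUE.
via_which <- x[which(cond)]
```

Here safe and via_which select the elements at positions 2 and 4, while shifted picks elements that were never flagged TRUE.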
Since this question is specifically about subsetting, I thought I would illustrate some of the performance benefits of using which() over a logical subset, as brought up in the linked question.

When you want to extract the entire subset, there is not much difference in processing speed, but using which() needs to allocate less memory. However, if you only want a part of the subset (e.g. to showcase some strange findings), which() has a significant speed and memory advantage, because you can subset the result of which() instead of subsetting the data frame twice.
Here are the benchmarks:
df <- ggplot2::diamonds; dim(df)
#> [1] 53940    10
mu <- mean(df$price)

bench::press(
  n = c(sum(df$price > mu), 10),
  {
    i <- seq_len(n)
    bench::mark(
      logical = df[df$price > mu, ][i, ],
      which_1 = df[which(df$price > mu), ][i, ],
      which_2 = df[which(df$price > mu)[i], ]
    )
  }
)
#> Running with:
#>       n
#> 1 19657
#> 2    10
#> # A tibble: 6 x 11
#>   expression     n      min     mean   median      max `itr/sec` mem_alloc
#>   <chr>      <dbl> <bch:tm> <bch:tm> <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 logical    19657    1.5ms   1.81ms   1.71ms   3.39ms      553.     5.5MB
#> 2 which_1    19657   1.41ms   1.61ms   1.56ms   2.41ms      620.    2.89MB
#> 3 which_2    19657 826.56us 934.72us 910.88us   1.41ms     1070.    1.76MB
#> 4 logical       10 893.12us   1.06ms   1.02ms   1.93ms      941.    4.21MB
#> 5 which_1       10  814.4us 944.81us 908.16us   1.78ms     1058.    1.69MB
#> 6 which_2       10 230.72us 264.45us 249.28us   1.08ms     3781.  498.34KB
#> # ... with 3 more variables: n_gc <dbl>, n_itr <int>, total_time <bch:tm>
Created on 2018-08-19 by the reprex package (v0.2.0).
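To make the three benchmarked expressions easier to compare, here is a small base-R sketch using the built-in mtcars data instead of diamonds (that swap, and the names res_logical, res_which_1, and res_which_2, are my own assumptions for illustration). It only checks that the three forms return the same rows when the condition contains no NA; it does not measure speed.

```r
df <- mtcars
mu <- mean(df$mpg)
i  <- seq_len(3)  # keep only the first three matching rows

# Subset the data frame by the logical vector, then subset the result again.
res_logical <- df[df$mpg > mu, ][i, ]

# Same idea, but with integer positions from which().
res_which_1 <- df[which(df$mpg > mu), ][i, ]

# Subset the positions first, so the data frame is indexed only once.
res_which_2 <- df[which(df$mpg > mu)[i], ]

all.equal(res_logical, res_which_1)
all.equal(res_logical, res_which_2)
```

The third form is the one that avoids building the full intermediate data frame, which is where the savings in the benchmark above come from.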
which removes the NA elements. If we need the same behavior as which when there are NA values, use another condition along with ==:
iris %>%
  group_by(Species) %>%
  summarise(max = Sepal.Width[Sepal.Length == max(Sepal.Length, na.rm = TRUE) &
                                !is.na(Sepal.Length)])
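The same idiom can be checked outside of dplyr on two short made-up vectors (len and wid are hypothetical stand-ins for Sepal.Length and Sepal.Width). This is only a sketch of the comparison, assuming a single NA in the length column.

```r
len <- c(5.1, NA, 5.8, 5.8)
wid <- c(3.5, 3.0, 4.0, 2.7)
m   <- max(len, na.rm = TRUE)

# The plain comparison is NA where len is NA, so the result contains an NA.
with_na <- wid[len == m]

# Adding !is.na(len) turns that NA into FALSE (NA & FALSE is FALSE).
guarded <- wid[len == m & !is.na(len)]

# which() drops the NA position on its own.
via_which <- wid[which(len == m)]
```

With these inputs, guarded and via_which hold the same two values, while with_na carries an extra leading NA.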