Using the diamonds
dataset (from the ggplot2
library) as an example, I am trying to subset this table by columns and rows based on a vector of named elements (the names of the vector should be used to subset by columns and the corresponding vector elements by rows).
library(ggplot2)
diamonds
# A tibble: 53,940 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# … with 53,930 more rows
myVector <- c(cut="Ideal", cut="Good", color="E", color="J")
myVector
cut cut color color
"Ideal" "Good" "E" "J"
What I intend to do, would be something like follows but using myVector
:
library(dplyr)
diamonds %>% subset(., (cut=="Ideal" | cut=="Good") & (color=="E" | color=="J")) %>%
select(cut, color)
The way you tell R that you want to select some particular elements (i.e., a 'subset') from a vector is by placing an 'index vector' in square brackets immediately following the name of the vector. For a simple example, try x[1:10] to view the first ten elements of x.
How to subset the data frame (DataFrame) by column value and name in R? By using R base df[] notation, or subset() you can easily subset the R Data Frame (data. frame) by column value or by column name.
Starting with the split
idea of ThomasIsCoding, slightly changed, here is a base R solution based on having Reduce/Map
created a logical index.
v <- split(unname(myVector), names(myVector))
i <- Reduce('&', Map(function(x, y){x %in% y}, diamonds[names(v)], v))
diamonds[i, ]
## A tibble: 6,039 x 10
# carat cut color clarity depth table price x y z
# <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
# 2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
# 3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
# 4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
# 5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
# 6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
# 7 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
# 8 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
# 9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
#10 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
## ... with 6,029 more rows
Package dplyr
The code above can be written as a function and used in dplyr::filter
.
# Input:
# X - a data set to be filtered
# values - a named list
values_in <- function(X, values){
v <- split(unname(values), names(values))
i <- Reduce('&', Map(function(x, y){x %in% y}, X[names(v)], v))
i
}
diamonds %>% filter( values_in(., myVector) )
The output is the same as above and, therefore, omited.
I am not sure if you want something like below
u <- split(myVector,names(myVector))
eval(str2expression(sprintf("diamonds %%>%% filter(%s)",paste0(sapply(names(u),function(x) paste0(x," %in% u$",x)),collapse = " & "))))
such that
> eval(str2expression(sprintf("diamonds %%>%% filter(%s)",paste0(sapply(names(u),function(x) paste0(x," %in% u$",x)),collapse = " & "))))
# A tibble: 6,039 x 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
3 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
4 0.3 Good J SI1 64 55 339 4.25 4.28 2.73
5 0.23 Ideal J VS1 62.8 56 340 3.93 3.9 2.46
6 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
7 0.3 Good J SI1 63.4 54 351 4.23 4.29 2.7
8 0.3 Good J SI1 63.8 56 351 4.23 4.26 2.71
9 0.23 Good E VS1 64.1 59 402 3.83 3.85 2.46
10 0.33 Ideal J SI1 61.1 56 403 4.49 4.55 2.76
# ... with 6,029 more rows
Using both approaches proposed by @Roman (generating all combinations of vector element and joining) and @ThomaslsCoding (splitting the vector) seems to do the trick:
data.frame(split(myVector, names(myVector))) %>%
expand.grid() %>%
inner_join(diamonds[,unique(names(myVector))])
you can try
my_vec_cut = myVector[names(myVector) == "cut"]
my_vec_color = myVector[names(myVector) == "color"]
I splitted the vector in two since you filter for two columns using and
and or
diamonds %>%
filter(.data[[unique(names(my_vec_cut))]] %in% my_vec_cut & .data[[unique(names(my_vec_color))]] %in% my_vec_color)
A general way would be a joining approach. First you build all required combinations from your vector, then you left join the data.
library(tidyverse)
tibble(a=names(myVector), b=myVector) %>%
group_by(a) %>%
mutate(n=1:n()) %>%
pivot_wider(names_from = a, values_from=b) %>%
select(-n) %>%
complete(cut, color)
# A tibble: 4 x 2
cut color
<chr> <chr>
1 Good E
2 Good J
3 Ideal E
4 Ideal J
# now left_joining:
tibble(a=names(myVector), b=myVector) %>%
group_by(a) %>%
mutate(n=1:n()) %>%
pivot_wider(names_from = a, values_from=b) %>%
select(-n) %>%
complete(cut, color) %>%
left_join(diamonds)
count(cut, color)
Similar idea to @ThomasIsCoding's, just in base R.
al <- split(myVector, names(myVector))
res <- with(diamonds, diamonds[eval(parse(text=paste(sapply(names(al), function(x)
paste0(x, " %in% ", "al[['", x, "']]")), collapse=" & "))), ])
unique(res$cut)
# [1] Ideal Good
# Levels: Fair < Good < Very Good < Premium < Ideal
unique(res$color)
# [1] E J
# Levels: D < E < F < G < H < I < J
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With