Consider the following data frame:
df <- data.frame(replicate(5,sample(1:10,10,rep=TRUE)))
# X1 X2 X3 X4 X5
#1 7 9 8 4 10
#2 2 4 9 4 9
#3 2 7 8 8 6
#4 8 9 6 6 4
#5 5 2 1 4 6
#6 8 2 2 1 7
#7 3 8 6 1 6
#8 3 8 5 9 8
#9 6 2 3 10 7
#10 2 7 4 2 9
Using dplyr, how can I filter on each column (without explicitly naming them) for all values greater than or equal to 2? Something that would mimic a hypothetical filter_each(funs(. >= 2)).
Right now I'm doing:
df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)
Which is equivalent to:
df %>% filter(!rowSums(. < 2))
Note: Let's say I wanted to filter only on the first 4 columns, I would do:
df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2)
or
df %>% filter(!rowSums(.[-5] < 2))
Would there be a more efficient alternative?
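As an aside, newer versions of dplyr (1.0.4 and later) provide if_all(), which covers this use case directly. A minimal sketch, assuming such a version is available:
library(dplyr)
# keep only rows where every column is >= 2
df %>% filter(if_all(everything(), ~ .x >= 2))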
Edit: sub-question
How do I specify a column name to exclude and mimic a hypothetical filter_each(funs(. >= 2), -X5)?
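With the same if_all() approach (again assuming dplyr >= 1.0.4), the exclusion case is a tidyselect one-liner:
# filter on every column except X5
df %>% filter(if_all(-X5, ~ .x >= 2))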
Benchmark sub-question
Since I have to run this on a large dataset, I benchmarked the suggestions.
df <- data.frame(replicate(5, sample(1:10, 10e6, rep = TRUE)))
library(dplyr)
library(microbenchmark)
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
times = 50
)
Here are the results:
#Unit: milliseconds
# expr min lq mean median uq max neval
# Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458 50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669 50
# Docendo 874.0247 933.1399 983.5435 985.3697 1026.901 1053.407 50
Here's an idea that makes it fairly simple to choose the names. You can set up a list of calls to send to the .dots argument of filter_(). First, a function that creates an unevaluated call:
Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)
Now we use filter_(), passing a list of calls into the .dots argument using lapply(), choosing any name and value you want.
nm <- names(df) != "X5"
filter_(df, .dots = lapply(names(df)[nm], Call, 2L))
# X1 X2 X3 X4 X5
# 1 6 5 7 3 1
# 2 8 10 3 6 5
# 3 5 7 10 2 5
# 4 3 4 2 9 9
# 5 8 3 5 6 2
# 6 9 3 4 10 9
# 7 2 9 7 9 8
You can have a look at the unevaluated calls created by Call(), for example for X4 and X5, with:
lapply(names(df)[4:5], Call, 2L)
# [[1]]
# X4 >= 2L
#
# [[2]]
# X5 >= 2L
So if you adjust the names() subset in the X argument of lapply(), you should be fine.
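For instance, to reproduce the question's case of filtering only on the first four columns (a sketch reusing Call() from above):
# constrain X1..X4 only; X5 is left unrestricted
filter_(df, .dots = lapply(names(df)[1:4], Call, 2L))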
How do I specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?
It might not be the most elegant solution, but it gets the job done:
df %>% filter(!rowSums(.[, !colnames(.) %in% 'X5', drop = FALSE] < 2))
In case of several excluded columns (e.g. X3 and X5), one can use:
df %>% filter(!rowSums(.[, !colnames(.) %in% c('X3', 'X5'), drop = FALSE] < 2))
Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector to slice, whereas filter takes a logical vector.
df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))
What I like about this approach is that, because we use select inside rowSums, you can make use of all the special helper functions that select supplies, like matches, for example.
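For instance, excluding several columns by name pattern is just a regular expression away (a sketch; matches() takes a regex):
# drop both X3 and X5 from the row-wise check
df %>% slice(which(!rowSums(select(., -matches("X3|X5")) < 2L)))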
Let's see how it compares to the other answers:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
times = 50L,
unit = "relative"
)
#Unit: relative
# expr min lq median uq max neval
# Marat 1.304216 1.290695 1.290127 1.288473 1.290609 50
# Richard 1.139796 1.146942 1.124295 1.159715 1.160689 50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000 50
Edit note: updated with a more reliable benchmark using 50 repetitions (times = 50L).
Following a comment that base R would be as fast as the slice approach (without specifying exactly which base R approach was meant), I decided to update my answer with a comparison to base R, using almost the same logic as in my answer. For base R I used:
base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]
Benchmark:
df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))
mbm <- microbenchmark(
Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ],
times = 50L,
unit = "relative"
)
#Unit: relative
# expr min lq median uq max neval
# Marat 1.265692 1.279057 1.298513 1.279167 1.203794 50
# Richard 1.124045 1.160075 1.163240 1.169573 1.076267 50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000 50
# base 2.784058 2.769062 2.710305 2.669699 2.576825 50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090 50
So neither base R approach delivers better, or even comparable, performance here.
Edit note #2: added benchmarks with the base R options.