
Filter each column of a data.frame based on a specific value

Tags:

r

dplyr

Consider the following data frame:

df <- data.frame(replicate(5,sample(1:10,10,rep=TRUE)))

#   X1 X2 X3 X4 X5
#1   7  9  8  4 10
#2   2  4  9  4  9
#3   2  7  8  8  6
#4   8  9  6  6  4
#5   5  2  1  4  6
#6   8  2  2  1  7
#7   3  8  6  1  6
#8   3  8  5  9  8
#9   6  2  3 10  7
#10  2  7  4  2  9

Using dplyr, how can I filter on each column (without explicitly naming them) for all values greater than or equal to 2?

Something that would mimic a hypothetical filter_each(funs(. >= 2))

Right now I'm doing:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2, X5 >= 2)

Which is equivalent to:

df %>% filter(!rowSums(. < 2))

Note: Let's say I wanted to filter only on the first 4 columns, I would do:

df %>% filter(X1 >= 2, X2 >= 2, X3 >= 2, X4 >= 2)

or

df %>% filter(!rowSums(.[-5] < 2))

Is there a more efficient alternative?

Edit: sub question

How can I specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

Benchmark sub question

Since I have to run this on a large dataset, I benchmarked the suggestions.

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
    Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
    Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
    Docendo = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
    times = 50
)

Here are the results:

#Unit: milliseconds
#    expr       min        lq      mean    median       uq      max neval
#   Marat 1209.1235 1320.3233 1358.7994 1362.0590 1390.342 1448.458    50
# Richard 1151.7691 1196.3060 1222.9900 1216.3936 1256.191 1266.669    50
# Docendo  874.0247  933.1399  983.5435  985.3697 1026.901 1053.407    50


asked Jan 28 '15 02:01

Steven Beaupré

3 Answers

Here's an idea that makes it fairly simple to choose the names. You can set up a list of calls to send to the .dots argument of filter_(). First a function that creates an unevaluated call.

Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

Now we use filter_(), passing a list of calls into the .dots argument using lapply(), choosing any name and value you want.

nm <- names(df) != "X5"
filter_(df, .dots = lapply(names(df)[nm], Call, 2L))
#   X1 X2 X3 X4 X5
# 1  6  5  7  3  1
# 2  8 10  3  6  5
# 3  5  7 10  2  5
# 4  3  4  2  9  9
# 5  8  3  5  6  2
# 6  9  3  4 10  9
# 7  2  9  7  9  8

You can have a look at the unevaluated calls created by Call(), for example X4 and X5, with

lapply(names(df)[4:5], Call, 2L)
# [[1]]
# X4 >= 2L
#
# [[2]]
# X5 >= 2L

So if you adjust the names() in the X argument of lapply(), you should be fine.
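To make the mechanism concrete, here is a minimal base R sketch of roughly what happens to those calls: each one is evaluated with the data frame's columns in scope, and the resulting logical vectors are combined with `&`. The small data frame and the names `cols`, `tests`, and `keep` are made up for illustration. Note also that `filter_()` has, as far as I know, been deprecated since dplyr 0.7.0, so treat it as legacy API.

```r
# Helper from the answer: builds an unevaluated call such as `X1 >= 2`.
Call <- function(x, value, fun = ">=") call(fun, as.name(x), value)

# Small fixed data frame (illustrative values, not the data from the question).
df <- data.frame(
  X1 = c(7, 2, 1, 8),
  X2 = c(9, 4, 7, 1),
  X5 = c(1, 9, 6, 4)
)

# Build one call per column, skipping X5.
cols  <- names(df)[names(df) != "X5"]
calls <- lapply(cols, Call, 2)

# Evaluate each call with the columns of df in scope ...
tests <- lapply(calls, eval, envir = df)

# ... and keep only the rows for which every test is TRUE.
keep   <- Reduce(`&`, tests)
result <- df[keep, ]
```

Rows 3 and 4 are dropped because X1 or X2 is below 2, while row 1 is kept even though X5 is 1, since X5 is not among the tested columns.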


answered Oct 16 '22 11:10

Rich Scriven

How to specify a column name and mimic a hypothetical filter_each(funs(. >= 2), -X5)?

It might not be the most elegant solution, but it gets the job done:

df %>% filter(!rowSums(.[, !colnames(.) %in% 'X5', drop = FALSE] < 2))

To exclude several columns (e.g. X3 and X5), one can use:

df %>% filter(!rowSums(.[, !colnames(.) %in% c('X3', 'X5'), drop = FALSE] < 2))
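The `!rowSums(... < 2)` idiom can look cryptic, so here is a step-by-step base R sketch of what it computes. The data frame and the intermediate names (`checked`, `violations`, `keep`) are made up for illustration and assume no missing values.

```r
# Small fixed data frame (illustrative values, not the data from the question).
df <- data.frame(
  X1 = c(7, 2, 1, 8),
  X2 = c(9, 4, 7, 1),
  X5 = c(1, 9, 6, 4)
)

# 1. Drop the excluded column; drop = FALSE keeps a data frame
#    even if only one column is left.
checked <- df[, !colnames(df) %in% "X5", drop = FALSE]

# 2. `checked < 2` is a logical matrix; rowSums() counts, per row,
#    how many values violate the condition.
violations <- rowSums(checked < 2)

# 3. `!` turns a count of 0 into TRUE and any positive count into FALSE,
#    so only rows without violations are kept.
keep   <- !violations
result <- df[keep, ]
```

The same logical vector is what `filter()` receives in the one-liners above.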

answered Oct 16 '22 11:10

Marat Talipov

Here's another option with slice, which can be used similarly to filter in this case. The main difference is that you supply an integer vector to slice, whereas filter takes a logical vector.

df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L)))

What I like about this approach is that, because we use select inside rowSums, you can make use of all the special helper functions that select supplies, such as matches.
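For readers on a newer dplyr, the same select helpers can, as far as I know, be used directly inside `filter()` via `if_all()` (available from dplyr 1.0.4), which comes close to the hypothetical `filter_each()` from the question. This is only a sketch on a made-up data frame; it was not part of the benchmark and says nothing about relative speed.

```r
library(dplyr)

# Small fixed data frame (illustrative values, not the data from the question).
df <- data.frame(
  X1 = c(7, 2, 1, 8),
  X2 = c(9, 4, 7, 1),
  X5 = c(1, 9, 6, 4)
)

# The slice approach from this answer.
res_slice <- df %>%
  slice(which(!rowSums(select(., -matches("X5")) < 2L)))

# Assumed modern equivalent: keep rows where every selected column is >= 2.
res_if_all <- df %>%
  filter(if_all(-matches("X5"), ~ .x >= 2))
```

Both calls should return the same two rows here; any tidyselect helper (for example `starts_with()`) can take the place of `matches()`.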

Let's see how it compares to the other answers:

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
    Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
    Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
    dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
    times = 50L,
    unit = "relative"
)

#Unit: relative
#     expr      min       lq   median       uq      max neval
#    Marat 1.304216 1.290695 1.290127 1.288473 1.290609    50
#  Richard 1.139796 1.146942 1.124295 1.159715 1.160689    50
# dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50

Edit note: updated with more reliable benchmark with 50 repetitions (times = 50L).

Following a comment that base R would have the same speed as the slice approach (without specification of what base R approach is meant exactly), I decided to update my answer with a comparison to base R using almost the same approach as in my answer. For base R I used:

base = df[!rowSums(df[-5L] < 2L), ],
base_which = df[which(!rowSums(df[-5L] < 2L)), ]

Benchmark:

df <- data.frame(replicate(5,sample(1:10,10e6,rep=TRUE)))

mbm <- microbenchmark(
  Marat = df %>% filter(!rowSums(.[,!colnames(.) %in% "X5", drop = FALSE] < 2)),
  Richard = filter_(df, .dots = lapply(names(df)[names(df) != "X5"], function(x, y) { call(">=", as.name(x), y) }, 2)),
  dd_slice = df %>% slice(which(!rowSums(select(., -matches("X5")) < 2L))),
  base = df[!rowSums(df[-5L] < 2L), ],
  base_which = df[which(!rowSums(df[-5L] < 2L)), ],
  times = 50L,
  unit = "relative"
)

#Unit: relative
#       expr      min       lq   median       uq      max neval
#      Marat 1.265692 1.279057 1.298513 1.279167 1.203794    50
#    Richard 1.124045 1.160075 1.163240 1.169573 1.076267    50
#   dd_slice 1.000000 1.000000 1.000000 1.000000 1.000000    50
#       base 2.784058 2.769062 2.710305 2.669699 2.576825    50
# base_which 1.458339 1.477679 1.451617 1.419686 1.412090    50

Neither of these two base R approaches is faster than, or even on par with, the slice approach.
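Apart from speed, the two base R variants can differ in their results once the data contain missing values, which is worth checking before relying on a benchmark alone. The sketch below uses a made-up data frame with one NA; the names `df_na`, `cols`, and `keep` are illustrative.

```r
# Small data frame with one missing value (illustrative, not the benchmark data).
df_na <- data.frame(
  X1 = c(7, NA, 1),
  X2 = c(9, 4, 7),
  X5 = c(1, 9, 6)
)

cols <- names(df_na) != "X5"

# rowSums() returns NA for a row containing NA, so `keep` is TRUE, NA, FALSE.
keep <- !rowSums(df_na[cols] < 2L)

# Logical indexing turns the NA into an all-NA row ...
base <- df_na[keep, ]

# ... whereas which() silently drops it.
base_which <- df_na[which(keep), ]
```

With complete data, as in the benchmark, both variants select the same rows; with NAs, only the `which()` version matches what `filter()` and `slice(which(...))` would keep.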

Edit note #2: added benchmark with base R options.

answered Oct 16 '22 11:10

talat
