Filtering rows in a dataset by columns

Tags: dataframe, r, subset

I have the following table:

FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e

I'm filtering records for which LN is present in LN1 to LN5.

The code I used:

testFilter = filter(test, LN %in% c(LN1, LN2, LN3, LN4, LN5))

The result is not what I expect:

ï..FN LN LN1 LN2 LN3  LN4  LN5
1     a  b   b   x   x    x    x
2     a  c   b   d   e <NA> <NA>
3     a  d   c   a   b    x    x
4     a  e   b   c   d    x    e

I understand that c(LN1, LN2, LN3, LN4, LN5) gives: "b" "b" "c" "b" "x" "d" "a" "c" "x" "e" "b" "d" "x" NA "x" "x" "x" NA "x" "e" and know this is where the mistake is.

Ideally, I want to return only the 1st and 4th record.

FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   e   b   c   d   x   e

I want to filter them only using column names. This is just a subset of 5.4M records.

asked Jan 19 '18 08:01

Cena

2 Answers

Using apply:

# data
df1 <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e", header = TRUE, stringsAsFactors = FALSE)


df1[ apply(df1, 1, function(i) i[2] %in% i[3:7]), ]
#   FN LN LN1 LN2 LN3 LN4 LN5
# 1  a  b   b   x   x   x   x
# 4  a  e   b   c   d   x   e

Note: For big datasets, consider the other solutions below, which the benchmark in the other answer shows to be dozens of times faster than this apply solution.
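As for why the row-wise call succeeds where the filter() call in the question does not: c(LN1, ..., LN5) pools the values of all rows into one vector, so every LN finds a match somewhere, whereas apply() checks each LN against the values of its own row only. A minimal sketch of the difference on the toy data (pool, pooled and rowwise are illustrative names):

```r
df1 <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e", header = TRUE, stringsAsFactors = FALSE)

# Approach from the question: one vector pooled over all rows
pool <- with(df1, c(LN1, LN2, LN3, LN4, LN5))
pooled <- df1$LN %in% pool

# Row-wise approach: each LN is compared with its own row only
rowwise <- unname(apply(df1, 1, function(i) i[2] %in% i[3:7]))

pooled   # TRUE for every row, so nothing is filtered out
rowwise  # TRUE only for rows 1 and 4
```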

answered Sep 23 '22 02:09

zx8754

There is an alternative approach using data.table and Reduce():

library(data.table)
cols <- paste0("LN", 1:5)
setDT(test)[test[, .I[Reduce(`|`, lapply(.SD, function(x) !is.na(x) & LN == x))], 
                 .SDcols = cols]]

   FN LN LN1 LN2 LN3 LN4 LN5
1:  a  b   b   x   x   x   x
2:  a  e   b   c   d   x   e
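The Reduce() idiom itself is not tied to data.table. As a minimal base R sketch, assuming the comparison columns are named LN1 to LN5 as above (test_df, hits and keep are illustrative names, not part of the original answer):

```r
# Toy data rebuilt as a plain data frame
test_df <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e", header = TRUE, stringsAsFactors = FALSE)

cols <- paste0("LN", 1:5)

# One logical vector per column: does LN equal this column's value?
# !is.na(x) turns comparisons against NA into FALSE instead of NA.
hits <- lapply(test_df[cols], function(x) !is.na(x) & test_df$LN == x)

# Combine the per-column vectors with an element-wise OR
keep <- Reduce(`|`, hits)

res <- test_df[keep, ]
res  # rows 1 and 4
```

Each element of hits has one entry per row, so the OR runs over whole columns at once rather than row by row, which is what keeps the approach vectorised.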

Data

library(data.table)
test <- fread(
"FN LN LN1 LN2 LN3 LN4 LN5
  a   b   b   x   x   x   x
  a   c   b   d   e   NA  NA
  a   d   c   a   b   x   x
  a   e   b   c   d   x   e")

Benchmark

library(data.table)
library(dplyr)
n_row <- 1e6L
set.seed(123L)
DT <- data.table(
  FN = "a",
  LN = sample(letters, n_row, TRUE))
cols <- paste0("LN", 1:5)
DT[, (cols) := lapply(1:5, function(x) sample(c(letters, NA), n_row, TRUE))]
DT
df1 <- as.data.frame(DT)

bm <- microbenchmark::microbenchmark(
  zx8754 = {
    df1[ apply(df1, 1, function(i) i[2] %in% i[3:7]), ]
  },
  eric = {
    df1[ which(df1$LN == df1$LN1 |
                 df1$LN == df1$LN2 |
                 df1$LN == df1$LN3 |
                 df1$LN == df1$LN4 |
                 df1$LN == df1$LN5), ]
  },
  uwe = {
    DT[DT[, .I[Reduce(`|`, lapply(.SD, function(x) !is.na(x) & LN == x))], 
          .SDcols = cols]]
  },
  axe = { 
    filter_at(df1, vars(num_range("LN", 1:5)), any_vars(. == LN))
  },
  jaap = {df1[!!rowSums(df1$LN == df1[, 3:7], na.rm = TRUE),]},
  times = 50L
)
print(bm, "ms")

Unit: milliseconds
   expr        min         lq       mean     median         uq       max neval cld
 zx8754 3120.68925 3330.12289 3508.03001 3460.83459 3589.10255 4552.9070    50   c
   eric   69.74435   79.11995  101.80188   83.78996   98.24054  309.3864    50 a  
    uwe   93.26621  115.30266  130.91483  121.64281  131.75704  292.8094    50 a  
    axe   69.82137   79.54149   96.70102   81.98631   95.77107  315.3111    50 a  
   jaap  362.39318  489.86989  543.39510  544.13079  570.10874 1110.1317    50  b

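For 1 M rows, the hard-coded subsetting (eric) and the dplyr/filter_at approach (axe) are the fastest, with median timings of roughly 82 to 84 ms, followed by the data.table/Reduce() approach (uwe) at roughly 122 ms; the cld column places all three in the same group. Using apply() takes about 3.5 s, which is roughly 30 to 40 times slower than these three.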

library(ggplot2)  # mean_cl_boot additionally needs the Hmisc package installed
ggplot(bm, aes(expr, time)) + geom_violin() + scale_y_log10() + stat_summary(fun.data = mean_cl_boot)

[Figure: violin plot of the benchmark timings per expression on a log-scaled time axis, https://i.stack.imgur.com/LDnoZ.png]
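One detail worth noting about the benchmarked expressions is how they treat NA: a comparison against NA returns NA rather than FALSE, which is why the hard-coded variant is wrapped in which() and the Reduce() variant uses !is.na(x). A small base R sketch with a hypothetical two-row data frame (d and cond are made-up names for illustration):

```r
# Hypothetical toy data: row 2 has no match and contains an NA
d <- data.frame(LN  = c("b", "c"),
                LN1 = c("b", NA),
                LN2 = c("x", "d"),
                stringsAsFactors = FALSE)

# Row 1: TRUE | FALSE gives TRUE; row 2: NA | FALSE gives NA
cond <- d$LN == d$LN1 | d$LN == d$LN2

# Subsetting with an NA in the logical index returns an extra all-NA row
nrow(d[cond, ])         # 2

# which() keeps only the positions that are TRUE
nrow(d[which(cond), ])  # 1
```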

answered Sep 23 '22 02:09

Uwe
