
Filtering rows in a dataset by columns

I have the following table:

FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e

I want to keep only the rows in which the value of LN also appears in one of the columns LN1 to LN5 of the same row.

The code I used:

testFilter = filter(test, LN %in% c(LN1, LN2, LN3, LN4, LN5)) 

The result is not what I expect:

ï..FN LN LN1 LN2 LN3  LN4  LN5
1     a  b   b   x   x    x    x
2     a  c   b   d   e <NA> <NA>
3     a  d   c   a   b    x    x
4     a  e   b   c   d    x    e

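I understand that c(LN1, LN2, LN3, LN4, LN5) concatenates the five columns in full, giving: "b" "b" "c" "b" "x" "d" "a" "c" "x" "e" "b" "d" "x" NA "x" "x" "x" NA "x" "e", so each LN is compared against the values from every row rather than only its own row. I know this is where the mistake is.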
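To make the difference concrete, here is a small base-R sketch on the four rows shown above. The names `pooled`, `column_wise` and `row_wise` are illustrative only and are not part of the original code; the row-wise loop is meant to show the intended logic, not to be fast.

```r
# The example table from the question, typed in by hand
test <- data.frame(
  FN  = c("a", "a", "a", "a"),
  LN  = c("b", "c", "d", "e"),
  LN1 = c("b", "b", "c", "b"),
  LN2 = c("x", "d", "a", "c"),
  LN3 = c("x", "e", "b", "d"),
  LN4 = c("x", NA, "x", "x"),
  LN5 = c("x", NA, "x", "e"),
  stringsAsFactors = FALSE
)

# c() pools all five columns into one vector of 20 values
pooled <- with(test, c(LN1, LN2, LN3, LN4, LN5))

# Each LN is looked up in the pooled values of ALL rows,
# so every row passes: b, c, d and e all occur somewhere
column_wise <- test$LN %in% pooled

# Intended logic: look up each LN only in LN1..LN5 of its own row
row_wise <- vapply(
  seq_len(nrow(test)),
  function(i) test$LN[i] %in% unlist(test[i, 3:7]),
  logical(1)
)

print(column_wise)
print(row_wise)
```

Only the row-wise version singles out the 1st and 4th record.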

Ideally, I want to return only the 1st and 4th record.

FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   e   b   c   d   x   e

I want to refer to the columns only by name in the filter. The table above is just a small sample of a dataset with 5.4M records.

Cena asked Jan 19 '18 08:01



2 Answers

Using apply:

# data
df1 <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e", header = TRUE, stringsAsFactors = FALSE)


df1[ apply(df1, 1, function(i) i[2] %in% i[3:7]), ]
#   FN LN LN1 LN2 LN3 LN4 LN5
# 1  a  b   b   x   x   x   x
# 4  a  e   b   c   d   x   e

Note: For big datasets, consider the other solutions below; in the benchmark in the other answer they are roughly 30 to 40 times faster (by median) than this apply solution.
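As a possible base-R alternative that avoids calling a function once per row, the comparison can be done one whole column at a time and the results combined with Reduce(). This is only a sketch: it recreates df1 so that it runs on its own, and `cols` and `keep` are illustrative names that do not appear in the answer above. It also refers to the columns by name rather than by position.

```r
# Same data as above, recreated so the block is self-contained
df1 <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e", header = TRUE, stringsAsFactors = FALSE)

# Columns to search, addressed by name
cols <- paste0("LN", 1:5)

# One vectorised comparison per column; NA entries count as "no match"
per_column <- lapply(df1[cols], function(x) !is.na(x) & df1$LN == x)

# A row is kept if any of the five comparisons is TRUE
keep <- Reduce(`|`, per_column)

result <- df1[keep, ]
print(result)
```

On the four example rows this keeps rows 1 and 4, the same as the apply() version.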

zx8754 answered Sep 23 '22 02:09


There is an alternative approach using data.table and Reduce():

library(data.table)
cols <- paste0("LN", 1:5)
setDT(test)[test[, .I[Reduce(`|`, lapply(.SD, function(x) !is.na(x) & LN == x))], 
                 .SDcols = cols]]
   FN LN LN1 LN2 LN3 LN4 LN5
1:  a  b   b   x   x   x   x
2:  a  e   b   c   d   x   e
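The `!is.na(x) &` guard in the expression above matters because a comparison with NA yields NA rather than FALSE, and an NA in a logical index produces an NA position instead of dropping the element. A tiny base-R sketch of that behaviour, using made-up vectors `ln` and `ln4` that mimic the first two rows of LN and LN4:

```r
ln  <- c("b", "c")   # stands in for the LN column
ln4 <- c("x", NA)    # stands in for the LN4 column

raw     <- ln == ln4                  # second element is NA, not FALSE
guarded <- !is.na(ln4) & ln == ln4    # FALSE & NA evaluates to FALSE

idx <- seq_along(ln)                  # plays the role of data.table's .I

print(raw)
print(guarded)
print(idx[raw])       # contains an NA position
print(idx[guarded])   # empty: no row selected
```

Without the guard, the NA position would be passed on as a row index.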

Data

library(data.table)
test <- fread(
"FN LN LN1 LN2 LN3 LN4 LN5
  a   b   b   x   x   x   x
  a   c   b   d   e   NA  NA
  a   d   c   a   b   x   x
  a   e   b   c   d   x   e")

Benchmark

library(data.table)
library(dplyr)
n_row <- 1e6L
set.seed(123L)
DT <- data.table(
  FN = "a",
  LN = sample(letters, n_row, TRUE))
cols <- paste0("LN", 1:5)
DT[, (cols) := lapply(1:5, function(x) sample(c(letters, NA), n_row, TRUE))]
DT
df1 <- as.data.frame(DT)

bm <- microbenchmark::microbenchmark(
  zx8754 = {
    df1[ apply(df1, 1, function(i) i[2] %in% i[3:7]), ]
  },
  eric = {
    df1[ which(df1$LN == df1$LN1 |
                 df1$LN == df1$LN2 |
                 df1$LN == df1$LN3 |
                 df1$LN == df1$LN4 |
                 df1$LN == df1$LN5), ]
  },
  uwe = {
    DT[DT[, .I[Reduce(`|`, lapply(.SD, function(x) !is.na(x) & LN == x))], 
          .SDcols = cols]]
  },
  axe = { 
    filter_at(df1, vars(num_range("LN", 1:5)), any_vars(. == LN))
  },
  jaap = {df1[!!rowSums(df1$LN == df1[, 3:7], na.rm = TRUE),]},
  times = 50L
)
print(bm, "ms")
Unit: milliseconds
   expr        min         lq       mean     median         uq       max neval cld
 zx8754 3120.68925 3330.12289 3508.03001 3460.83459 3589.10255 4552.9070    50   c
   eric   69.74435   79.11995  101.80188   83.78996   98.24054  309.3864    50 a  
    uwe   93.26621  115.30266  130.91483  121.64281  131.75704  292.8094    50 a  
    axe   69.82137   79.54149   96.70102   81.98631   95.77107  315.3111    50 a  
   jaap  362.39318  489.86989  543.39510  544.13079  570.10874 1110.1317    50  b

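For 1 M rows, the hard-coded subsetting, the dplyr/filter_at approach and the data.table/Reduce() approach are the fastest; all three fall into the same group in the cld column, so their timings are not clearly distinguishable. The rowSums() approach is several times slower, and apply() is roughly 30 to 40 times slower by median.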
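Before comparing timings it can be worth confirming that the candidates select the same rows. The sketch below does this in base R for two of the benchmarked expressions (the hard-coded `which()` version and the `rowSums()` version) on the four-row example; `hard_coded` and `row_sums` are illustrative names, and the check is not part of the original benchmark.

```r
# Four-row example data, recreated so the block is self-contained
df1 <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a   b   b   x   x   x   x
a   c   b   d   e   NA  NA
a   d   c   a   b   x   x
a   e   b   c   d   x   e", header = TRUE, stringsAsFactors = FALSE)

# Hard-coded comparison; which() silently drops rows where the result is NA
hard_coded <- which(df1$LN == df1$LN1 |
                      df1$LN == df1$LN2 |
                      df1$LN == df1$LN3 |
                      df1$LN == df1$LN4 |
                      df1$LN == df1$LN5)

# rowSums() version; na.rm = TRUE treats NA comparisons as non-matches
row_sums <- unname(which(!!rowSums(df1$LN == df1[, 3:7], na.rm = TRUE)))

print(hard_coded)
print(row_sums)
```

Both expressions pick rows 1 and 4 here, which matches the expected result in the question.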

library(ggplot2)  # mean_cl_boot additionally requires the Hmisc package to be installed
ggplot(bm, aes(expr, time)) + geom_violin() + scale_y_log10() + stat_summary(fun.data = mean_cl_boot)

Uwe answered Sep 23 '22 02:09