I have the following table:
FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a c b d e NA NA
a d c a b x x
a e b c d x e
I want to keep only the records in which the value of LN also appears in one of the columns LN1 to LN5 of the same row.
The code I used:
testFilter = filter(test, LN %in% c(LN1, LN2, LN3, LN4, LN5))
The result is not what I expect:
ï..FN LN LN1 LN2 LN3 LN4 LN5
1 a b b x x x x
2 a c b d e <NA> <NA>
3 a d c a b x x
4 a e b c d x e
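I understand that c(LN1, LN2, LN3, LN4, LN5) pools the values of all rows into a single vector:
"b" "b" "c" "b" "x" "d" "a" "c" "x" "e" "b" "d" "x" NA "x" "x" "x" NA "x" "e"
so every LN is matched against the whole table instead of its own row. I know this is where the mistake is.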
Ideally, I want to return only the 1st and 4th records:
FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a e b c d x e
I want to filter them only using column names. This is just a subset of 5.4M records.
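For what it's worth, the difference between the pooled test and a row-wise test can be reproduced with a minimal base R sketch. It assumes the table is stored in a data frame called test, as in the question; the names cols, pooled and rowwise are only illustrative.

```r
# Sample table from the question
test <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a c b d e NA NA
a d c a b x x
a e b c d x e", header = TRUE, stringsAsFactors = FALSE)

cols <- paste0("LN", 1:5)

# Pooled test: each LN is looked up in the values of ALL rows at once
pooled <- with(test, LN %in% c(LN1, LN2, LN3, LN4, LN5))
pooled
# TRUE TRUE TRUE TRUE

# Row-wise test: each LN is looked up only in LN1..LN5 of its own row
rowwise <- vapply(
  seq_len(nrow(test)),
  function(i) test$LN[i] %in% unlist(test[i, cols]),
  logical(1)
)
rowwise
# TRUE FALSE FALSE TRUE
```

The row-wise loop is only meant to show the intended semantics; it is not a suggestion for 5.4M records.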
The filter() function from dplyr can be applied to both grouped and ungrouped data. The filter expressions can use comparison operators (==, >, >=), logical operators (&, |, !, xor()), helpers such as between() and near(), and NA checks against the column values. filter() does not modify its input, so the filtered data frame has to be assigned to a variable.
Using apply:
# data
df1 <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a c b d e NA NA
a d c a b x x
a e b c d x e", header = TRUE, stringsAsFactors = FALSE)
df1[ apply(df1, 1, function(i) i[2] %in% i[3:7]), ]
# FN LN LN1 LN2 LN3 LN4 LN5
# 1 a b b x x x x
# 4 a e b c d x e
Note: For big datasets, consider the other solutions below. In the 1 M row benchmark below, the fastest of them are roughly 40 times faster than this apply solution (median times).
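Since the question asks to filter using column names only, the positional indices can be swapped for names. This is a small sketch under the assumption that the data frame is df1 as above; cols, keep and res are illustrative names.

```r
df1 <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a c b d e NA NA
a d c a b x x
a e b c d x e", header = TRUE, stringsAsFactors = FALSE)

# Columns to search, built from their names rather than positions
cols <- paste0("LN", 1:5)

# apply() passes each row as a named character vector,
# so elements can be picked by column name
keep <- apply(df1, 1, function(i) i["LN"] %in% i[cols])
res <- df1[keep, ]
res
```

This has the same performance characteristics as the positional version, so the note about big datasets still applies.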
There is an alternative approach using data.table and Reduce():
library(data.table)
cols <- paste0("LN", 1:5)
setDT(test)[test[, .I[Reduce(`|`, lapply(.SD, function(x) !is.na(x) & LN == x))],
.SDcols = cols]]
   FN LN LN1 LN2 LN3 LN4 LN5
1:  a  b   b   x   x   x   x
2:  a  e   b   c   d   x   e
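The Reduce() step collapses one logical vector per column into a single row filter. The same idea can be sketched in base R without data.table, assuming the sample table from the question; per_col, keep and res are illustrative names.

```r
test <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a c b d e NA NA
a d c a b x x
a e b c d x e", header = TRUE, stringsAsFactors = FALSE)

cols <- paste0("LN", 1:5)

# One logical vector per column: TRUE where the column equals LN.
# The !is.na() guard turns NA comparisons into FALSE.
per_col <- lapply(test[cols], function(x) !is.na(x) & test$LN == x)

# Combine the five vectors element-wise with OR
keep <- Reduce(`|`, per_col)
keep
# TRUE FALSE FALSE TRUE

res <- test[keep, ]
```

Because every comparison is vectorised over a whole column, nothing is evaluated row by row.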
# data
library(data.table)
test <- fread(
"FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a c b d e NA NA
a d c a b x x
a e b c d x e")
# benchmark with 1 M rows
library(data.table)
library(dplyr)
n_row <- 1e6L
set.seed(123L)
DT <- data.table(
FN = "a",
LN = sample(letters, n_row, TRUE))
cols <- paste0("LN", 1:5)
DT[, (cols) := lapply(1:5, function(x) sample(c(letters, NA), n_row, TRUE))]
DT
df1 <- as.data.frame(DT)
bm <- microbenchmark::microbenchmark(
zx8754 = {
df1[ apply(df1, 1, function(i) i[2] %in% i[3:7]), ]
},
eric = {
df1[ which(df1$LN == df1$LN1 |
df1$LN == df1$LN2 |
df1$LN == df1$LN3 |
df1$LN == df1$LN4 |
df1$LN == df1$LN5), ]
},
uwe = {
DT[DT[, .I[Reduce(`|`, lapply(.SD, function(x) !is.na(x) & LN == x))],
.SDcols = cols]]
},
axe = {
filter_at(df1, vars(num_range("LN", 1:5)), any_vars(. == LN))
},
jaap = {df1[!!rowSums(df1$LN == df1[, 3:7], na.rm = TRUE),]},
times = 50L
)
print(bm, "ms")
Unit: milliseconds
   expr        min         lq       mean     median         uq       max neval cld
 zx8754 3120.68925 3330.12289 3508.03001 3460.83459 3589.10255 4552.9070    50   c
   eric   69.74435   79.11995  101.80188   83.78996   98.24054  309.3864    50 a
    uwe   93.26621  115.30266  130.91483  121.64281  131.75704  292.8094    50 a
    axe   69.82137   79.54149   96.70102   81.98631   95.77107  315.3111    50 a
   jaap  362.39318  489.86989  543.39510  544.13079  570.10874 1110.1317    50  b
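For 1 M rows, the hard-coded subsetting and the dplyr/filter_at approach are the fastest and practically tied (medians of about 84 ms and 82 ms), followed by the data.table/Reduce() approach (about 122 ms); all three fall into the same cld group. The rowSums() approach is roughly 6 to 7 times slower than the fastest ones, and apply() is roughly 40 times slower.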
library(ggplot2)  # mean_cl_boot also requires the Hmisc package
ggplot(bm, aes(expr, time)) + geom_violin() + scale_y_log10() + stat_summary(fun.data = mean_cl_boot)
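A side note on the dplyr variant: filter_at() and any_vars() are marked as superseded in current dplyr releases. A sketch of the same filter with if_any(), assuming dplyr 1.0.4 or newer and the sample table from the question (this variant was not part of the benchmark above, so no timing is claimed for it):

```r
library(dplyr)

test <- read.table(text = "
FN LN LN1 LN2 LN3 LN4 LN5
a b b x x x x
a c b d e NA NA
a d c a b x x
a e b c d x e", header = TRUE, stringsAsFactors = FALSE)

# Keep rows where any of LN1..LN5 equals LN of the same row;
# the !is.na() guard makes the handling of missing values explicit
res <- filter(test, if_any(LN1:LN5, ~ !is.na(.x) & .x == LN))
res
```

The column range LN1:LN5 could also be written as num_range("LN", 1:5), as in the benchmarked filter_at() call.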