I have this dataframe:
df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
b = c(1, 2, 3, 4),
c = c("wrong", "wrong", "wrong", "wrong"),
d = c(2, 2, 3, 4))
a b c d
correct 1 wrong 2
wrong 2 wrong 2
wrong 3 wrong 3
correct 4 wrong 4
and would like to select only the columns that contain either of the strings 'correct' or 'wrong' (i.e., columns a and c in df1), such that I get this dataframe:
df2 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
c = c("wrong", "wrong", "wrong", "wrong"))
a c
1 correct wrong
2 wrong wrong
3 wrong wrong
4 correct wrong
Can I use dplyr to do this? If not, what function(s) can I use to do this? The example I've given is straightforward, in that I can just do this (dplyr):
select(df1, a, c)
However, in my actual dataframe, I have about 700 variables/columns and a few hundred columns that contain the strings 'correct' or 'wrong' and I don't know the variable/column names.
Any suggestions as to how to do this quickly? Thanks a lot!
You can use base R's Filter, which will operate on each of df1's columns and keep all the ones satisfying the logical test in the function:
Filter(function(u) any(c('wrong','correct') %in% u), df1)
# a c
#1 correct wrong
#2 wrong wrong
#3 wrong wrong
#4 correct wrong
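You can also use grepl, which matches the pattern anywhere inside a cell rather than requiring the whole cell to equal one of the strings: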
Filter(function(u) any(grepl('wrong|correct',u)), df1)
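Since the question asks about dplyr specifically, here is a minimal sketch of the same test written with select() and where(). It assumes dplyr 1.0.0 or later is installed; the names targets and df2 are only for illustration.

```r
library(dplyr)

df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                  b = c(1, 2, 3, 4),
                  c = c("wrong", "wrong", "wrong", "wrong"),
                  d = c(2, 2, 3, 4))

# Strings whose presence marks a column as one to keep (illustrative name)
targets <- c("correct", "wrong")

# where() applies the predicate to each column; select() keeps the columns
# for which it returns TRUE
df2 <- select(df1, where(function(col) any(col %in% targets)))

print(names(df2))
```

This is the same exact-match test as the Filter version above, so it should keep the same columns without needing their names.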
---- update -----
Thanks Colonel Beavel. What an elegant solution. I will def use Filter more.
I also wanted to compare speed, in case time is an important factor. Here is an apply-based alternative:
locator <- apply(df1, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
# keep the columns where at least one cell matched
newdf <- df1[, index, drop = FALSE]
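A possible variant of the same indexing idea, shown here only as a sketch: apply() first converts the data frame to a matrix, whereas vapply() walks over the columns directly and returns one logical value per column. The names targets, has_target, and newdf_v are illustrative.

```r
df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                  b = c(1, 2, 3, 4),
                  c = c("wrong", "wrong", "wrong", "wrong"),
                  d = c(2, 2, 3, 4))

targets <- c("correct", "wrong")

# One TRUE/FALSE per column: does the column contain any target string?
has_target <- vapply(df1, function(col) any(col %in% targets), logical(1))

# drop = FALSE keeps the result a data frame even if only one column matches
newdf_v <- df1[, has_target, drop = FALSE]

print(has_target)
```

Keeping the logical index around (rather than only the filtered data frame) also makes it easy to get the matching column names with names(df1)[has_target].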
I expanded your data frame to 500,000 columns:
dftest <- as.data.frame(replicate(500000, df1[,1]))
Then I timed three functions: one using apply (f), Filter with %in% (f1), and Filter with grepl (f2):
f <- function() {
locator <- apply(dftest, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
newdf <- dftest[, index, drop = FALSE]
}
f1 <- function() {newdf <- (Filter(function(x) any(c("wrong", "correct") %in% x), dftest))}
f2 <- function() {newdf <- Filter(function(u) any(grepl('wrong|correct',u)), dftest)}
system.time(f())
user system elapsed
24.32 0.00 24.35
system.time(f1())
user system elapsed
2.31 0.00 2.34
system.time(f2())
user system elapsed
8.66 0.01 8.71
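Colonel's Filter with %in% solution is by far the best one. It's clean and the fastest of the three (about 2.3 s elapsed, versus 8.7 s for Filter with grepl and 24.4 s for apply). --credit @akrun for data.frame suggestion.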