How to select specific columns containing certain strings/characters?

Tags: dataframe, r, dplyr

I have this dataframe:

df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
  b = c(1, 2, 3, 4),
  c = c("wrong", "wrong", "wrong", "wrong"),
  d = c(2, 2, 3, 4))

a       b c     d
correct 1 wrong 2
wrong   2 wrong 2
wrong   3 wrong 3
correct 4 wrong 4

and would like to select only the columns that contain either of the strings 'correct' or 'wrong' (i.e., columns a and c in df1), so that I get this dataframe:

df2 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
        c = c("wrong", "wrong", "wrong", "wrong"))

        a     c
1 correct wrong
2   wrong wrong
3   wrong wrong
4 correct wrong

Can I use dplyr to do this? If not, what function(s) can I use to do this? The example I've given is straightforward, in that I can just do this (dplyr):

select(df1, a, c)

However, in my actual dataframe, I have about 700 variables/columns and a few hundred columns that contain the strings 'correct' or 'wrong' and I don't know the variable/column names.

Any suggestions as to how to do this quickly? Thanks a lot!

asked Apr 25 '15 12:04

hsl

2 Answers

You can use base R's Filter, which applies a function to each of df1's columns and keeps the columns for which it returns TRUE:

Filter(function(u) any(c('wrong','correct') %in% u), df1)
#        a     c
#1 correct wrong
#2   wrong wrong
#3   wrong wrong
#4 correct wrong
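To make the mechanism explicit, here is a minimal sketch of roughly what Filter does here: compute one TRUE/FALSE per column, then index the columns with that vector. The names `keep` and `res` are only illustrative, and this is not Filter's actual source.

```r
df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                  b = c(1, 2, 3, 4),
                  c = c("wrong", "wrong", "wrong", "wrong"),
                  d = c(2, 2, 3, 4))

# One TRUE/FALSE per column: does the column hold either value exactly?
keep <- vapply(df1, function(u) any(c("wrong", "correct") %in% u), logical(1))

# Index the columns with the logical vector; drop = FALSE keeps a data frame
# even when only one column matches.
res <- df1[, keep, drop = FALSE]
names(res)
```

Keeping the logical vector around can be handy if you also want the names of the matching columns, e.g. `names(df1)[keep]`.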

You can also use grepl:

Filter(function(u) any(grepl('wrong|correct',u)), df1)
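Note that the two tests are not equivalent: %in% matches whole values, while grepl matches substrings. A small sketch on a made-up data frame (`df`, with the hypothetical value "incorrect") shows the difference, plus an anchored pattern if you want grepl to behave like the exact match:

```r
# Illustrative data only: "incorrect" contains the substring "correct".
df <- data.frame(x = c("incorrect", "ok"),
                 y = c("wrong", "ok"),
                 stringsAsFactors = FALSE)

# Exact match on whole values: only y qualifies.
exact <- Filter(function(u) any(c("wrong", "correct") %in% u), df)

# Substring match: x qualifies too, because of "incorrect".
partial <- Filter(function(u) any(grepl("wrong|correct", u)), df)

# Anchoring the pattern restores whole-value matching.
anchored <- Filter(function(u) any(grepl("^(wrong|correct)$", u)), df)

names(exact)     # "y"
names(partial)   # "x" "y"
names(anchored)  # "y"
```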

answered Dec 09 '22 22:12

Colonel Beauvel

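---- update ----- Thanks, Colonel Beauvel. What an elegant solution. I will definitely use Filter more.

I also want to check the speed of each solution, in case run time is an important factor. Here is an approach using apply:

# TRUE wherever a cell contains "correct" or "wrong"
locator <- apply(df1, 2, function(x) grepl("correct|wrong", x))
# TRUE for each column with at least one match
index <- apply(locator, 2, any)
# keep the matching columns (index, not !index, which would drop them)
newdf <- df1[, index, drop = FALSE]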

I expanded your data frame to 500,000 columns:

dftest <- as.data.frame(replicate(500000, df1[,1]))

Then I measured system.time for three functions: apply (f), Filter with %in% (f1), and Filter with grepl (f2):

f <- function() {
  locator <- apply(dftest, 2, function(x) grepl("correct|wrong", x))
  index <- apply(locator, 2, any)
  # keep the columns that contain a match
  newdf <- dftest[, index, drop = FALSE]
}

f1 <- function() {newdf <- Filter(function(x) any(c("wrong", "correct") %in% x), dftest)}

f2 <- function() {newdf <- Filter(function(u) any(grepl('wrong|correct',u)), dftest)}


system.time(f())
   user  system elapsed 
   24.32    0.00   24.35 
system.time(f1())
   user  system elapsed 
   2.31    0.00    2.34 
system.time(f2())
   user  system elapsed 
   8.66    0.01    8.71

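Colonel Beauvel's Filter solution with %in% is by far the best one: it is clean and the fastest here (2.34 s elapsed, versus 8.71 s for Filter with grepl and 24.35 s for apply). Credit to @akrun for the data.frame suggestion.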
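
Since the question also asks about dplyr, here is a possible sketch of the same exact-match test with select(). It assumes dplyr 1.0.0 or later, where the where() selection helper is available; it was not part of the timings above.

```r
library(dplyr)

df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                  b = c(1, 2, 3, 4),
                  c = c("wrong", "wrong", "wrong", "wrong"),
                  d = c(2, 2, 3, 4))

# where() keeps the columns for which the predicate returns TRUE.
df2 <- select(df1, where(function(x) any(x %in% c("correct", "wrong"))))
names(df2)
```

The predicate is the same one passed to Filter, so the two approaches should select the same columns.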

answered Dec 09 '22 21:12

Pierre L
