Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to select specific columns containing certain strings/characters?

Tags:

dataframe

r

dplyr

I have this dataframe:

df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
  b = c(1, 2, 3, 4),
  c = c("wrong", "wrong", "wrong", "wrong"),
  d = c(2, 2, 3, 4))

a       b c     d
correct 1 wrong 2
wrong   2 wrong 2
wrong   3 wrong 3
correct 4 wrong 4

and would like to select only the columns with either the strings 'correct' or 'wrong' (i.e., columns b and d in df1), such that I get this dataframe:

df2 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
        c = c("wrong", "wrong", "wrong", "wrong"))

        a     c
1 correct wrong
2   wrong wrong
3   wrong wrong
4 correct wrong

Can I use dplyr to do this? If not, what function(s) can I use to do this? The example I've given is straightforward, in that I can just do this (dplyr):

select(df1, a, c)

However, in my actual dataframe, I have about 700 variables/columns and a few hundred columns that contain the strings 'correct' or 'wrong' and I don't know the variable/column names.

Any suggestions as to how to do this quickly? Thanks a lot!

like image 939
hsl Avatar asked Apr 25 '15 12:04

hsl


People also ask

How do I select columns based on column names?

Selecting columns based on their name This is the most basic way to select a single column from a dataframe, just put the string name of the column in brackets. Returns a pandas series. Passing a list in the brackets lets you select multiple columns at the same time.

How do I select only certain columns in a Dataframe?

If you have a DataFrame and would like to access or select a specific few rows/columns from that DataFrame, you can use square brackets or other advanced methods such as loc and iloc .


2 Answers

You can use base R Filter which will operate on each of df1's columns and keep all ones satisfying the logical test in the function:

Filter(function(u) any(c('wrong','correct') %in% u), df1)
#        a     c
#1 correct wrong
#2   wrong wrong
#3   wrong wrong
#4 correct wrong

You can also use grepl:

Filter(function(u) any(grepl('wrong|correct',u)), df1)
like image 124
Colonel Beauvel Avatar answered Dec 09 '22 22:12

Colonel Beauvel


---- update ----- Thanks Colonel Beavel. What an elegant solution. I will def use Filter more.

I want to check a speed solution too just in case time is an important factor:

locator <- apply(df1, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
newdf <- df1[,!index]

I expanded your data frame to 500,000 columns:

dftest <- as.data.frame(replicate(500000, df1[,1]))

Then tested the system time for a function with apply, Filter with grepl, and Filter with pattern %in%:

f <- function() {
locator <- apply(dftest, 2, function(x) grepl("correct|wrong", x))
index <- apply(locator, 2, any)
newdf <- dftest[,!index]
}

f1 <- function() {newdf <- (Filter(function(x) any(c("wrong", "correct") %in% x), dftest))}

f2 <- function() {newdf <- Filter(function(u) any(grepl('wrong|correct',u)), dftest)}


system.time(f())
   user  system elapsed 
   24.32    0.00   24.35 
system.time(f1())
   user  system elapsed 
   2.31    0.00    2.34 
system.time(f2())
   user  system elapsed 
   8.66    0.01    8.71 

Colonel's solution is by far the best one. It's clean and performs best. --credit @akrun for data.frame suggestion.

like image 30
Pierre L Avatar answered Dec 09 '22 21:12

Pierre L