
Filter multiple values on a string column in dplyr

I have a data.frame with character data in one of the columns. I would like to filter multiple options in the data.frame from the same column. Is there an easy way to do this that I'm missing?

Example: data.frame name = dat

    days      name
    88        Lynn
    11        Tom
    2         Chris
    5         Lisa
    22        Kyla
    1         Tom
    222       Lynn
    2         Lynn

I'd like to filter out Tom and Lynn for example.
When I do:

    target <- c("Tom", "Lynn")
    filt <- filter(dat, name == target)

I get this error:

longer object length is not a multiple of shorter object length 
Tom O asked Sep 03 '14 14:09


1 Answer

You need %in% instead of ==:

    library(dplyr)
    target <- c("Tom", "Lynn")
    filter(dat, name %in% target)  # equivalently, dat %>% filter(name %in% target)

Produces:

      days name
    1   88 Lynn
    2   11  Tom
    3    1  Tom
    4  222 Lynn
    5    2 Lynn

To understand why, consider what happens here:

    dat$name == target
    # [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Basically, we're recycling the length-two target vector four times to match the length of dat$name. In other words, we are doing:

    Lynn  == Tom
    Tom   == Lynn
    Chris == Tom
    Lisa  == Lynn
    ... continue repeating Tom and Lynn until end of data frame

In this case I don't get the message you saw. I suspect your actual data frame has a number of rows that is not a multiple of two, so target cannot be recycled evenly, whereas the sample you provided (8 rows) can. If the sample had had an odd number of rows, I would have gotten the same message as you (base R raises it as a warning). But even when recycling works, this is clearly not what you want. Basically, the statement dat$name == target is equivalent to saying:

return TRUE for every value in an odd position that is equal to "Tom" or every value in an even position that is equal to "Lynn".

It so happens that the last value in your sample data frame is in an even position and equal to "Lynn", hence the one TRUE above.
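The recycling described above can be made explicit in base R. This is a minimal sketch that rebuilds only the name column from the sample in the question; the variable names name, recycled and msg are illustrative and not part of the original post.

```r
# Sample names from the question (8 values) and the two targets
name   <- c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn")
target <- c("Tom", "Lynn")

# What == does implicitly: repeat target until it is as long as name
recycled <- rep_len(target, length(name))
recycled
# [1] "Tom"  "Lynn" "Tom"  "Lynn" "Tom"  "Lynn" "Tom"  "Lynn"

# The element-by-element comparison gives the same result either way
identical(name == target, name == recycled)
# [1] TRUE

# With 7 values, target no longer fits evenly, and R signals the
# message quoted in the question (as a warning in base R)
msg <- tryCatch(name[-8] == target, warning = function(w) conditionMessage(w))
msg
# [1] "longer object length is not a multiple of shorter object length"
```

Writing the recycled vector out makes it easier to see that each name is compared against only one of the two targets, depending on its position.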

To contrast, dat$name %in% target says:

for each value in dat$name, check that it exists in target.

Very different. Here is the result:

[1]  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE 
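For comparison, here is a small base R sketch of two other ways to write the same membership test. The variable names are illustrative; the match() form follows the way the base R documentation describes %in%.

```r
name   <- c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn")
target <- c("Tom", "Lynn")

# Membership test: one TRUE/FALSE per element of name
in_target <- name %in% target

# Same result spelled out as an explicit OR of single-value comparisons
by_or <- name == "Tom" | name == "Lynn"

# Same result via match(), which returns 0 where there is no match
by_match <- match(name, target, nomatch = 0L) > 0L

identical(in_target, by_or)     # TRUE
identical(in_target, by_match)  # TRUE
```

The explicit OR works for two values but gets unwieldy as the list of targets grows, which is where %in% is more convenient.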

Note that your problem has nothing to do with dplyr; it is just a misuse of ==.
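To illustrate that point, the same filtering can be sketched in base R without dplyr. The data frame below is rebuilt by hand from the sample in the question, and kept and dropped are illustrative names.

```r
# Rebuild the sample data frame from the question
dat <- data.frame(
  days = c(88, 11, 2, 5, 22, 1, 222, 2),
  name = c("Lynn", "Tom", "Chris", "Lisa", "Kyla", "Tom", "Lynn", "Lynn"),
  stringsAsFactors = FALSE
)
target <- c("Tom", "Lynn")

# Keep only the rows whose name is in target
kept <- dat[dat$name %in% target, ]

# Drop those rows instead by negating the test
dropped <- dat[!(dat$name %in% target), ]

nrow(kept)     # 5
dropped$name   # "Chris" "Lisa" "Kyla"
```

If "filter out" was meant as "remove", the negated test is the one to use; with dplyr that would be filter(dat, !(name %in% target)).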

BrodieG answered Oct 25 '22 16:10