
Filtering observations in dplyr in combination with grepl

I am trying to work out how to filter some observations from a large dataset using dplyr and grepl. I am not wedded to grepl if another solution would work better.

Take this sample df:

    df1 <- data.frame(fruit = c("apple", "orange", "xapple", "xorange",
                                "applexx", "orangexx", "banxana", "appxxle"),
                      group = c("A", "B"))
    df1

    #     fruit group
    #1    apple     A
    #2   orange     B
    #3   xapple     A
    #4  xorange     B
    #5  applexx     A
    #6 orangexx     B
    #7  banxana     A
    #8  appxxle     B

I want to:

  1. filter out those cases beginning with 'x'
  2. filter out those cases ending with 'xx'

I have managed to work out how to get rid of everything that contains 'x' or 'xx', but not how to restrict the match to the beginning or end of the string. Here is how to get rid of everything with 'xx' anywhere inside (not just at the end):

    df1 %>%
      filter(!grepl("xx", fruit))

    #    fruit group
    #1   apple     A
    #2  orange     B
    #3  xapple     A
    #4 xorange     B
    #5 banxana     A

This also removed 'appxxle', which (from my point of view) should have been kept.

I have never fully got to grips with regular expressions. I've been trying to modify code such as: grepl("^(?!x).*$", df1$fruit, perl = TRUE) to try and make it work within the filter command, but am not quite getting it.

Expected output:

    #      fruit group
    #1     apple     A
    #2    orange     B
    #3   banxana     A
    #4   appxxle     B

I'd like to do this inside dplyr if possible.

Asked by jalapic, Sep 23 '14 at 15:09



1 Answer

I didn't understand your second regex, but this more basic regex seems to do the trick:

    df1 %>% filter(!grepl("^x|xx$", fruit))

    #    fruit group
    #1   apple     A
    #2  orange     B
    #3 banxana     A
    #4 appxxle     B
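For comparison, the lookahead approach from the question can probably be made to work too. This is only a sketch, not necessarily what the asker had in mind: it assumes two negative lookaheads chained after `^`, run with `perl = TRUE`. The variable names `keep_lookahead` and `keep_negated` are made up for illustration.

```r
# Sample values from the question, as a plain character vector
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

# Negative lookaheads need the PCRE engine, hence perl = TRUE.
#   ^(?!x)      the string does not start with "x"
#   (?!.*xx$)   the string does not end with "xx"
keep_lookahead <- grepl("^(?!x)(?!.*xx$)", fruit, perl = TRUE)

# The negated alternation keeps the same elements
keep_negated <- !grepl("^x|xx$", fruit)

fruit[keep_lookahead]
# "apple" "orange" "banxana" "appxxle"

identical(keep_lookahead, keep_negated)
# TRUE
```

Inside dplyr, the same call would go straight into `filter()` without the `!`, since the lookaheads already express the negation. The negated alternation is arguably easier to read.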

And I assume you know this, but you don't have to use dplyr here at all:

    df1[!grepl("^x|xx$", df1$fruit), ]

    #    fruit group
    #1   apple     A
    #2  orange     B
    #7 banxana     A
    #8 appxxle     B

The regex is looking for strings that start with x OR end with xx. The ^ and $ are regex anchors for the beginning and ending of the string respectively. | is the OR operator. We're negating the results of grepl with the ! so we're finding strings that don't match what's inside the regex.
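To see the pieces described above in isolation, here is a small sketch that tests each half of the pattern separately and then checks that the combined pattern agrees with them. It uses a plain character vector rather than the data frame, and the variable names are illustrative only. The last step shows a regex-free alternative using base R's `startsWith()` and `endsWith()` (available from R 3.3.0, and they need character input, not factors).

```r
# Sample values from the question, as a plain character vector
fruit <- c("apple", "orange", "xapple", "xorange",
           "applexx", "orangexx", "banxana", "appxxle")

starts_with_x <- grepl("^x", fruit)   # ^ ties the match to the start
ends_with_xx  <- grepl("xx$", fruit)  # $ ties the match to the end

# | has the lowest precedence, so "^x|xx$" reads as (^x) OR (xx$)
either <- grepl("^x|xx$", fruit)
identical(either, starts_with_x | ends_with_xx)
# TRUE

# Without the anchors, "xx" matches anywhere and also drops "appxxle"
fruit[!grepl("xx", fruit)]
# "apple" "orange" "xapple" "xorange" "banxana"

# Regex-free alternative with fixed prefixes and suffixes
keep <- !(startsWith(fruit, "x") | endsWith(fruit, "xx"))
fruit[keep]
# "apple" "orange" "banxana" "appxxle"
```

Any of these logical vectors can be used as the condition inside `filter()` in the same way as the `grepl()` call.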

Answered by Chase, Sep 21 '22 at 05:09