
Regular expression to extract all text after "--!!" in R dplyr

Tags: regex, r, dplyr

I am trying to use dplyr in R to extract the substring that follows a delimiter ("--!!") in a column of a data frame, after filtering the rows by variable name as in the example below. I want to store the result in a new variable called income_rent.

I am new to regular expressions. My attempt to do this is:

income_cashrent <- v18 %>% 
filter(str_detect(name, "B25122")) %>% 
mutate(income_rent = str_extract(label, "[^--!!]*$"))

However, I get the result: Error in stri_extract_first_regex(string, pattern, opts_regex = opts(pattern)) : Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

The first four values of label are:

Estimate!!Total
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent
Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100

The desired result would be:

[not sure how to indicate an empty result here]
Less than $10,000
Less than $10,000!!With cash rent
Less than $10,000!!With cash rent!!Less than $100

So far I have been unable to debug this, even after consulting other regex examples on Stack Overflow. Any guidance would be most welcome. Thanks in advance!

Abe asked Sep 15 '25 21:09


1 Answer

We can use str_extract with a regex lookbehind to extract the characters after the pattern "--!!":

library(stringr)
library(dplyr)
v18 %>%
  mutate(income_rent = str_extract(label, "(?<=--!!).*"))
#                                                                                                                                               label
#1                                                                                                                                    Estimate!!Total
#2                                 Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000
#3                 Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent
#4 Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100
#                                        income_rent
#1                                              <NA>
#2                                 Less than $10,000
#3                 Less than $10,000!!With cash rent
#4 Less than $10,000!!With cash rent!!Less than $100
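The question also filtered rows by name before extracting. A minimal sketch of the full pipeline follows, assuming a name column exists alongside label; the name values and the v18_demo data frame here are made up for illustration and are not from the original data.

```r
library(dplyr)
library(stringr)

# Hypothetical demo data: the `name` values are invented for illustration
v18_demo <- data.frame(
  name = c("B25122_001", "B25122_002", "B25001_001"),
  label = c(
    "Estimate!!Total",
    "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
    "Estimate!!Total"
  ),
  stringsAsFactors = FALSE
)

income_cashrent <- v18_demo %>%
  # Keep only the rows whose name contains "B25122"
  filter(str_detect(name, "B25122")) %>%
  # The lookbehind (?<=--!!) anchors the match just after the delimiter;
  # rows without the delimiter get NA
  mutate(income_rent = str_extract(label, "(?<=--!!).*"))

income_cashrent
```

As for the original attempt, "[^--!!]*$" is a character class, so it would match individual characters rather than the sequence "--!!". The syntax error is probably because the ICU regex engine behind stringr treats "--" inside square brackets as a set operator.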

Another option is str_match, which returns a matrix whose second column holds the first capture group:

v18$income_rent <- str_match(v18$label, ".*--!!(.*)")[, 2]
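If stringr is not available, roughly the same result can be had in base R. This is only a sketch: the label vector is copied from the question, and has_delim is a helper name introduced here for illustration.

```r
# Sample labels copied from the question
label <- c(
  "Estimate!!Total",
  "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000",
  "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent",
  "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100"
)

# Flag the rows that contain the literal delimiter
has_delim <- grepl("--!!", label, fixed = TRUE)

# Start with NA everywhere, then fill in only the rows with a delimiter
income_rent <- rep(NA_character_, length(label))

# Drop everything up to and including the first "--!!"
income_rent[has_delim] <- sub("^.*?--!!", "", label[has_delim], perl = TRUE)

income_rent
```

One difference to keep in mind: the lazy ".*?" here, like the lookbehind version, keeps everything after the first "--!!", whereas the greedy ".*--!!(.*)" in the str_match call keeps what follows the last one. With a single "--!!" per label, as in this data, the results are the same.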

data

v18 <- structure(list(label = c("Estimate!!Total", "Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000", 
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent", 
"Estimate!!Total!!Household income in the past 12 months (in 2018 inflation-adjusted dollars) --!!Less than $10,000!!With cash rent!!Less than $100"
)), class = "data.frame", row.names = c(NA, -4L))

akrun answered Sep 17 '25 14:09