
How to extract a substring by inverse pattern with R?

Tags: string, regex, r

I am trying to extract a substring by pattern using R's gsub() function.

# Example: extracting "7 years" substring.
string <- "Psychologist - 7 years on the website, online"
gsub(pattern="[0-9]+\\s+\\w+", replacement="", string)

[1] "Psychologist -  on the website, online"

As you can see, it's easy to exclude the needed substring using gsub(), but I need to invert the result and get only "7 years". I was thinking about using "^", something like this:

gsub(pattern="[^[0-9]+\\s+\\w+]", replacement="", string)

Could anyone please help me with the correct regexp pattern?

Michael asked Oct 26 '17 10:10



2 Answers

You may use

sub(pattern=".*?([0-9]+\\s+\\w+).*", replacement="\\1", string)

See this R demo.

Details

  • .*? - any 0+ chars, as few as possible
  • ([0-9]+\\s+\\w+) - Capturing group 1:
    • [0-9]+ - one or more digits
    • \\s+ - 1 or more whitespaces
    • \\w+ - 1 or more word chars
  • .* - the rest of the string (any 0+ chars, as many as possible)

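The \\1 in the replacement is a backreference to Group 1. Because the pattern matches the entire string, the whole string is replaced with just the captured substring.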
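As a minimal sketch of the approach above, the snippet below runs the capture-and-replace call and compares it with a different base R technique, regmatches() with regexpr(), which returns the matched text directly. Two things here are additions and not part of the answer: perl = TRUE, which selects the PCRE engine so the lazy .*? follows PCRE semantics, and the variable names pattern, extracted, direct and no_digits, which are purely illustrative.

```r
string  <- "Psychologist - 7 years on the website, online"
pattern <- "[0-9]+\\s+\\w+"   # digits, whitespace, then one word

# Approach from the answer: match the whole string, keep only Group 1.
# perl = TRUE is an assumption added for this sketch.
extracted <- sub(paste0(".*?(", pattern, ").*"), "\\1", string, perl = TRUE)
print(extracted)

# Alternative technique: regexpr() locates the first match and
# regmatches() returns the matched text itself.
direct <- regmatches(string, regexpr(pattern, string))
print(direct)

# The two differ when nothing matches:
no_digits <- "Psychologist, online"
# sub() returns the input unchanged ...
unchanged <- sub(paste0(".*?(", pattern, ").*"), "\\1", no_digits, perl = TRUE)
# ... while regmatches() returns an empty character vector.
empty <- regmatches(no_digits, regexpr(pattern, no_digits))
```

The no-match behaviour is worth keeping in mind: with sub(), a string that contains no number comes back untouched, so it can be mistaken for a successful extraction.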

Wiktor Stribiżew answered Sep 28 '22 10:09


You could use the opposite of \d, which is \D in R:

string <- "Psychologist - 7 years on the website, online"
sub(pattern = "\\D*(\\d+\\s+\\w+).*", replacement = "\\1", string)
# [1] "7 years"

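\D* matches as many non-digit characters as possible, (\d+\s+\w+) captures the number and the word that follows it, and .* consumes the rest of the string. Because the pattern matches the complete string, replacing it with \1 leaves only the captured group.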
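Since sub() is vectorised, the same pattern can be applied to several strings at once. The sketch below assumes a small made-up input vector (only the first element comes from the question) and uses grepl() as a guard so that strings without a number yield NA instead of being returned unchanged. The names strings, matched and result are illustrative.

```r
# Only the first string is from the question; the others are made up.
strings <- c("Psychologist - 7 years on the website, online",
             "Coach - 12 months on the website",
             "No duration here")

pattern <- "\\D*(\\d+\\s+\\w+).*"

# TRUE where the string contains "<digits> <word>".
matched <- grepl(pattern, strings)

# Keep the captured group where the pattern matched, NA elsewhere.
result <- ifelse(matched, sub(pattern, "\\1", strings), NA_character_)
print(result)
```

Without the grepl() guard, the third element would come back as the full original string, because sub() leaves non-matching input as it is.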

See a demo on regex101.com.

Jan answered Sep 28 '22 09:09