
Regular expression to grab word before a certain character R Perl

Tags: regex, r, perl

I need to get words before and after a unique character (in my case: &) in a string in R.

I need to get 'word1' from something like this: "...something something word1 & word2 something..."

I can get the word after it using a Perl regular expression in R: (?<=& )[^ ]*(?= ). It seems to behave the way I would like; I got it from combing answers I found on this site.
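As a minimal sketch of how that lookaround pattern might be applied in R, assuming the sample string from above: perl = TRUE is needed because R's default regex engine does not support lookbehind. The variable names are only illustrative.

```r
# Sample string taken from the question
string <- "...something something word1 & word2 something..."

# Lookbehind for "& ", then a run of non-space characters,
# then a lookahead for a single space.
after_pattern <- "(?<=& )[^ ]*(?= )"

# regexpr locates the first match (perl = TRUE selects the PCRE engine);
# regmatches extracts the matched text.
word_after <- regmatches(string, regexpr(after_pattern, string, perl = TRUE))
print(word_after)
```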

I now need to get the word preceding the & symbol. The length of the words varies, as do the number of preceding words and the number of spaces. word1 could contain letters and numbers, and is bounded only by spaces on either side.

GregS asked Feb 19 '13 00:02




1 Answer

If you use (\S+)\s*&\s*(\S+) then the words on both sides of & will be captured. This allows for optional whitespace around the ampersand.

You need to double-up the backslashes in an R string, and use the regexec and regmatches functions to apply the pattern and extract the matched substrings.

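string  <- "...something something word1 & word2 something..."
# Each backslash is doubled inside the R string literal
pattern <- "(\\S+)\\s*&\\s*(\\S+)"
# regexec records the positions of the whole match and of each capture group
match   <- regexec(pattern, string)
# regmatches extracts the corresponding substrings
words   <- regmatches(string, match)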

Now words is a one-element list holding a three-item vector: the whole matched string followed by the contents of the first and second capture groups. So words[[1]][2] is word1 and words[[1]][3] is word2.
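Because regexec and regmatches both accept a character vector, the same idea can be applied to several strings at once. The following is a rough sketch under stated assumptions rather than a definitive implementation: the helper name words_around_amp and the sample inputs are made up for illustration, and returning NA for strings without a match is just one possible choice.

```r
# Hypothetical helper: returns a character matrix with one row per input
# string and columns "before" and "after" (NA where the pattern finds nothing).
words_around_amp <- function(x) {
  pattern <- "(\\S+)\\s*&\\s*(\\S+)"
  matches <- regmatches(x, regexec(pattern, x))
  result <- t(vapply(matches, function(m) {
    # A successful match yields 3 items: whole match, group 1, group 2
    if (length(m) == 3) m[2:3] else c(NA_character_, NA_character_)
  }, character(2)))
  colnames(result) <- c("before", "after")
  result
}

# Illustrative inputs: varying whitespace around "&", plus one non-matching string
samples <- c("alpha word1 & word2 omega",
             "x1&y2",
             "left   &   right",
             "no ampersand here")
pairs <- words_around_amp(samples)
print(pairs)
```

Each row of pairs corresponds to one element of samples, so pairs[, "before"] gives the word preceding the ampersand for every input.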

Borodin answered Nov 15 '22 22:11