I need to get words before and after a unique character (in my case: &) in a string in R.
I need to get 'word1' from something like this: "...something something word1 & word2 something..."
I can get the word after using a Perl regular expression in R: (?<=& )[^ ]*(?= )
(It seems to behave the way I would like. I got it by combining answers I found on this site.)
I now need to get the word preceding the & symbol. The lengths of the words vary, as do the number of preceding words and the amount of spacing. Word one could be letters and numbers, just bounded by spaces on either side.
The m operator in Perl is used to match a pattern within the given text. The pattern passed to the m operator can be enclosed in any character, which is then used as the delimiter of the regular expression.
A \w matches a single alphanumeric character (an alphabetic character, or a decimal digit) or _ , not a whole word. Use \w+ to match a string of Perl-identifier characters (which isn't the same as matching an English word).
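As a small illustration of the difference in R (where the backslash has to be doubled inside a string literal), the sketch below compares \w with \w+. The input string and variable names are made up for the example.

```r
# Illustrative input; the names here are hypothetical, not from the question.
x <- "word_1 & word2"

# \w on its own matches exactly one word character (letter, digit or _).
one_char <- regmatches(x, regexpr("\\w", x))

# \w+ matches a whole run of word characters, including the underscore.
one_run <- regmatches(x, regexpr("\\w+", x))

print(one_char)
print(one_run)
```

Note that the run stops at the space, so punctuation such as a hyphen or apostrophe would also end a \w+ match.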
To match a character that has a special meaning in regex, you need to escape it by prefixing it with a backslash ( \ ). E.g., \. matches "." ; regex \+ matches "+" ; and regex \( matches "(" . You also need to use regex \\ to match "\" (backslash).
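In R the same escapes look a little heavier, because the string parser consumes one level of backslashes before the regex engine sees the pattern. A rough sketch, with a made-up test string:

```r
# Illustrative input: the R literal "\\" is a single backslash in the string.
x <- "a.b+c(d\\e"

has_dot       <- grepl("\\.", x)    # regex \.  matches a literal dot
has_plus      <- grepl("\\+", x)    # regex \+  matches a literal plus
has_paren     <- grepl("\\(", x)    # regex \(  matches a literal parenthesis
has_backslash <- grepl("\\\\", x)   # regex \\  matches a literal backslash

# An unescaped dot matches any character; an escaped one only a real dot.
any_char    <- grepl(".", "abc")
literal_dot <- grepl("\\.", "abc")

# fixed = TRUE treats the pattern as plain text, so no escaping is needed.
fixed_dot <- grepl(".", "abc", fixed = TRUE)
```

If the pattern contains no regex features at all, fixed = TRUE is usually the simpler choice.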
If you use (\S+)\s*&\s*(\S+) then the words on both sides of & will be captured. This allows for optional whitespace around the ampersand.
You need to double up the backslashes in an R string, and use the regexec and regmatches functions to apply the pattern and extract the matched substrings.
string <- "...something something word1 & word2 something..."
pattern <- "(\\S+)\\s*&\\s*(\\S+)"
match <- regexec(pattern, string)
words <- regmatches(string, match)
Now words is a one-element list holding a three-item character vector: the whole matched string followed by the first and second capture groups. So words[[1]][2] is word1 and words[[1]][3] is word2.
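If you would rather stay with the lookaround style from the question, a lookahead can pick out only the preceding word. This is a minimal sketch, assuming a single & per string and using the perl = TRUE engine; the variable names before and after are invented for the example.

```r
string <- "...something something word1 & word2 something..."

# Lookahead: a run of non-space characters that is followed by optional
# whitespace and then "&". The lookahead itself is not part of the match.
before <- regmatches(string, regexpr("\\S+(?=\\s*&)", string, perl = TRUE))

# Lookbehind, as in the question: a run of non-space characters preceded
# by "& ". Lookbehinds must be fixed width, so exactly one space is assumed.
after <- regmatches(string, regexpr("(?<=& )\\S+", string, perl = TRUE))

print(before)
print(after)
```

The capture-group version handles variable spacing on both sides of the ampersand, whereas the lookbehind here only works when a single space follows it.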