
Positive lookahead in R

Tags: regex, r

Novice on regular expressions here ...

Assume the following names:

names <- c("Jackson, Michael", "Lennon, John", "Obama, Barack")

I want to truncate each name so that it retains all the characters up to and including the first letter of the first name. The results would look like this:

Jackson, M
Lennon, J
Obama, B

I know this has a simple solution, but I am stuck on specifying what seems like a reasonable approach: a positive lookahead regex. I am trying to match on the comma, the space, and the capitalized first letter. This is what I have, but it is obviously wrong:

names.reduced <- gsub("(?=\\,\\s[A-Z]).*", "", names)
asked Apr 11 '15 02:04 by Brian P



3 Answers

(?= ... ) is a zero-width assertion: it does not consume any characters of the string.

It only matches a position in the string. A zero-width assertion checks whether a pattern can be matched looking ahead from the current position, without adding anything to the overall match. In this case, a lookahead assertion is not necessary at all.
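To make the zero-width point concrete, here is a small sketch using the vector from the question. It assumes perl=TRUE is added so that the (?= ) syntax is accepted; the variable names m and attempt are only for illustration.

```r
names <- c("Jackson, Michael", "Lennon, John", "Obama, Barack")

# The lookahead alone matches a position, not characters:
# every match has length 0 and starts at the comma.
m <- regexpr("(?=,\\s[A-Z])", names, perl = TRUE)
print(as.integer(m))            # start positions
print(attr(m, "match.length"))  # all zero

# The pattern from the question, run with perl = TRUE.
# Since the lookahead consumes nothing, .* begins at the comma,
# so the comma, the space and the initial are removed as well.
attempt <- gsub("(?=,\\s[A-Z]).*", "", names, perl = TRUE)
print(attempt)
```

So the lookahead finds the right place, but the .* that follows it then removes the very characters it just checked for.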

You can do this using a capture group and referring back to that group in the replacement string.

sub('(.*[A-Z]).*', '\\1', names)
# [1] "Jackson, M" "Lennon, J"  "Obama, B"

Or better yet, you can use a negated character class to remove the trailing run of characters that are not A to Z, which leaves the string ending at its last uppercase letter.

sub('[^A-Z]*$', '', names)
# [1] "Jackson, M" "Lennon, J"  "Obama, B"
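One caveat worth sketching: both patterns above cut at the last uppercase letter in the string, which matches the question's data but not necessarily other inputs. The vector names2 below is a hypothetical example with a middle name, and the comma-anchored pattern is my own variation rather than part of the answer.

```r
# Hypothetical input: the second entry has a middle name.
names2 <- c("Jackson, Michael", "Kennedy, John Fitzgerald")

# Both patterns from the answer keep everything up to the LAST capital.
last_cap_group <- sub("(.*[A-Z]).*", "\\1", names2)
last_cap_trim  <- sub("[^A-Z]*$", "", names2)

# Anchoring the group on the comma keeps only the first initial after it.
first_initial <- sub("(,\\s*[A-Z]).*", "\\1", names2)

print(last_cap_group)
print(last_cap_trim)
print(first_initial)
```

If the data only ever contains "Last, First" pairs, the shorter patterns are fine. The comma-anchored version is just more explicit about which capital letter is meant.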
answered Sep 24 '22 15:09 by hwnd


You can use a lookbehind instead of the lookahead assertion:

sub('(?<=, [A-Z]).*$', '', names, perl=TRUE)
#[1] "Jackson, M" "Lennon, J"  "Obama, B"  
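A side note on the lookbehind: as far as I know, the PCRE engine behind perl=TRUE does not accept a lookbehind of unbounded length, so a variant like (?<=,\\s*[A-Z]) would be rejected. If the spacing after the comma can vary, one possible alternative (not part of the answer above) is \\K, which drops everything matched so far from the reported match. The vector names3 is a made-up example.

```r
# Hypothetical input with irregular spacing after the comma.
names3 <- c("Jackson, Michael", "Lennon,  John", "Obama,Barack")

# \\K resets the start of the reported match, so only the text after
# the initial is replaced; \\s* allows zero or more whitespace characters.
res <- sub(",\\s*[A-Z]\\K.*$", "", names3, perl = TRUE)
print(res)
```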
answered Sep 22 '22 15:09 by akrun


You could also use the regmatches function.

> names <- c("Jackson, Michael", "Lennon, John", "Obama, Barack")
> regmatches(names, regexpr(".*,\\s*[A-Z]", names))
[1] "Jackson, M" "Lennon, J"  "Obama, B"
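One thing to keep in mind with the regmatches approach: elements that do not match are dropped from the result, so the output can be shorter than the input. A minimal sketch, assuming a hypothetical vector x that contains an entry without a comma:

```r
# Hypothetical input: "Cher" has no comma, so the pattern cannot match it.
x <- c("Jackson, Michael", "Cher", "Obama, Barack")

m <- regexpr(".*,\\s*[A-Z]", x)   # -1 marks elements with no match
matched <- regmatches(x, m)       # non-matching elements are dropped
print(matched)

# Keep the result aligned with the input by filling NA for non-matches.
out <- rep(NA_character_, length(x))
out[as.integer(m) != -1L] <- matched
print(out)
```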

OR

> library(stringi)
> stri_extract(names, regex=".*,\\s*[A-Z]")
[1] "Jackson, M" "Lennon, J"  "Obama, B"  

OR

Just match all the characters up to the last uppercase letter.

> stri_extract(names, regex=".*[A-Z]")
[1] "Jackson, M" "Lennon, J"  "Obama, B"  
answered Sep 25 '22 15:09 by Avinash Raj