
R Regular Expression Lookbehind

I have a vector filled with strings of the following format: <year1><year2><id1><id2>

The first entries of the vector look like this:

199719982001
199719982002
199719982003
199719982003

For the first entry we have: year1 = 1997, year2 = 1998, id1 = 2, id2 = 001.

I want to write a regular expression that pulls out year1, id1, and the digits of id2 that are not zero. So for the first entry the regex should output: 199721.

I have tried doing this with the stringr package, and created the following regex:

"^\\d{4}|\\d{1}(?<=\\d{3}$)"

to pull out year1 and id1. However, when using the lookbehind I get an "invalid regular expression" error. This is a bit puzzling to me; can R not handle lookaheads and lookbehinds?
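A likely explanation, sketched under assumptions: base R's default regex engine (TRE) does not support lookaround, whereas the PCRE engine selected with perl = TRUE does, and stringr releases of that era are assumed to have passed patterns to the same base engine. Note also that a lookbehind placed after \\d{1} inspects the characters ending at that digit, so a lookahead is probably what was intended for id1. The vector s and the variable names below are illustrative.

```r
# Sample data taken from the question.
s <- c("199719982001", "199719982002", "199719982003")

# year1: the first four digits of each string.
year1 <- regmatches(s, regexpr("^\\d{4}", s, perl = TRUE))

# id1: the single digit that is followed by exactly three more digits
# and then the end of the string. The lookahead (?=...) needs perl = TRUE.
id1 <- regmatches(s, regexpr("\\d(?=\\d{3}$)", s, perl = TRUE))

year1_id1 <- paste0(year1, id1)
print(year1_id1)
```

With perl = TRUE the same functions also accept lookbehinds such as (?<=...), provided the lookbehind has a fixed length.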

Thomas Jensen asked Jan 12 '12 11:01



2 Answers

Since this is a fixed format, why not use substr? year1 is extracted with substr(s,1,4), id1 with substr(s,9,9), and id2 with as.numeric(substr(s,10,12)). In the last case I used as.numeric to get rid of the leading zeroes.
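The substr approach can be sketched as follows, assuming every string has exactly 12 characters, so that id2 occupies positions 10 to 12. The vector s and the variable names are illustrative.

```r
# Sample data taken from the question.
s <- c("199719982001", "199719982002", "199719982003")

year1 <- substr(s, 1, 4)               # characters 1-4
id1   <- substr(s, 9, 9)               # character 9
id2   <- as.numeric(substr(s, 10, 12)) # "001" becomes 1

# paste0 converts the numeric id2 back to text without leading zeroes.
result <- paste0(year1, id1, id2)
print(result)
```

Note that as.numeric only drops leading zeroes: an id2 of "010" would become 10, not 1.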

mpiktas answered Nov 10 '22 07:11


You can use sub.

sub("^(.{4}).{4}(.{1}).*([1-9]{1,3})$","\\1\\2\\3",s)
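Applied to the sample data from the question, this call can be checked as below. One caveat: because .* is greedy, the last group may capture only the final digit when id2 has more than one non-zero digit, and the pattern does not match at all when id2 ends in 0. The second pattern is a hypothetical variant, not part of the original answer, that strips only leading zeroes and keeps the rest of id2; the vector s and the input "199719982120" are made up for illustration.

```r
# Sample data taken from the question.
s <- c("199719982001", "199719982002", "199719982003")

# Pattern from the answer: year1, skip year2, id1, then trailing non-zero digits.
out <- sub("^(.{4}).{4}(.{1}).*([1-9]{1,3})$", "\\1\\2\\3", s)
print(out)

# Hypothetical variant: consume leading zeroes of id2 with 0*, keep the remainder.
out2 <- sub("^(.{4}).{4}(.)0*(.*)$", "\\1\\2\\3", c(s, "199719982120"))
print(out2)
```

The variant behaves like the as.numeric approach: for "199719982120" it keeps all of "120".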
Wojciech Sobala answered Nov 10 '22 07:11