
Extract a numeric pattern between two underscores in a string

Tags: regex, r, gsub

I am relatively new to regular expressions and I am running into a dead end. I have a data frame with a column that looks like this:

year1
GMM14_2000_NGVA
GMM14_2001_NGVA
GMM14_2002_NGVA
...
GMM14_2014_NGVA

I am trying to extract the year in the middle of the string (2000, 2001, etc.). This is my code thus far:

gsub("[^0-9]", "", year1)

This returns the year, but it also keeps the 14 from the first part of the string:

142000
142001

Any idea on how to exclude the 14 from the pattern or how to extract the year information more efficiently?

Thanks

asked Oct 01 '15 14:10 by asado23


1 Answer

Use the following gsub:

s <- "GMM14_2002_NGVA"
gsub("^[^_]*_|_[^_]*$", "", s)

See IDEONE demo

The regex breakdown:

Match...

  • ^[^_]*_ - 0 or more characters other than _ from the start of the string, followed by a _
  • | - or...
  • _[^_]*$ - a _ followed by 0 or more characters other than _ up to the end of the string

and remove them.
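
Since gsub is vectorized, the same call can be applied to the whole column at once. A minimal sketch, assuming the column has been pulled out into a character vector named year1 (the sample values are copied from the question) and that every value contains exactly two underscores:

```r
# Sample values from the question; year1 stands in for the data frame column.
year1 <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA",
           "GMM14_2002_NGVA", "GMM14_2014_NGVA")

# Remove everything up to and including the first underscore,
# and everything from the last underscore to the end.
years <- gsub("^[^_]*_|_[^_]*$", "", year1)

# Optional: convert to integer. This assumes only the year remains,
# which holds when each value has exactly two underscores.
years_int <- as.integer(years)
print(years_int)
```

If a value had more than two underscores, the middle part would still contain underscores and as.integer would yield NA for it.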

As an alternative,

library(stringr)
str_extract(s, "(?<=_)\\d{4}(?=_)")

Here the Perl-like regex matches a 4-digit substring that is enclosed by underscores.
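
If loading stringr is not an option, the same lookaround pattern should also work in base R with perl = TRUE. This is a sketch of a swapped-in base R approach (regexpr plus regmatches), not part of the original answer; year1 is again an assumed character vector holding the sample values from the question:

```r
# Sample values from the question; year1 stands in for the data frame column.
year1 <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA",
           "GMM14_2002_NGVA", "GMM14_2014_NGVA")

# Locate the first run of exactly four digits that has an underscore
# on both sides. Lookarounds require perl = TRUE in base R.
m <- regexpr("(?<=_)\\d{4}(?=_)", year1, perl = TRUE)

# Pull out the matched substrings.
years <- regmatches(year1, m)
print(years)
```

Note that regmatches drops elements that have no match, so the result can be shorter than the input, whereas str_extract returns NA for those elements.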

answered Oct 04 '22 12:10 by Wiktor Stribiżew