
Extract a year number from a string that is surrounded by special characters

Tags: regex, r

What's a good way to extract only the number 2007 from the following string:

some_string <- "1_2_start_2007_3_end"

The pattern to detect the year number in my case would be:

  • 4 digits
  • surrounded by "_"

I am quite new to using regular expressions. I tried the following:

library(stringr)
regexp <- "_+[0-9]+_"
names <- str_extract(some_string, regexp)

But this does not enforce that there are exactly 4 digits, and the result includes the underscores as well.

Patrick Balada asked Dec 18 '22 01:12



1 Answer

You may use a sub option, too:

some_string <- "1_2_start_2007_3_end"
sub(".*_(\\d{4})_.*", "\\1", some_string)

See the regex demo

Details

  • .* - any 0+ chars, as many as possible
  • _ - a _ char
  • (\\d{4}) - Group 1 (referred to via \1 from the replacement pattern): 4 digits
  • _.* - a _ and then any 0+ chars up to the end of string.
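If you would rather read the captured group directly instead of replacing the whole string, the same group can be pulled out with base R's regexec() and regmatches(). This is only a sketch of an alternative, not part of the original answer; the variable names m and year are made up for illustration.

```r
some_string <- "1_2_start_2007_3_end"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(some_string, regexec("_(\\d{4})_", some_string))

# m[[1]][1] is the full match ("_2007_"), m[[1]][2] is Group 1.
year <- m[[1]][2]
print(year)
```

Note that regexec() reports the first (leftmost) match in the string.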

NOTE: akrun's str_extract(some_string, "(?<=_)\\d{4}") will extract the leftmost occurrence, and my sub(".*_(\\d{4})_.*", "\\1", some_string) will extract the rightmost occurrence of a 4-digit substring enclosed with _. For my solution to return the leftmost one, use a lazy quantifier with the first .*: sub(".*?_(\\d{4})_.*", "\\1", some_string).

R test:

some_string <- "1_2018_start_2007_3_end"
sub(".*?_(\\d{4})_.*", "\\1", some_string) # leftmost
## -> 2018
sub(".*_(\\d{4})_.*", "\\1", some_string) # rightmost
## -> 2007
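
One thing to keep in mind with the sub approach: when the pattern does not match, sub returns the input string unchanged rather than NA. A minimal sketch of a guard is shown below; extract_year is a hypothetical helper name, and perl = TRUE is my own choice here so that the lazy quantifier is handled by the PCRE engine.

```r
# Hypothetical helper: return the leftmost 4-digit substring enclosed in "_",
# or NA when the string contains no such substring.
extract_year <- function(x) {
  pattern <- ".*?_(\\d{4})_.*"
  matched <- grepl(pattern, x, perl = TRUE)
  # sub() leaves non-matching strings as they are, so mask them with NA.
  ifelse(matched, sub(pattern, "\\1", x, perl = TRUE), NA_character_)
}

years <- extract_year(c("1_2018_start_2007_3_end", "no_year_here"))
print(years)
```

The function is vectorized, so it can be applied to a whole character vector of file names at once.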
Wiktor Stribiżew answered May 08 '23 21:05
