
Difference between [0-9] repeated n times and [0-9]{n} in R regex

Tags: regex, r

To the best of my knowledge the two forms should be equivalent, but I actually see a difference. Here is a minimal example taken from this question:

a<-c("/Cajon_Criolla_20141024","/Linon_20141115_20141130",
"/Cat/LIQUID",
"/c_puertas_20141206_20141107",
"/C_Puertas_3_20141017_20141018",
"/c_puertas_navidad_20141204_20141205")

sub("(.*?)_([0-9]{8})(.*)$","\\2",a)
[1] "20141024"    "20141130"    "/Cat/LIQUID" "20141107"    "20141018"
[6] "20141205"

sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"
[6] "20141204"

What am I missing? Or is this a bug?

cmbarbu asked Feb 25 '15 21:02


2 Answers

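This is a bug in the TRE library, the regex engine R uses by default when perl=TRUE is not set. It is related to how TRE handles greediness modifiers (such as the lazy .*? here) together with capture groups. See: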

  • SO question with similar issue
  • Issue #11 on TRE repo
  • Issue #21.
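
One way to sidestep the lazy quantifier entirely is to extract the first 8-digit run that follows an underscore directly. This is only a sketch under some assumptions: it relies on perl=TRUE (the lookbehind needs PCRE), the name first_date is made up for illustration, and unlike sub() it yields NA rather than the original string when nothing matches.

```r
a <- c("/Cajon_Criolla_20141024", "/Linon_20141115_20141130",
       "/Cat/LIQUID",
       "/c_puertas_20141206_20141107",
       "/C_Puertas_3_20141017_20141018",
       "/c_puertas_navidad_20141204_20141205")

# Leftmost run of 8 digits that is immediately preceded by "_".
# regexpr() returns -1 for elements without a match.
m <- regexpr("(?<=_)[0-9]{8}", a, perl = TRUE)
found <- m != -1L

# Start from all-NA so non-matching elements keep their position.
first_date <- rep(NA_character_, length(a))
first_date[found] <- regmatches(a, m)  # regmatches() returns matches only

print(first_date)
```

Because no lazy quantifier or capture group is involved, the result does not depend on how the pattern spells out the eight digits.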
BrodieG answered Oct 07 '22 02:10


Setting perl=TRUE gives the same answer (as expected) for both expressions:

> sub("(.*?)_([0-9]{8})(.*)$","\\2",a,perl=TRUE)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"    "20141204"   
> sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",a,perl=TRUE)
[1] "20141024"    "20141115"    "/Cat/LIQUID" "20141206"    "20141017"    "20141204"
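
The same comparison can be turned into a programmatic check for any repeat count n. This is a rough sketch, not a definitive test: the helper names pattern_braces and pattern_spelled are hypothetical, and it only exercises the PCRE engine (perl=TRUE).

```r
a <- c("/Cajon_Criolla_20141024", "/Linon_20141115_20141130",
       "/Cat/LIQUID",
       "/c_puertas_20141206_20141107",
       "/C_Puertas_3_20141017_20141018",
       "/c_puertas_navidad_20141204_20141205")

# Hypothetical helpers: build the {n} form and the spelled-out form.
pattern_braces <- function(n) {
  paste0("(.*?)_([0-9]{", n, "})(.*)$")
}
pattern_spelled <- function(n) {
  paste0("(.*?)_(", paste(rep("[0-9]", n), collapse = ""), ")(.*)$")
}

res_braces  <- sub(pattern_braces(8),  "\\2", a, perl = TRUE)
res_spelled <- sub(pattern_spelled(8), "\\2", a, perl = TRUE)

# Under PCRE the two spellings should give identical results.
same <- identical(res_braces, res_spelled)
print(same)
```

Dropping perl = TRUE from both calls runs the same check against the default engine, which is where the discrepancy in the question appears.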
Metrics answered Oct 07 '22 02:10