How do you implement non-greedy matching in Stata using regex? Or does Stata even have this capability?
I want to extract all text that occurs between a hash sign "#" and the first period "." that follows it.
Example code:
clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#(.*)\.")
list
But in Stata (v. 13.1), I can't seem to use the non-greedy pattern #(.*?)\. in place of the greedy one. The code above therefore gives this:
+--------------------------------------------------+
|                          var1               var2 |
|--------------------------------------------------|
| anything#aaabbbccc.dddeee.fff   aaabbbccc.dddeee |
|     anything#aaabbbccc.dddeee          aaabbbccc |
|           anything#aaabbbccc.          aaabbbccc |
+--------------------------------------------------+
But what I want is this:
+--------------------------------------------------+
|                          var1               var2 |
|--------------------------------------------------|
| anything#aaabbbccc.dddeee.fff          aaabbbccc |
|     anything#aaabbbccc.dddeee          aaabbbccc |
|           anything#aaabbbccc.          aaabbbccc |
+--------------------------------------------------+
One substitute for #(.*?)\. is to match only non-period characters after the hash sign, so the match stops by itself at the first period. Note that this version does not require a period to follow at all. The pattern is:
#([^.]*)
Try this code:
clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#([^.]*)")
list
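If upgrading is an option, the lazy quantifier itself becomes available: Stata 14 and later add Unicode regex functions, ustrregexm() and ustrregexs(), which are built on the ICU engine and accept *?. A minimal sketch, assuming Stata 14 or newer (so it does not apply to the 13.1 mentioned in the question):

```stata
clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
* assumes Stata 14+: the ICU-based functions accept the lazy quantifier *?
generate var2=ustrregexs(1) if ustrregexm(var1,"#(.*?)\.")
list
```

Unlike #([^.]*), this pattern still requires a literal period after the captured text.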
Once many programmers have learned about regular expressions, they are reluctant to look elsewhere in string management, and with good reason.
This is just to point out that for the problem given, and many others too, there is a pedestrian alternative:
clear
set obs 3
generate var1="anything#aaabbbccc.dddeee.fff" in 1
replace var1="anything#aaabbbccc.dddeee" in 2
replace var1="anything#aaabbbccc." in 3
generate var2=regexs(1) if regexm(var1,"#([^.]*)")
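* position of the first character after the hash sign
gen where1 = strpos(var1, "#") + 1
* position of the first period anywhere in the string
gen where2 = strpos(var1, ".")
* extract the where2 - where1 characters that lie between them
gen var3 = substr(var1, where1, where2 - where1)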
list
     +-------------------------------------------------------------------------+
     |                          var1        var2   where1   where2        var3 |
     |-------------------------------------------------------------------------|
  1. | anything#aaabbbccc.dddeee.fff   aaabbbccc       10       19   aaabbbccc |
  2. |     anything#aaabbbccc.dddeee   aaabbbccc       10       19   aaabbbccc |
  3. |           anything#aaabbbccc.   aaabbbccc       10       19   aaabbbccc |
     +-------------------------------------------------------------------------+
Find the positions of the start and end of the substring you want, and extract what lies between. This is resolutely lacking in style, but sometimes gets you there faster. Always remember to account for programmer time in working out the regular expression you need.
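One caveat with the positional approach: strpos(var1, ".") finds the first period anywhere in the string, so it assumes no period occurs before the hash sign. A hedged variant that searches for the period only in the text after the hash could look like this. The second observation and the helper variable rest are made up for illustration and are not part of the original example:

```stata
clear
set obs 2
generate var1="anything#aaabbbccc.dddeee.fff" in 1
* hypothetical row with a period before the hash sign
replace var1="v1.2#aaabbbccc.dddeee" in 2
* keep only what follows the first hash sign (n2 = . means "to the end")
gen rest = substr(var1, strpos(var1, "#") + 1, .) if strpos(var1, "#") > 0
* take everything before the first period in that remainder
gen var3 = substr(rest, 1, strpos(rest, ".") - 1) if strpos(rest, ".") > 0
list
```

The if qualifiers leave var3 empty when a value has no hash sign, or no period after it, rather than extracting a misleading substring.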