Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Non-greedy Pattern Expression

Tags:

regex

sed

I've been reading "Mastering Regular Expressions" by Friedl and trying to devise a common non-greedy pattern expression for a string that is delimited by a word. Starting from basics where the delimited word is just a single character 'a' the expression:

sed -r 's/([^a]*)(a)/\                                                                  
(1)\1(2)\2(ALL)&(END)/g' <<<"xaxxaxxxaxxx...aa..."

(1)x(2)a(ALL)xa(END)
(1)xx(2)a(ALL)xxa(END)
(1)xxx(2)a(ALL)xxxa(END)
(1)xxx...(2)a(ALL)xxx...a(END)
(1)(2)a(ALL)a(END)...

from which the pattern (with reference to Friedl) might be:

  • [ normal* closing ]

Moving on to a real multi-character 'ab' delimiter:

sed -r 's/([^a]*)((a[^b]*)*)(ab)/\                          
(1)\1(2)\2(3)\3(4)\4(ALL)&(END)/g' <<<"xabxxabxxxabxxx...abxxx...aabxxx...axxx...aaabxaabaxabaxaxabxaxaabxxaaabaaxxab..."

(1)x(2)(3)(4)ab(ALL)xab(END)
(1)xx(2)(3)(4)ab(ALL)xxab(END)
(1)xxx(2)(3)(4)ab(ALL)xxxab(END)
(1)xxx...(2)(3)(4)ab(ALL)xxx...ab(END)
(1)xxx...(2)a(3)a(4)ab(ALL)xxx...aab(END)
(1)xxx...(2)axxx...aa(3)axxx...aa(4)ab(ALL)xxx...axxx...aaab(END)
(1)x(2)a(3)a(4)ab(ALL)xaab(END)
(1)(2)ax(3)ax(4)ab(ALL)axab(END)
(1)(2)axax(3)axax(4)ab(ALL)axaxab(END)
(1)x(2)axa(3)axa(4)ab(ALL)xaxaab(END)
(1)xx(2)aa(3)aa(4)ab(ALL)xxaaab(END)
(1)(2)aaxx(3)aaxx(4)ab(ALL)aaxxab(END)...

from which the pattern might be:

  • [ normal* (special*)* closing ]

For the subsequent 'abc' delimiter the special expression can be extended to:

(a[^b]*)*(ab[^c]*)*
  1. Is this correct?
  2. Can it be proved?
  3. Can the special expression be simplified?
  4. Are there better/more efficient expressions for this? n.b. I'm not using perl's non-greedy '*?' operator and avoiding alternation.
  5. Where might I find reference material to this type of problem (Friedl alluded but stopped short of a published solution).
like image 575
potong Avatar asked Oct 23 '11 22:10

potong


1 Answers

  1. Yes, it looks correct.
  2. You want to read about finite automata — nondeterministic (NFA) and deterministic (DFA). Simple regexp systems are essentially a handy notation for finite automata. Any good book on compilers will have a chapter covering NFA and DFA.
  3. Probably not, or not much. The longer your word, the more backtracks you have to allow for.
like image 125
Anton Tykhyy Avatar answered Nov 16 '22 01:11

Anton Tykhyy