Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why was this regex calling substcont an excessive number of times?

This is more out of curiosity than anything else, as I'm failing to find any useful info on Google about this function (CORE::substcont)

In profiling and optimising some old, slow, XML parsing code I've found that the following regex is calling substcont 31 times for each time the line is executed, and taking a huge amount of time:

Calls: 10000 Time: 2.65s Sub calls: 320000 Time in subs: 1.15s`

  $handle =~s/(>)\s*(<)/$1\n$2/g;
  # spent  1.09s making 310000 calls to main::CORE:substcont, avg 4µs/call
  # spent  58.8ms making  10000 calls to main::CORE:subst, avg 6µs/call

Compared to the immediately preceding line:

Calls: 10000 Time: 371ms Sub calls: 30000 Time in subs: 221ms

  $handle =~s/(.*)\s*(<\?)/$1\n$2/g;
    # spent   136ms making 10000 calls to main::CORE:subst, avg 14µs/call
    # spent  84.6ms making 20000 calls to main::CORE:substcont, avg 4µs/call

The number of substcont calls is quite surprising, especially seeing as I would've thought that the second regex would be more expensive. This is, obviously, why profiling is a Good Thing ;-)

I've subsequently changed both these line to remove the unneccessary backrefs, with dramatic results for the badly-behaving line:

Calls:10000 Time: 393ms Sub calls: 10000 Time in subs: 341ms

$handle =~s/>\s*</>\n</g;
  # spent   341ms making 10000 calls to main::CORE:subst, avg 34µs/call
  • So, my question is - why should the original have been making SO many calls to substcont, and what does substcont even do in the regex engine that takes so long?
like image 603
paulw1128 Avatar asked May 24 '10 16:05

paulw1128


1 Answers

substcont is Perl's internal name for the "substitution iterator". Something to do with s///. Based on what little information I have, it seems substcont is triggered when doing a backref. That is, when $1 is present. You can play with it a bit using B::Concise.

Here's the opcodes of a simple regex without a backref.

$ perl -MO=Concise,-exec -we'$foo = "foo";  $foo =~ s/(foo)/bar/ig'
1  <0> enter 
2  <;> nextstate(main 1 -e:1) v:{
3  <$> const[PV "foo"] s
4  <#> gvsv[*foo] s
5  <2> sassign vKS/2
6  <;> nextstate(main 1 -e:1) v:{
7  <#> gvsv[*foo] s
8  <$> const[PV "bar"] s
9  </> subst(/"(foo)"/) vKS
a  <@> leave[1 ref] vKP/REFC
-e syntax OK

And one with.

$ perl -MO=Concise,-exec -we'$foo = "foo";  $foo =~ s/(foo)/$1/ig'
1  <0> enter 
2  <;> nextstate(main 1 -e:1) v:{
3  <$> const[PV "foo"] s
4  <#> gvsv[*foo] s
5  <2> sassign vKS/2
6  <;> nextstate(main 1 -e:1) v:{
7  <#> gvsv[*foo] s
8  </> subst(/"(foo)"/ replstart->9) vKS
9      <#> gvsv[*1] s
a      <|> substcont(other->8) sK/1
b  <@> leave[1 ref] vKP/REFC
-e syntax OK

That's all I can offer. You may want to try Rx, mjd's old regex debugger.

like image 164
Schwern Avatar answered Sep 23 '22 12:09

Schwern