
How to stop .+ at the first instance of a character and not the last with regular expressions in perl?

Tags: regex, perl

I want to replace:

'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''

With:

='''<font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font>'''=

Now my existing code is:

$html =~ s/\n(.+)<font size=\".+?\">(.+)<\/font>(.+)\n/\n=$1$2$3=\n/gm

However, this produces the following result:

=''' SUMMER/WINTER CONFIGURATION FILES</font>'''=

I can see what is happening: the pattern matches from <font size=" all the way to the end of <font color="blue">, which is not what I want. I want it to stop at the first instance of " rather than the last. I thought that adding the ? would do that, but I have tried .+, .+?, .* and .*? with the same result each time.

Anyone got any ideas what I am doing wrong?

rollsch asked Dec 21 '10 03:12


2 Answers

Use .+? in every position so that each quantifier is non-greedy:

$html =~ s/\n(.+?)<font size=\".+?\">(.+?)<\/font>(.+?)\n/\n=$1$2$3=\n/gm
                ^                ^      ^            ^

Also try to avoid using regular expressions to parse HTML. Use an HTML parser if possible.
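
To make the difference between the greedy and non-greedy forms concrete, here is a minimal sketch that runs both on the sample line from the question. The helper name convert and the variable names are illustrative assumptions, not part of the original code, and the input is wrapped in newlines only because the pattern anchors on \n.

```perl
use strict;
use warnings;

# Sample line from the question, wrapped in newlines because the pattern anchors on \n.
my $line = qq{\n'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''\n};

# Hypothetical helper: apply one pattern with the replacement used in the question.
sub convert {
    my ($html, $re) = @_;
    $html =~ s/$re/\n=$1$2$3=\n/g;
    return $html;
}

# Greedy .+ in the size attribute runs on to the last "> on the line,
# so it swallows the <font color="blue"> tag as well.
my $greedy = convert($line, qr{\n(.+)<font size=".+">(.+)</font>(.+)\n});

# Non-greedy .+? stops at the earliest point that still lets the whole pattern match.
my $lazy = convert($line, qr{\n(.+?)<font size=".+?">(.+?)</font>(.+?)\n});

print "greedy:$greedy";
print "lazy:$lazy";
```

On this input the greedy variant loses the color tag, while the non-greedy variant keeps it and drops only the size tag.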

Mark Byers answered Oct 01 '22 01:10


You could change .+ to [^"]+ (instead of "match anything", it means "match anything that isn't a double quote").
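
As a rough sketch of that suggestion, assuming the same sample line as in the question (the variable names here are illustrative), the negated character class can be dropped into the size attribute while the other groups stay greedy:

```perl
use strict;
use warnings;

# Sample line from the question, wrapped in newlines because the pattern anchors on \n.
my $input = qq{\n'''<font size="3"><font color="blue"> SUMMER/WINTER CONFIGURATION FILES</font></font>'''\n};

# [^"]+ cannot cross a double quote, so the size attribute ends at its own
# closing quote even though the quantifier itself is still greedy.
(my $fixed = $input) =~ s{\n(.+)<font size="[^"]+">(.+)</font>(.+)\n}{\n=$1$2$3=\n}g;

print $fixed;
```

The character class states the intent directly (stay inside the quoted attribute value), so the match does not depend on how the rest of the pattern backtracks.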

Jon answered Oct 01 '22 02:10