Complex lookahead/not-greedy regex

I have the following string:

(it is all on a single line)

<IMG SRC="/include/images/moredetails.png" WIDTH="8" HEIGHT="7" ONMOUSEOVER="return createPopup('<b>[scan_name@user:home]:</b> <!-- #EscapedName# --><br><b>[organization@user:home]:</b><br><!-- #EscapedOrganizationPath# --><br><b>[total@user:home]:</b> <!-- #EscapedTotal# --><br><b>[high@user:home]:</b> <!-- #EscapedHigh# --><br><b>[medium@user:home]:</b> <!-- #EscapedMedium# --><br><b>[low@user:home]:</b> <!-- #EscapedLow# --><br><b>[date_last_scanned@user:home]:</b> <!-- #EscapedDate# -->');" ONMOUSEOUT="return nd(1000);"><!-- #Name# --></TD>

And the second string:

<IMG SRC="/include/images/moredetails.png" WIDTH="8" HEIGHT="7" ONMOUSEOVER="return createPopup('<b>[scan_name@user:home]:</b> <!-- #EscapedName# --><br><b>조직/부서 경로:</b><br><!-- #EscapedOrganizationPath# --><br><b>[total@user:home]:</b> <!-- #EscapedTotal# --><br><b>[high@user:home]:</b> <!-- #EscapedHigh# --><br><b>[medium@user:home]:</b> <!-- #EscapedMedium# --><br><b>[low@user:home]:</b> <!-- #EscapedLow# --><br><b>[date_last_scanned@user:home]:</b> <!-- #EscapedDate# -->');" ONMOUSEOUT="return nd(1000);"><!-- #Name# --></TD>

From the first string, I want to find all the [..] placeholders and then find their Korean translations in the second string.

I wrote code that does this:

while($stringA =~ /(.*?)(\[[^\]]+?\])(.*?)/g) {
 my $prefix = $1;
 my $tag = $2;
 my $suffix = $3;

It then matches the second string against a regex built from $prefix and $suffix:

if ($stringB =~ /\Q$prefix\E(.*)\Q$suffix\E/g) {

NOTE: The examples copied below don't escape the " characters; I left them unescaped to make them easier to read.

Problems:

A. $prefix and $suffix don't contain everything before and after the placeholder, because I am using non-greedy matching. Example:

$prefix = "<IMG SRC="/include/images/moredetails.png" WIDTH="8" HEIGHT="7" ONMOUSEOVER="return createPopup('<b>"
$tag = "[scan_name@user:home]"
$suffix = ""

B. If I use greedy matching instead, (.*)(\[[^\]]+?\])(.*), I capture everything "correctly", but only the last tag gets caught. Example:

$prefix = "<IMG SRC="/include/images/moredetails.png" WIDTH="8" HEIGHT="7" ONMOUSEOVER="return createPopup('<b>[scan_name@user:home]:</b> <!-- #EscapedName# --><br><b>[organization@user:home]:</b><br><!-- #EscapedOrganizationPath# --><br><b>[total@user:home]:</b> <!-- #EscapedTotal# --><br><b>[high@user:home]:</b> <!-- #EscapedHigh# --><br><b>[medium@user:home]:</b> <!-- #EscapedMedium# --><br><b>[low@user:home]:</b> <!-- #EscapedLow# --><br><b>"
$tag = "[date_last_scanned@user:home]"
$suffix = ":</b> <!-- #EscapedDate# -->');" ONMOUSEOUT="return nd(1000);"><!-- #Name# --></TD>"
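
The two behaviours described in A and B can be reproduced on a much shorter input. The following is a minimal sketch; the string `aa[one]bb[two]cc` is a made-up stand-in for the real HTML, and the array names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical shortened input standing in for the real HTML string.
my $s = 'aa[one]bb[two]cc';

# Non-greedy: the trailing (.*?) is satisfied by the empty string,
# so the suffix is always empty and each prefix starts where the
# previous match ended.
my @lazy;
while ($s =~ /(.*?)(\[[^\]]+?\])(.*?)/g) {
    push @lazy, [$1, $2, $3];
}

# Greedy: the leading (.*) swallows everything up to the last tag,
# so the loop body runs only once.
my @greedy;
while ($s =~ /(.*)(\[[^\]]+?\])(.*)/g) {
    push @greedy, [$1, $2, $3];
}

printf "lazy:   prefix='%s' tag='%s' suffix='%s'\n", @$_ for @lazy;
printf "greedy: prefix='%s' tag='%s' suffix='%s'\n", @$_ for @greedy;
```

The non-greedy loop finds both tags but with empty suffixes, while the greedy loop finds only `[two]`, with `aa[one]bb` as its prefix and `cc` as its suffix.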

What I want

I want to capture all the tags and be able to compare each one to the translated string, returning something like:

'[state@user:home]' = '상태'

Thank you for your help

Noam Rathaus asked Nov 02 '22 08:11


1 Answer

How about:

my $strA = q~<IMG SRC="/include/images/moredetails.png" WIDTH="8" HEIGHT="7" ONMOUSEOVER="return createPopup('<b>[scan_name@user:home]:</b> <!-- #EscapedName# --><br><b>[organization@user:home]:</b><br><!-- #EscapedOrganizationPath# --><br><b>[total@user:home]:</b> <!-- #EscapedTotal# --><br><b>[high@user:home]:</b> <!-- #EscapedHigh# --><br><b>[medium@user:home]:</b> <!-- #EscapedMedium# --><br><b>[low@user:home]:</b> <!-- #EscapedLow# --><br><b>[date_last_scanned@user:home]:</b> <!-- #EscapedDate# -->');" ONMOUSEOUT="return nd(1000);"><!-- #Name# --></TD>~;
my $strB = q~<IMG SRC="/include/images/moredetails.png" WIDTH="8" HEIGHT="7" ONMOUSEOVER="return createPopup('<b>[scan_name@user:home]:</b> <!-- #EscapedName# --><br><b>조직/부서 경로:</b><br><!-- #EscapedOrganizationPath# --><br><b>[total@user:home]:</b> <!-- #EscapedTotal# --><br><b>[high@user:home]:</b> <!-- #EscapedHigh# --><br><b>[medium@user:home]:</b> <!-- #EscapedMedium# --><br><b>[low@user:home]:</b> <!-- #EscapedLow# --><br><b>[date_last_scanned@user:home]:</b> <!-- #EscapedDate# -->');" ONMOUSEOUT="return nd(1000);"><!-- #Name# --></TD>~;

while($strA =~ /(.*?)\[([^\]]+?)\](.)/g) {
    my $prefix = $1;
    my $tag = $2;
    my $suffix = $3;
    print "prefix=$prefix\ntag=$tag\nsuffix=$suffix\n";
    print "found it $1\n\n" if ($strB =~ /\Q$prefix\E\[?([^\[\]]+)\]?\Q$suffix\E/g);
}

If you want a longer suffix to avoid overlap, you can use this:

while($strA =~ /(.*?)\[([^\]]+?)\]([^\[]*)/g) {

Output of the first version (single-character suffix):

prefix=<IMG SRC="/include/images/moredetails.png" WIDTH="8" HEIGHT="7" ONMOUSEOVER="return createPopup('<b>
tag=scan_name@user:home
suffix=:
found it scan_name@user:home

prefix=</b> <!-- #EscapedName# --><br><b>
tag=organization@user:home
suffix=:
found it 조직/부서 경로

prefix=</b><br><!-- #EscapedOrganizationPath# --><br><b>
tag=total@user:home
suffix=:
found it total@user:home

prefix=</b> <!-- #EscapedTotal# --><br><b>
tag=high@user:home
suffix=:
found it high@user:home

prefix=</b> <!-- #EscapedHigh# --><br><b>
tag=medium@user:home
suffix=:
found it medium@user:home

prefix=</b> <!-- #EscapedMedium# --><br><b>
tag=low@user:home
suffix=:
found it low@user:home

prefix=</b> <!-- #EscapedLow# --><br><b>
tag=date_last_scanned@user:home
suffix=:
found it date_last_scanned@user:home
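
To get output in the `'[tag]' = 'translation'` form the question asks for, the same loop can collect its results into a hash. This is a minimal sketch, not a drop-in solution: `extract_translations` is a hypothetical helper name, the two strings are shortened stand-ins for the real ones, and it assumes both strings are identical apart from the translated placeholders.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Shortened stand-ins for the two strings in the question.
my $strA = q~<b>[scan_name@user:home]:</b> <!-- #EscapedName# --><br><b>[organization@user:home]:</b><br><!-- #EscapedOrganizationPath# -->~;
my $strB = q~<b>[scan_name@user:home]:</b> <!-- #EscapedName# --><br><b>조직/부서 경로:</b><br><!-- #EscapedOrganizationPath# -->~;

# Hypothetical helper: returns a hash ref of placeholder => translated text,
# keeping only placeholders whose text differs in the translated string.
sub extract_translations {
    my ($template, $translated) = @_;
    my %found;
    # Longer-suffix variant: the suffix runs up to the next '['.
    while ($template =~ /(.*?)\[([^\]]+?)\]([^\[]*)/g) {
        my ($prefix, $tag, $suffix) = ($1, $2, $3);
        # /g in scalar context resumes from the previous match position.
        next unless $translated =~ /\Q$prefix\E\[?([^\[\]]+)\]?\Q$suffix\E/g;
        $found{"[$tag]"} = $1 if $1 ne $tag;
    }
    return \%found;
}

my $translations = extract_translations($strA, $strB);
printf "'%s' = '%s'\n", $_, $translations->{$_} for sort keys %$translations;
```

With these inputs the only entry printed is `'[organization@user:home]' = '조직/부서 경로'`, because `[scan_name@user:home]` is unchanged in the second string and is therefore skipped.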

Toto answered Nov 13 '22 07:11