
Align string to a pattern in Perl?

Tags: string, regex, perl

I have chunks of strings within square brackets, like this:

[p1 text1/label1] [p2 text2/label2] [p3 text3/label3] [...

and so on.

What's inside each chunk isn't important. But sometimes there are stray chunks of text that are NOT surrounded by square brackets. For example:

[p1 text1/label1] [p2 text2/label2] textX/labelX  [p3 text3/label3] [...] textY/labelY textZ/labelZ [...]

I thought I had solved this with a regex in Perl until I realized that I had only catered for cases where a single stray chunk appears at the beginning, in the middle, or at the end of the text, but not for cases where two or more stray chunks appear together (like the Y and Z chunks above).

So do regular expressions in Perl only catch the first match of a pattern? How could the above problem be solved then?

Edit:

The goal is to ensure that every phrase is surrounded by brackets. Square brackets are never nested. When surrounding a phrase with brackets, the p-value depends on the "label" value. For example, if a stray unbracketed phrase is

li/IN

then it should turn into:

[PP li/IN]

I guess it is a mix but the only way I can think of solving the bigger problem I'm working on is to turn all of them into bracketed phrases, so the handling is easier. So I've got it working if an unbracketed phrase happens at the beginning, middle and end, but not if two or more happen together.

I basically used a different regex for each position (beginning, middle and end). The one that catches an unbracketed phrase in the middle looks like this:

$data =~ s/\] (text)#\/label \[/\] \[selected-p-value $1#\/label\] \[/g;

So what I'm doing is just noticing that if a ] comes before and after the text/label pattern, then this one doesn't have brackets. I do something similar for the others too. But I guess this is incredibly un-generic. My regex isn't great!

user961627 asked Nov 17 '11 13:11


3 Answers

#!/usr/bin/perl

use strict;
use warnings;

my $string = "[p1 text1/label1] [p2 text2/label2] textX/labelX  [p3 text3/label3] [...] textY/labelY textZ/labelZ [...]";

# don't split inside the [], i.e. not at blanks that have p\d in front of them
my @items = split(/(?<!p\d)\s+/, $string);
my @new_items;

# modify the items that are not inside []
@new_items = map { ($_ =~ m/\[/) ? $_ :
                    ((split("/",$_))[1] eq ("IN")) ? "[PP $_]" :
                    "[BLA $_]";
                 } @items;

print join(' ', @new_items), "\n";

Before the edit below (when every stray item was simply wrapped in [PP ...]), this gave

[p1 text1/label1] [p2 text2/label2] [PP textX/labelX] [p3 text3/label3] [...] [PP textY/labelY] [PP textZ/labelZ] [...]

I took it that PP was meant as I used it here, otherwise the map will have to become a bit more elaborate.

EDIT

I have edited the code in response to your comment. If you use

"[p1 text1/label1] [p2 text2/label2] textX/IN  [p3 text3/label3] [...] textY/labelY textZ/labelZ [...]";

as a sample string, this is the output:

[p1 text1/label1] [p2 text2/label2] [PP textX/IN] [p3 text3/label3] [...] [BLA textY/labelY] [BLA textZ/labelZ] [...]

Just one thing to bear in mind: The regex used with split will not work for pn with n > 9. If you have such cases, best look for an alternative, because variable length lookbehinds have not been implemented (or at least in my version of Perl (5.10.1) they haven't).
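One possible alternative that avoids the lookbehind altogether is to collect the tokens with a global match instead of split. The following is only a sketch under the same assumptions as above (no nested brackets, labels after a slash, BLA as a placeholder tag); the sub name bracket_strays and the sample string with p10 are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# bracket_strays is a hypothetical helper name, not part of the original script.
sub bracket_strays {
    my ($string) = @_;

    # A token is either one complete [...] chunk or a run of non-whitespace characters.
    my @items = $string =~ /(\[[^\]]*\]|\S+)/g;

    my @new_items = map {
        my $label = (split m{/}, $_)[1] // '';   # text after the first slash, if any
        /^\[/            ? $_                    # already bracketed, keep as is
        : $label eq 'IN' ? "[PP $_]"
        :                  "[BLA $_]";           # BLA is a placeholder tag
    } @items;

    return join ' ', @new_items;
}

# Made-up sample with a two-digit p-number, which the lookbehind split cannot handle.
print bracket_strays("[p10 text1/label1] textX/IN [p2 text2/label2] textY/labelY textZ/labelZ"), "\n";
```

Because the bracketed alternative is tried first, the blanks inside [...] never act as separators, whatever the chunk starts with. The defined-or operator // needs Perl 5.10 or later.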

EDIT 2

As a reply to your second comment, here's a modified version of the script. You will find that I also added something to the sample string to demonstrate that it now works even if there's no pn inside the [...].

#!/usr/bin/perl

use strict;
use warnings;

my $string = "[p1 text1/label1] [p2 text2/label2] textX/IN  [p3 text3/label3] [...] textY/labelY textZ/labelZ [...] xyx/IN [opq rs/abc]";

# we're using a non-greedy match to only capture the contents of one set of [], 
# otherwise we'd simply match everything between the first [ and the last ].
# The parentheses around the match ensure that our delimiter is KEPT.
my @items = split(/(\[.+?\])/, $string);

#print "..$_--\n" for @items;  # uncomment this to see what the split result looks like

# modify the items that are not inside []
my @new_items = map {
                     if (/^\[/) { # items in []
                        $_;
                     }
                     elsif (/(?: \w)|(?:\w )/) { # an arbitrary number of items without []
                       my @new =  map { ($_ =~ m/\[/) ? $_ :
                                        ((split("/",$_))[1] eq ("IN")) ? "[PP $_]" :
                                        "[BLA $_]";
                                      } split;
                     }
                     else { # some items are empty or whitespace only; return an empty list to discard them
                        ();
                     }
                    } @items;

print join(' ', @new_items), "\n";

The output is this:

[p1 text1/label1] [p2 text2/label2] [PP textX/IN] [p3 text3/label3] [...] [BLA textY/labelY] [BLA textZ/labelZ] [...] [PP xyx/IN] [opq rs/abc]

I noticed you already received the help you required, but I thought I could answer your question all the same...

canavanin answered Nov 15 '22 00:11


Actually you can solve this using "only" regex:

#!/usr/bin/perl

use strict;
use warnings;

$_ = "[p1 text1/label1] [p2 text2/label2] textX/labelX  [p3 text3/label3] [...] textY/labelY textZ/labelZ [...]";

s{ ([^\s[]+)|(\[(?:[^[]*)\])     }
 { if( defined $2){ $2 } elsif(defined $1)
    { 
       if($1 =~ m!(.*(?<=/)(.*))!)
       {
         if($2 eq 'labelX')
         {
            "[PP $1]";
         }
         elsif($2 eq 'labelY')
         {
            "[BLA $1]";
         }
         elsif($2 eq 'labelZ')
         {
            "[FOO $1]";
         }
         else
         {
            $1;   # unknown label: keep the text unchanged instead of replacing it with nothing
         }
       }
    }
 }xge;

 print;

Output:

[p1 text1/label1] [p2 text2/label2] [PP textX/labelX]  [p3 text3/label3] [...] [BLA textY/labelY] [FOO textZ/labelZ] [...]
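If there are many labels, the if/elsif chain could be swapped for a hash lookup. This is only a sketch: the table %tag_for and the subs wrap_token and bracket_strays are hypothetical names, and apart from IN => PP (taken from the question) the label-to-tag pairs are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical label-to-tag table; only IN => PP comes from the question.
my %tag_for = (IN => 'PP', labelY => 'BLA', labelZ => 'FOO');

# Wrap a stray token in brackets if its label (the part after the last slash) is known.
sub wrap_token {
    my ($token) = @_;
    my ($label) = $token =~ m{/([^/]*)\z};
    my $tag = defined $label ? $tag_for{$label} : undef;
    return defined $tag ? "[$tag $token]" : $token;   # unknown label: leave unchanged
}

sub bracket_strays {
    my ($text) = @_;
    # Alternative 1 matches a complete [...] chunk and keeps it as is;
    # alternative 2 matches a stray token and hands it to wrap_token.
    $text =~ s{ (\[ [^\]]* \]) | ([^\s\[\]]+) }{ defined $1 ? $1 : wrap_token($2) }xge;
    return $text;
}

print bracket_strays("[p1 text1/label1] li/IN [...] textY/labelY textZ/labelZ other/XX"), "\n";
```

With this layout, supporting a new label only means adding one entry to the table, and tokens with an unlisted label pass through untouched.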
FailedDev answered Nov 15 '22 01:11


You have not shared your regular expression, but you should use the g modifier for a global replace. Otherwise a Perl substitution only replaces the first match.

my $teststring = "hello world";

$teststring =~ s/o/X/;

turns the string into hellX world, but

$teststring =~ s/o/X/g;

turns it into hellX wXrld, replacing all matches.

I think your problem is something like

my $teststring = ' A B C ';

$teststring =~ s/\s(\w)\s/ [$1] /ig;

yields [A] B [C]. B is skipped because, while matching A, the regex engine also consumed the space after A, so in the remaining string there is no space before B and it doesn't match.

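But if you make the trailing whitespace optional and non-greedy, so that it is no longer consumed, like so

$teststring =~ s/\s(\w)\s*?/ [$1] /ig;

it yields [A] [B] [C] (with doubled spaces between the items, because each replacement adds a trailing space that the match no longer removes).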
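Another way to keep the trailing space out of the match is a lookahead, which checks for the space without consuming it. A minimal sketch using the same test string (the variable $bracketed is just an illustrative name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $teststring = ' A B C ';

# (?=\s) only checks that whitespace follows; it does not consume it,
# so the same space can begin the next match.
(my $bracketed = $teststring) =~ s/\s(\w)(?=\s)/ [$1]/g;

print "'$bracketed'\n";   # prints ' [A] [B] [C] '
```

Since the space after each letter stays in the string, every letter is matched and the original spacing is preserved.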

parapura rajkumar answered Nov 15 '22 01:11