retrieve patterns that exactly match all regex in Perl

Tags: regex, match, perl

I have a database of subgraphs that looks like this:

t # 3-231, 1
v 0 94
v 1 14
v 2 16
v 3 17
u 0 1 2
u 0 2 2
u 0 3 2
t # 3-232, 1
v 0 14
v 1 94
v 2 19
v 3 91
u 0 1 2
u 0 3 2
u 1 2 2
t # 3-233, 1
v 0 17
v 1 91
v 2 16
v 3 94
u 0 1 2
u 0 3 2
u 1 2 2
t # 3-234, 1
v 0 90
v 1 93
v 2 102
v 3 95
u 0 1 2
u 0 3 2
u 1 2 2

I would like to retrieve all transactions that contain both of the following patterns, 'u 0 1 2' and 'u 0 2 2', along with the transaction id (i.e. the line that starts with t #).

I used the following code to accomplish this job:

#!/usr/bin/perl -w

use strict;

my $input = shift @ARGV or die $!; 

open (FILE, "$input") or die $!;

while (<FILE>) {

my @fields = ('t', 'u\ 0\ 1', 'u\ 0\ 2');  
my $field_regex = join( "|", @fields );
my @field_lines;

    push( @field_lines, $_ ) if ( /^(?:$field_regex) / );
    last if @field_lines == @fields;

push @field_lines, "";

print join( "\n", sort @field_lines );
}

close FILE;

However, it also retrieves transactions in which only one of the lines matches, such as:

t # 3-231, 1
u 0 1 2
u 0 2 2
t # 3-232, 1
u 0 1 2
t # 3-233, 1
u 0 1 2
t # 3-234, 1
u 0 1 2

My ultimate goal is to retrieve only the transactions that match all of my patterns, such as:

t # 3-231, 1
u 0 1 2
u 0 2 2

Thank you for your help!

Olha

asked Aug 14 '20 00:08

Olha Kholod

2 Answers

One way: keep the current transaction-id on hand, and store lines of interest in an arrayref associated with that transaction-id key in a hash.

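use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);

my @fields = ('u 0 1', 'u 0 2');
# Escape each pattern, then join them into a single alternation
my $field_regex = join '|', map { quotemeta } @fields;

my (%trans, $tid);

while (<>) {
    chomp;
    if (/^t #/) {       # a transaction-id line starts a new group
        $tid = $_;
        next;
    }

    # Store the line under the current id if it starts with one of the patterns
    push @{$trans{$tid}}, $_  if defined $tid and /^(?:$field_regex)\b/;
}

dd \%trans;

# foreach my $tid (sort keys %trans) {
#     say $tid;
#     say for @{$trans{$tid}};
# }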

For simplicity I use while (<>), which reads line by line from all files given on the command line when the program is invoked (or from STDIN if none are given). I use Data::Dump to show a complex data structure; Data::Dumper, which is in the core, can be used for that as well.

The quotemeta function escapes all ASCII non-"word" characters, which could otherwise throw off the regex; this includes spaces.
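As a small illustration of the point about quotemeta, the sketch below prints the escaped forms of the two patterns and shows why escaping matters for a string containing a dot. The '3.1' string is only an example value and is not part of the question's data.

```perl
use strict;
use warnings;

my @fields = ('u 0 1', 'u 0 2');

# quotemeta backslash-escapes every ASCII character that is not a letter,
# digit, or underscore, so each space becomes "\ ".
my @escaped = map { quotemeta } @fields;
print "$_\n" for @escaped;

# Example value (an assumption for illustration): an unescaped "." would
# match any character, while the escaped one matches only a literal dot.
my $literal = quotemeta('3.1');
print "3.1 matches\n"        if     '3.1' =~ /^$literal$/;
print "3x1 does not match\n" unless '3x1' =~ /^$literal$/;
```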

The program above in general loses the order of transaction ids from the file, since hash keys are unordered, while it keeps the order of lines for each id since those are stored in an array. This is not hard to remedy if needed.

Tested only with the provided data file.
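The question asks for only those transactions that contain every pattern, and the ordering of ids may also matter. A minimal sketch of one way to get both is below. It assumes that the wanted patterns are complete, distinct lines ('u 0 1 2' and 'u 0 2 2', as in the question) and that ids should come out in file order. The helper name full_matches and the shortened inline data are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: takes an arrayref of input lines and a list of
# distinct wanted lines; returns one arrayref [ $tid, @wanted ] for every
# transaction that contains all wanted lines, in order of first appearance.
sub full_matches {
    my ($lines, @wanted) = @_;
    my %is_wanted = map { $_ => 1 } @wanted;

    my (@order, %hits, $tid);
    for my $line (@$lines) {
        (my $text = $line) =~ s/\s+\z//;    # strip newline and trailing blanks
        if ($text =~ /^t #/) {              # a new transaction starts here
            $tid = $text;
            push @order, $tid unless exists $hits{$tid};
            $hits{$tid} ||= {};
            next;
        }
        next unless defined $tid;           # skip anything before the first id
        $hits{$tid}{$text} = 1 if $is_wanted{$text};
    }

    # Keep only the ids for which every wanted line was seen.
    return map  { [ $_, @wanted ] }
           grep { keys %{ $hits{$_} } == @wanted } @order;
}

# Shortened sample data (first two transactions of the question's file).
my @data = split /^/, <<'END';
t # 3-231, 1
v 0 94
u 0 1 2
u 0 2 2
u 0 3 2
t # 3-232, 1
v 0 14
u 0 1 2
u 0 3 2
u 1 2 2
END

for my $match (full_matches(\@data, 'u 0 1 2', 'u 0 2 2')) {
    print "$_\n" for @$match;
}
```

With this sample only the first transaction has both lines, so only it is printed. To read a real file, the lines could be collected with my @data = <>; instead of the inline sample.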

answered Nov 29 '22 04:11

zdim

This type of pattern is most easily handled by treating the file contents as blocks rather than lines.

Here is an easy example (with your data):

use strict;
use warnings;

my $big_string;
my $trans;
my $block;

open my $fh, '<', '/tmp/file.txt' or die "Can't open file: $!";

# Slurp the whole file into one string
$big_string = do { local $/; <$fh> };

# Each match is one block: a "t #" line and everything up to the next one
while ($big_string=~/^(t\h*#\h*[0-9,\h-]+[\s\S]*?(?=(?:^t\h*#\h*[0-9,\h-]+)|\z))/mg) {
    $block=$1;
    $trans=$1 if $block=~/^(t\h*#\h*[0-9,\h-]+)/;
    if ($block=~/^(u\h+0\h+[12]\h+2)/m){
        print "$trans\n";
        # In list context the match returns every capture; print each via $_
        # ($1 would hold only the last capture at this point)
        for ($block=~/^(u\h+0\h+[12]\h+2)/mg) {
            print "$_\n";
        }
    }
}

Prints:

t # 3-231, 1
u 0 1 2
u 0 2 2
t # 3-232, 1
u 0 1 2
t # 3-233, 1
u 0 1 2
t # 3-234, 1
u 0 1 2

This assumes that your data fits easily into memory. If not, there are many ways to read a file block by block as well.
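One possible way to read block by block, sketched here under the assumption that every block starts with a line matching /^t\h*#/, is a small iterator that holds on to the header line it has read ahead. The name block_reader and the in-memory sample are made up for this illustration; a real file handle works the same way.

```perl
use strict;
use warnings;

# Hypothetical iterator: returns a closure that yields one transaction block
# per call (the "t #" line plus all lines up to the next "t #" line),
# and undef once the input is exhausted.
sub block_reader {
    my ($fh) = @_;
    my $pending;                      # header line read ahead of its block
    return sub {
        my $block = $pending;
        undef $pending;
        while (my $line = <$fh>) {
            if ($line =~ /^t\h*#/) {
                if (defined $block) { # next header: current block is complete
                    $pending = $line;
                    return $block;
                }
                $block = $line;       # very first header in the input
                next;
            }
            $block .= $line if defined $block;  # ignore lines before any header
        }
        return $block;                # last block, or undef at end of input
    };
}

# Shortened sample, read through an in-memory file handle.
my $data = "t # 3-231, 1\nu 0 1 2\nu 0 2 2\nt # 3-232, 1\nu 0 1 2\nu 0 3 2\n";
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

my $next_block = block_reader($fh);
while (defined(my $blk = $next_block->())) {
    # Require both lines, each tested on its own.
    next unless $blk =~ /^u\h+0\h+1\h+2\h*$/m && $blk =~ /^u\h+0\h+2\h+2\h*$/m;
    my ($trans) = $blk =~ /^(t\h*#.*)/;
    print "$_\n" for $trans, $blk =~ /^(u\h+0\h+[12]\h+2)\h*$/mg;
}
```

Only one block is held in memory at a time, so the size of the file no longer matters.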

If you only want the blocks that have more than one match for the second regex:

while ($big_string=~/^(t\h*#\h*[0-9,\h-]+[\s\S]*?(?=(?:^t\h*#\h*[0-9,\h-]+)|\z))/mg) {
    $block=$1;
    $trans=$1 if $block=~/^(t\h*#\h*[0-9,\h-]+)/;
    # Collect every matching line of the block (declared with my for strict)
    my @matches=$block=~/^(u\h+0\h+[12]\h+2)/mg;
    if (scalar @matches >1) {
        print "$trans\n";
        print join("\n", @matches), "\n";
    }
}

Prints:

t # 3-231, 1
u 0 1 2
u 0 2 2

And, of course, TIMTOWTDI:

# split with a capturing group returns (leading text, header, body, header, body, ...);
# the leading text before the first header is discarded
my (undef, @result) = split(/^(t\h+#[\h\d,-]+)/m, $big_string);

for my $i (0 .. @result/2-1) {
    my @matches=$result[2*$i+1]=~/^(u\h+0\h+[12]\h+2)/mg;
    if (scalar @matches>1){
        print "$result[2*$i]\n";
        print join("\n", @matches), "\n";
    }
}

Prints:

t # 3-231, 1
u 0 1 2
u 0 2 2

answered Nov 29 '22 05:11

dawg
