I have a database of subgraphs that looks like this:
t # 3-231, 1
v 0 94
v 1 14
v 2 16
v 3 17
u 0 1 2
u 0 2 2
u 0 3 2
t # 3-232, 1
v 0 14
v 1 94
v 2 19
v 3 91
u 0 1 2
u 0 3 2
u 1 2 2
t # 3-233, 1
v 0 17
v 1 91
v 2 16
v 3 94
u 0 1 2
u 0 3 2
u 1 2 2
t # 3-234, 1
v 0 90
v 1 93
v 2 102
v 3 95
u 0 1 2
u 0 3 2
u 1 2 2
I would like to retrieve all transactions that contains the following patterns: 'u 0 1 2' and 'u 0 2 2' along with transaction id (ex. line starts with t #).
I used the following code to accomplish this job:
#!/usr/bin/perl -w
use strict;
my $input = shift @ARGV or die $!;
open (FILE, "$input") or die $!;
while (<FILE>) {
my @fields = ('t', 'u\ 0\ 1', 'u\ 0\ 2');
my $field_regex = join( "|", @fields );
my @field_lines;
push( @field_lines, $_ ) if ( /^(?:$field_regex) / );
last if @field_lines == @fields;
push @field_lines, "";
print join( "\n", sort @field_lines );
}
close FILE;
However, it retrieves patterns, when only one line match, such as:
t # 3-231, 1
u 0 1 2
u 0 2 2
t # 3-232, 1
u 0 1 2
t # 3-233, 1
u 0 1 2
t # 3-233, 1
u 0 1 2
My ultimate goal is to retrieve transactions that completely match to my regex, such as
t # 3-231, 1
u 0 1 2
u 0 2 2
Thank you for your help!
Olha
The string passed to m operator can be enclosed within any character which will be used as a delimiter to regular expressions. To print this matched pattern and the remaining string, m operator provides various operators which include $, which contains whatever the last grouping match matched.
9.3. The Binding Operator, =~ Matching against $_ is merely the default; the binding operator (=~) tells Perl to match the pattern on the right against the string on the left, instead of matching against $_.
The Special Character Classes in Perl are as follows: Digit \d[0-9]: The \d is used to match any digit character and its equivalent to [0-9]. In the regex /\d/ will match a single digit. The \d is standardized to “digit”.
(\S+) | will match and capture any number (one or more) of non-space characters, followed by a space character (assuming the regular expression isn't modified with a /x flag). In both cases, these constructs appear to be one component of an alternation. Breaking it down: ( .... ) : Group and capture.
One way: keep the current transaction-id on hand, and store lines of interest in an arrayref associated with that transaction-id key in a hash.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);
my @fields = ('u 0 1', 'u 0 2');
my $field_regex = join '|', map { quotemeta } @fields;
my (%trans, $tid);
while (<>) {
chomp;
if (/^t #/) {
$tid = $_;
next;
}
push @{$trans{$tid}}, $_ if /$field_regex/;
}
dd %trans;
# foreach my $tid (sort keys %trans) {
# say $tid;
# say for @{$trans{$tid}};
# }
I use while (<>)
which reads line by line all files given on command line when the program is invoked (or STDIN
), for simplicity here. I use Data::Dump
to show a complex data structure; there is Data::Dumper
in the core for that.
The qotemeta escapes all ASCI non-"word" characters, that can throw off regex, and this includes spaces.
The program above in general loses the order of transaction-id's from the file, since hash keys are unordered, while it keeps the order of lines for each id since those are on an array. This is not hard to remedy if needed.
Tested only with the provided data file.
This type of pattern is most easily handled by treating the file contents as blocks rather than lines.
Here is an easy example (with your data):
use strict;
my $big_string;
my $trans;
my $block;
open my $fh, '<', '/tmp/file.txt' or die "Can't open file $!";
$big_string = do { local $/; <$fh> };
while ($big_string=~/^(t\h*#\h*[0-9,\h-]+[\s\S]*?(?=(?:^t\h*#\h*[0-9,\h-]+)|\z))/mg) {
$block=$1;
$trans=$1 if $block=~/^(t\h*#\h*[0-9,\h-]+)/;
if ($block=~/^(u\h+0\h+[12]\h+2)/m){
print "$trans\n";
for ($block=~/^(u\h+0\h+[12]\h+2)/mg) {
print "$1\n";
}
}
}
Prints:
t # 3-231, 1
u 0 2 2
u 0 2 2
t # 3-232, 1
u 0 1 2
t # 3-233, 1
u 0 1 2
t # 3-234, 1
u 0 1 2
This assumes that your data fits easily into memory. If not, there are many ways to read a file block by block as well.
If you only want the blocks that have more than one match to the second regex:
while ($big_string=~/^(t\h*#\h*[0-9,\h-]+[\s\S]*?(?=(?:^t\h*#\h*[0-9,\h-]+)|\z))/mg) {
$block=$1;
$trans=$1 if $block=~/^(t\h*#\h*[0-9,\h-]+)/;
@matches=$block=~/^(u\h+0\h+[12]\h+2)/mg;
if (scalar @matches >1) {
print "$trans\n";
print join "\n", @matches;
}
}
Prints:
t # 3-231, 1
u 0 1 2
u 0 2 2
And, of course, TIMTOWDI:
my @result = do{ local @_ = split(/^(t\h+#[\h\d,-]+)/m, $big_string);
@_[1..$#_]};
for my $i (0 .. @result/2-1) {
@matches=$result[2*$i+1]=~/^(u\h+0\h+[12]\h+2)/mg;
if (scalar @matches>1){
print "$result[2*$i]\n";
print join("\n", @matches);
}
}
t # 3-231, 1
u 0 1 2
u 0 2 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With