I need to search for and mark patterns which are split somewhere on a line. Here is a shortened list of sample patterns which are placed in a separate file, e.g.:
CAT,TREE
LION,FOREST
OWL,WATERFALL
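A minimal Ruby sketch of turning such a file into pairs, assuming one comma-separated pair per line. The helper name `parse_patterns` and the file name `patterns.txt` are hypothetical, not part of the original setup.

```ruby
# parse_patterns is a hypothetical helper: it turns "FIRST,SECOND" lines
# into [first, second] pairs.
def parse_patterns(text)
  text.each_line
      .map(&:strip)                    # drop newlines and stray spaces
      .reject(&:empty?)                # skip blank lines
      .map { |row| row.split(',', 2) } # column 1, column 2
end

sample = "CAT,TREE\nLION,FOREST\nOWL,WATERFALL\n"
p parse_patterns(sample)

# With a real file (the name is an assumption):
# patterns = parse_patterns(File.read('patterns.txt'))
```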
A match appears if the item from column 2 ever appears after and on the same line as the item from column 1. E.g.:
THEREISACATINTHETREE. (matches)
No match appears if the item from column 2 appears first on the line, e.g.:
THETREEHASACAT. (does not match)
Furthermore, no match appears if the items from columns 1 and 2 touch, e.g.:
THECATTREEHASMANYBIRDS. (does not match)
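The three rules above (column 2 after column 1, same line, not touching) could be expressed as a single regular expression with a non-empty gap. This is only a sketch of the yes/no test, not of the marking itself, and `split_match?` is a hypothetical helper name.

```ruby
# split_match? is a hypothetical helper: true when `second` occurs after
# `first` on the same line with at least one character in between.
def split_match?(line, first, second)
  gap = /#{Regexp.escape(first)}.+#{Regexp.escape(second)}/
  line.match?(gap)
end

puts split_match?('THEREISACATINTHETREE.', 'CAT', 'TREE')    # true
puts split_match?('THETREEHASACAT.', 'CAT', 'TREE')          # false
puts split_match?('THECATTREEHASMANYBIRDS.', 'CAT', 'TREE')  # false
```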
Once any match is found, I need to mark it with \start{n} (appearing after the column 1 item) and \end{n} (appearing before the column 2 item), where n is a simple counter which increases anytime any match is found. E.g.:
THEREISACAT\start{1}INTHE\end{1}TREE.
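For this simple case, with one pair and one match, a substitution with three capture groups might be enough. The sketch below is Ruby, `mark_first` is a hypothetical helper name, and it deliberately does not cover several matches on one line.

```ruby
# mark_first is a hypothetical helper that tags only the first split match
# of one pair; it does not handle several matches on a line.
def mark_first(line, first, second, n)
  pair = /(#{Regexp.escape(first)})(.+?)(#{Regexp.escape(second)})/
  line.sub(pair) { "#{$1}\\start{#{n}}#{$2}\\end{#{n}}#{$3}" }
end

puts mark_first('THEREISACATINTHETREE.', 'CAT', 'TREE', 1)
# prints THEREISACAT\start{1}INTHE\end{1}TREE.
```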
Here is a more complex example:
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
This becomes:
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
Sometimes there are multiple matches in the same place:
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
This becomes:
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
How can I find these matches and mark them in this way?
Here is a Perl way to do it:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# couples of patterns to search for
my @patterns = (
    ['CAT', 'TREE'],
    ['LION', 'FOREST'],
    ['OWL', 'WATERFALL'],
);
# loop over all sentences
while (my $line = <DATA>) {
    chomp $line;      # remove linefeed
    my $count = 1;    # counter of start/end, reset for every line
    foreach my $pats (@patterns) {
        # $p1 = first pattern, $p2 = second
        my ($p1, $p2) = @$pats;
        # split on the patterns (quoted so they match literally), keep them,
        # and remove empty strings (but not a fragment such as "0")
        my @s = grep { $_ ne '' } split /(\Q$p1\E|\Q$p2\E)/, $line;
        # $start = position where to put the \start
        # $end   = position where to put the \end
        my ($start, $end) = (undef, undef);
        # loop on all elements given by split
        for my $i (0 .. $#s) {
            # current element
            my $cur = $s[$i];
            # if = first pattern, keep its position in the array
            if ($cur eq $p1) {
                $start = $i;
            }
            # if = second pattern, keep its position in the array
            if ($cur eq $p2) {
                $end = $i;
            }
            # if both are defined and the second pattern comes after the
            # first with something in between, insert \start and \end
            if (defined($start) && defined($end) && $end > $start + 1) {
                $s[$start] .= "\\start{$count}";
                $s[$end] = "\\end{$count}" . $s[$end];
                undef $end;
                $count++;
            }
        }
        # recompose the line
        $line = join '', @s;
    }
    say $line;
}
__DATA__
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACATINTHETREE.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
CAT...TREE...CAT...TREE
Output:
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACAT\start{1}INTHE\end{1}TREE.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT\start{1}...\end{1}TREE...CAT\start{2}...\end{2}TREE
Check this out (Ruby):
#!/usr/bin/env ruby
patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]
lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
  'CAT...TREE...CAT...TREE'
]
lines.each do |line|
  puts line
  matches = Hash.new { |h, e| h[e] = [] }
  match_indices = []
  patterns.each do |first, second|
    offset = 0
    while new_offset = line.index(first, offset) do
      # map second element of the pattern to minimal position it might be matched
      matches[second] << new_offset + first.size + 1
      offset = new_offset + 1
    end
  end
  global_counter = 1
  matches.each do |second, offsets|
    offsets.each do |offset|
      second_offset = offset
      while new_offset = line.index(second, second_offset) do
        # register the end index of the first pattern and
        # the start index of the second pattern with the global match count
        match_indices << [offset - 1, new_offset, global_counter]
        second_offset = new_offset + 1
        global_counter += 1
      end
    end
  end
  indices = Hash.new { |h, e| h[e] = "" }
  match_indices.each do |first, second, global_counter|
    # build the insertion string for the string positions the
    # start and end tags should be placed in
    indices[first] << "\\start{#{global_counter}}"
    indices[second] << "\\end{#{global_counter}}"
  end
  inserted_length = 0
  indices.sort_by { |k, v| k }.each do |position, insert|
    # insert the tags at their positions
    line.insert(position + inserted_length, insert)
    inserted_length += insert.size
  end
  puts line
end
Result:
THEREISACATINTHETREE.
THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT...TREE...CAT...TREE
CAT\start{1}\start{2}...\end{1}TREE...CAT\start{3}...\end{2}\end{3}TREE
EDIT
I inserted some comments and clarified some of the variables.
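As a possible variation on the last step of the Ruby script, the running `inserted_length` offset could be avoided by inserting the tags from the highest position down, so that earlier positions are never shifted. This is only a sketch; `insert_tags` is a hypothetical helper name and the position-to-text hash mirrors the `indices` hash above.

```ruby
# insert_tags is a hypothetical helper. `tags` maps an index in the
# original line to the text that should be inserted at that index.
def insert_tags(line, tags)
  result = line.dup                 # leave the caller's string untouched
  tags.sort_by { |pos, _| -pos }.each do |pos, text|
    result.insert(pos, text)        # highest position first, so lower ones stay valid
  end
  result
end

puts insert_tags('CAT...TREE', { 3 => "\\start{1}", 6 => "\\end{1}" })
# prints CAT\start{1}...\end{1}TREE
```

Because each insertion happens to the right of every position still to be processed, the original indices can be used as they are.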