Searching and marking paired patterns on a line

I need to find and mark pairs of patterns that occur, separated by other text, on the same line. The pattern pairs are stored in a separate file; here is a shortened sample:

CAT,TREE
LION,FOREST
OWL,WATERFALL
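
A minimal Ruby sketch of reading such a list, assuming one comma-separated pair per line as in the sample above. The helper name `parse_patterns` is hypothetical:

```ruby
# Hypothetical helper: turn the text of a pattern file into [first, second] pairs.
# Assumes one "FIRST,SECOND" pair per line, as in the sample above.
def parse_patterns(text)
  text.each_line.map(&:chomp).reject(&:empty?).map { |row| row.split(",", 2) }
end

p parse_patterns("CAT,TREE\nLION,FOREST\nOWL,WATERFALL\n")
```

In practice the text would probably come from `File.read` on the pattern file rather than from a literal string.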

A match occurs when the item from column 2 appears anywhere after the item from column 1 on the same line. E.g.:

THEREISACATINTHETREE. (matches)

There is no match if the item from column 2 only appears before the item from column 1 on the line, e.g.:

THETREEHASACAT. (does not match)

Furthermore, there is no match if the items from columns 1 and 2 touch, e.g.:

THECATTREEHASMANYBIRDS. (does not match)
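
The three rules above can be sketched as a single predicate. This is only an illustration in Ruby, assuming the patterns are literal strings; the method name `paired?` is hypothetical:

```ruby
# Hypothetical predicate for the three rules above: `first` must occur,
# then at least one other character, then `second`, all on one line.
def paired?(line, first, second)
  pattern = /#{Regexp.escape(first)}.+#{Regexp.escape(second)}/
  line.match?(pattern)
end

puts paired?("THEREISACATINTHETREE.", "CAT", "TREE")    # true
puts paired?("THETREEHASACAT.", "CAT", "TREE")          # false
puts paired?("THECATTREEHASMANYBIRDS.", "CAT", "TREE")  # false
```

The `.+` requires a gap of at least one character, which is what rules out the touching case.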

Once a match is found, I need to mark it with \start{n} (placed after the column 1 item) and \end{n} (placed before the column 2 item), where n is a simple counter that increases each time a match is found. E.g.:

THEREISACAT\start{1}INTHE\end{1}TREE.
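
For a single pair and a single match, the marking step might look like the following Ruby sketch. The helper name `mark_pair` is hypothetical, and it deliberately ignores overlapping and repeated matches, which need an index-based approach:

```ruby
# Hypothetical helper: mark the first gapped occurrence of one pair with counter n.
def mark_pair(line, first, second, n)
  pattern = /(#{Regexp.escape(first)})(.+?)(#{Regexp.escape(second)})/
  # The block form of sub inserts the returned string literally,
  # so the backslashes in the tags are not reinterpreted.
  line.sub(pattern) { "#{$1}\\start{#{n}}#{$2}\\end{#{n}}#{$3}" }
end

puts mark_pair("THEREISACATINTHETREE.", "CAT", "TREE", 1)
# prints THEREISACAT\start{1}INTHE\end{1}TREE.
```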

Here is a more complex example:

THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.

This becomes:

THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.

Sometimes there are multiple matches in the same place:

THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.

This becomes:

THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.

  • There are no spaces in the file.
  • Many non-Latin characters appear in the file.
  • Pattern matches need only be found on the same line (e.g. "CAT" on line 1 never matches a "TREE" found on line 2, as those are on different lines).
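
On the non-Latin point: in Ruby, `String#index` and `String#insert` work with character offsets on UTF-8 strings, so an offset found by one can be passed to the other without byte arithmetic. A small sketch, using invented sample data that is not from the file in question and a hypothetical method name `tag_between`:

```ruby
# Hypothetical helper: tag the gap between the first `first` and the next `second`.
def tag_between(line, first, second, n)
  start_at = line.index(first)
  return line if start_at.nil?
  start_at += first.size                     # character offset just after `first`
  end_at = line.index(second, start_at + 1)  # require a gap of at least one character
  return line if end_at.nil?
  marked = line.dup
  marked.insert(end_at, "\\end{#{n}}")       # insert the later tag first so start_at stays valid
  marked.insert(start_at, "\\start{#{n}}")
  marked
end

# Invented sample line and pair, used only to show multi-byte characters.
puts tag_between("ここに猫がいて木がある", "猫", "木", 1)
```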

How can I find these matches and mark them in this way?

Village asked Mar 12 '12 15:03


2 Answers

Here is a Perl way to do it:

#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# couples of patterns to search for
my @patterns = (
    ['CAT', 'TREE'],
    ['LION', 'FOREST'],
    ['OWL', 'WATERFALL'],
);

# loop over all sentences
while (my $line = <DATA>) {
    chomp $line;    #remove linefeed
    my $count = 1;  #counter of start/end
    foreach my $pats (@patterns) {
        #$p1=first pattern, $p2=second
        my ($p1, $p2) = @$pats;

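        #split on patterns (quoted literally), keep them, remove empty strings
        #grep { length } keeps a field such as "0", which grep {$_} would drop
        my @s = grep { length } split /(\Q$p1\E|\Q$p2\E)/, $line;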

        #$start=position where to put the \start
        #$end=position where to put the \end
        my ($start, $end) = (undef, undef);

        #loop on all elements given by split
        for my $i (0 .. $#s) {
            # current element
            my $cur = $s[$i];

            #if = first pattern, keep its position in the array
            if ($cur eq $p1) {
                $start = $i;
            }

            #if = second pattern, keep its position in the array
            if ($cur eq $p2) {
                $end = $i;
            }

            #if both are defined and second pattern after first pattern
            # insert \start and \end
            if (defined($start) && defined($end) && $end > $start + 1) {
                $s[$start] .= "\\start{$count}";
                $s[$end] = "\\end{$count}" . $s[$end];
                undef $end;
                $count++;
            }
        }
        # recompose the line
        $line = join '', @s;
    }
    say $line;
}

__DATA__
THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACATINTHETREE.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINATREENEARTHEWATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
CAT...TREE...CAT...TREE

output:

THETREEHASACAT. (does not match)
THECATTREEHASMANYBIRDS. (does not match)
THEREISACAT\start{1}INTHE\end{1}TREE.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}INA\end{1}TREENEARTHE\end{3}WATERFALL.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT\start{1}...\end{1}TREE...CAT\start{2}...\end{2}TREE
Toto answered Oct 07 '22 16:10


Check this out (Ruby):

#!/usr/bin/env ruby
patterns = [
  ['CAT', 'TREE'],
  ['LION', 'FOREST'],
  ['OWL', 'WATERFALL']
]

lines = [
  'THEREISACATINTHETREE.',
  'THETREEHASACAT.',
  'THECATTREEHASMANYBIRDS.',
  'THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.',
  'THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.',
  'CAT...TREE...CAT...TREE'
]

lines.each do |line|
  puts line
  matches = Hash.new{|h,e| h[e] = [] }
  match_indices = []
  patterns.each do |first,second|
    offset = 0
    while new_offset = line.index(first,offset) do
      # map second element of the pattern to minimal position it might be matched
      matches[second] << new_offset + first.size + 1
      offset = new_offset + 1
    end
  end
  global_counter = 1
  matches.each do |second,offsets|
    offsets.each do |offset|
      second_offset = offset
      while new_offset = line.index(second,second_offset) do
        # register the end index of the first pattern and 
        # the start index of the second pattern with the global match count
        match_indices << [offset-1,new_offset,global_counter]
        second_offset = new_offset + 1
        global_counter += 1
      end
    end
  end
  indices = Hash.new{|h,e| h[e] = ""}
  match_indices.each do |first,second,global_counter|
    # build the insertion string for the string positions the 
    # start and end tags should be placed in
    indices[first] << "\\start{#{global_counter}}"
    indices[second] << "\\end{#{global_counter}}"
  end
  inserted_length = 0
  indices.sort_by{|k,v| k}.each do |position,insert|
    # insert the tags at their positions
    line.insert(position + inserted_length,insert)
    inserted_length += insert.size
  end
  puts line
end

Result

THEREISACATINTHETREE.
THEREISACAT\start{1}INTHE\end{1}TREE.
THETREEHASACAT.
THETREEHASACAT.
THECATTREEHASMANYBIRDS.
THECATTREEHASMANYBIRDS.
THECATANDLIONLEFTTHEFORESTANDMETANDOWLINTREENEARTHEWATERFALL.
THECAT\start{1}ANDLION\start{2}LEFTTHE\end{2}FORESTANDMETANDOWL\start{3}IN\end{1}TREENEARTHE\end{3}WATERFALL.
THECATDOESNOTLIKETALLTREES,BUTINSTEADLIKESSHORTTREES.
THECAT\start{1}\start{2}DOESNOTLIKETALL\end{1}TREES,BUTINSTEADLIKESSHORT\end{2}TREES.
CAT...TREE...CAT...TREE
CAT\start{1}\start{2}...\end{1}TREE...CAT\start{3}...\end{2}\end{3}TREE
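
Note that the last result line uses three counters, whereas the Perl answer's output for the same input uses two. The script above pairs every occurrence of the first item with every later occurrence of the second item. The sketch below isolates that pairing step; the method name `gapped_pairs` is hypothetical and the logic is condensed from the loops above:

```ruby
# Hypothetical helper: list [start_tag_position, end_tag_position] for every
# occurrence of `first` combined with every later, non-touching `second`.
def gapped_pairs(line, first, second)
  pairs = []
  offset = 0
  while (found = line.index(first, offset))
    tag_at = found + first.size   # where \start{n} would be inserted
    search_from = tag_at + 1      # skip one character so touching items do not pair
    while (stop = line.index(second, search_from))
      pairs << [tag_at, stop]
      search_from = stop + 1
    end
    offset = found + 1
  end
  pairs
end

p gapped_pairs("CAT...TREE...CAT...TREE", "CAT", "TREE")
# prints [[3, 6], [3, 19], [16, 19]]
```

Whether the first CAT should also pair with the second TREE depends on how the question's rules are read, so the two answers are best treated as implementing different interpretations.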

EDIT

I added some comments and gave some of the variables clearer names.

Aleksander Pohl answered Oct 07 '22 18:10