
Creating a new regex based on the returned results and rules of a previous regex | Indexing a regex and seeing how the regex has matched a substring

I am particularly looking at R, Perl, and shell. But any other programming language would be fine too.

QUESTION

Is there a way to visually or programmatically inspect and index a matched string based on the regex? This is intended for referencing back to the first regex and its results inside of a second regex, so as to be able to modify a part of the matched string and write new rules for that particular part.

https://regex101.com does visualize how a certain string matches the regular expression. But it is far from perfect and is not efficient for my huge dataset.

PROBLEM

I have around 12000 matched strings (DNA sequences) from my first regex. I want to process these strings and, based on some strict rules, find other strings in a second file that pair with those 12000 matches.

SIMPLIFIED EXAMPLE

This is my first regex (a simplified, shorter version of my original regex) that runs through my first text file.

[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)

Let's suppose that it finds the following three sub-strings in my large text file:

1. AAACCCGTGTAATAACAGACGTACTGTGTA
2. TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
3. TAACAAGGACCCTGTGTA

Now I have a second file which includes a very large string. From this second file, I am only interested in extracting those sub-strings that match a new (second) regex, which itself depends on my first regex in a few sections. Therefore, this second regex has to take into account the substrings matched in the first file and look at how they have matched the first regex!

Allow me, for the sake of simplicity, to index my first regex for better illustration in this way:

first.regex.p1 = [ACGT]{1,12000}
first.regex.p2 = (AAC)
first.regex.p3 = [AG]{2,5}
first.regex.p4 = [ACGT]{2,5}
first.regex.p5 = (CTGTGTA)

Now my second (new) regex which will search the second text file and will be dependent on the results of the first regex (and how the substrings returned from the first file have matched the first regex) will be defined in the following way:

second.regex = (CTAAA)[AC]{5,100}(TTTGGG){rule1}(CTT)[AG]{10,5000}{rule2}

Here, rule1 and rule2 depend on the matches that the first regex found in the first file:

rule1 = look at the matched strings from file1 and complement the pattern of first.regex.p3 that is found in the matched substring from file1 (the complement should of course have the same length)
rule2 = look at the matched strings from file1 and complement the pattern of first.regex.p4 that is found in the matched substring from file1 (the complement should of course have the same length)

You can see that second regex has sections that belong to itself (i.e. they are independent of any other file/regex), but it also has sections that are dependent on the results of the first file and the rules of the first regex and how each sub-string in the first file has matched that first regex!

Now again for the sake of simplicity, I use the third matched substring from file1 (because it is shorter than the other two) to show you what a possible match from the second file looks like and how it satisfies the second regex:

This is what we had from our first regex run through the first file:

3. TAACAAGGACCCTGTGTA

So in this match, we see that:

T has matched first.regex.p1
AAC has matched first.regex.p2
AAGGA has matched first.regex.p3
CC has matched first.regex.p4
CTGTGTA has matched first.regex.p5
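
One way to inspect this decomposition programmatically, sketched here in Perl: give every part of the first regex its own capture group, then read the captured text and its position from Perl's built-in @- and @+ arrays. The helper name decompose and the output layout are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one [part number, text, start, end] row
# per part of the first regex, or an empty list if there is no match.
sub decompose {
    my ($string) = @_;

    # every part of the first regex gets its own capture group
    return
        unless $string =~ m/([ACGT]{1,12000})(AAC)([AG]{2,5})([ACGT]{2,5})(CTGTGTA)/;

    my @rows;
    for my $n ( 1 .. 5 ) {
        # $-[$n] is the start offset of group $n, $+[$n] the offset just past its end
        my $text = substr( $string, $-[$n], $+[$n] - $-[$n] );
        push @rows, [ $n, $text, $-[$n], $+[$n] ];
    }
    return @rows;
}

for my $row ( decompose('TAACAAGGACCCTGTGTA') ) {
    printf "first.regex.p%d = %-7s (offsets %d-%d)\n", @$row;
}
```

The offsets are zero-based and the end offset is exclusive, so they can be passed straight to substr to cut the same part out of the matched string again.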

Now in our second regex for the second file we see that when looking for a substring that matches the second regex, we are dependent on the results coming from the first file (which match the first regex). Particularly we need to look at the matched substrings and complement the parts that matched first.regex.p3 and first.regex.p4 (rule1 and rule2 from second.regex).

complement means:
A will be substituted by T
T -> A
G -> C
C -> G

So if you have TAAA, the complement will be ATTT.

Therefore, going back to this example:

3. TAACAAGGACCCTGTGTA

We need to complement the following to satisfy the requirements of the second regex:

AAGGA has matched first.regex.p3
CC has matched first.regex.p4

And complements are:

TTCCT (based on rule1)
GG (based on rule2)

So an example of a substring that matches second.regex is this:

CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG

This is only one example! In my case I have 12000 matched substrings, and I cannot figure out how to even approach this problem. I have tried writing pure regex, but I have completely failed to implement anything that properly follows this logic. Perhaps I shouldn't even be using regex?

Is it possible to do this entirely with regex, or should I look at another approach? Is it possible to index a regex and, in the second regex, reference back to the first regex and force it to consider the matched substrings as returned by the first regex?

asked Sep 11 '17 at 08:09 by l..



1 Answer

This can be done programmatically in Perl, or any other language.

Since you need input from two different files, you cannot do this in pure regex, as regex cannot read files. You cannot even do it in one pattern, as no regex engine remembers what you matched before on a different input string. The bookkeeping has to be done in the program surrounding your matches; the matching itself can very well be done with regex, as that's what regex is meant for.
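
A minimal sketch of that idea in Perl, assuming both inputs are already held in the variables $string1 and $string2 (hypothetical names; the values are the question's third example): the program captures the relevant parts of the first match, complements them, and interpolates the result into the second pattern.

```perl
use strict;
use warnings;

# hypothetical variable names, values taken from the question's third example
my $string1 = 'TAACAAGGACCCTGTGTA';
my $string2 = 'CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG';

# first match: capture only the parts that rule1 and rule2 refer to (p3 and p4)
my ( $p3, $p4 ) = $string1 =~ m/[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA/
    or die "first regex did not match\n";

# the program, not the regex engine, carries the result over:
# copy each part and complement the copy
( my $rule1 = $p3 ) =~ tr/ACGT/TGCA/;
( my $rule2 = $p4 ) =~ tr/ACGT/TGCA/;

# interpolate the complements into the second pattern
my $second = qr/CTAAA[AC]{5,100}TTTGGG${rule1}CTT[AG]{10,5000}${rule2}/;

print "second pattern matches\n" if $string2 =~ $second;
```

Because the complements consist only of the letters A, C, G and T, they can be interpolated without quoting; for arbitrary text you would wrap them in \Q...\E.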

You can build the second pattern up step by step. I've implemented a more advanced version in Perl that can easily be adapted to suit other pattern combinations as well, without changing the actual code that does the work.

Instead of file 1, I will use the DATA section. It holds all three example input strings. Instead of file 2, I use your example output for the third input string.

The main idea behind this is to split up both patterns into sub-patterns. For the first one, we can simply use an array of patterns. For the second one, we create anonymous functions that we will call with the match results from the first pattern to construct the second complete pattern. Most of them just return a fixed string, but two actually take a value from the arguments to build the complements.

use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/; # this is a transliteration, faster than s///
    return $string;
}

# first regex, split into sub-patterns
my @first = ( 
    qr([ACGT]{1,12000}), 
    qr(AAC), 
    qr([AG]{2,5}), 
    qr([ACGT]{2,5}), 
    qr(CTGTGTA), 
);

# second regex, split into sub-patterns as callbacks
my @second = (
    sub { return qr(CTAAA) },
    sub { return qr([AC]{5,100}) },
    sub { return qr(TTTGGG) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p3
        return complement( $matches[3] );
    },
    sub { return qr(CTT) },
    sub { return qr([AG]{10,5000}) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p4
        return complement( $matches[4] );
    },
);

my $file2 = "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG";

while ( my $file1 = <DATA> ) {

    # this pattern captures the full match in $1 and each sub-section in $2 .. $6,
    # so @matches will contain (full, p1, p2, p3, p4, p5) and p3 sits at index 3
    my @matches = ( $file1 =~ m/(($first[0])($first[1])($first[2])($first[3])($first[4]))/g );

    # iterate the list of anonymous functions and call each of them,
    # passing in the match results of the first match
    my $pattern2 = join q{}, map { '(' . $_->(@matches) . ')' } @second;

    my @matches2 = ( $file2 =~ m/($pattern2)/ );
}

__DATA__
AAACCCGTGTAATAACAGACGTACTGTGTA
TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
TAACAAGGACCCTGTGTA

These are the generated second patterns for your three input substrings.

((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TCT)((?^:CTT))((?^:[AG]{10,5000}))(GCAT)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(CC)((?^:CTT))((?^:[AG]{10,5000}))(AA)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TTCCT)((?^:CTT))((?^:[AG]{10,5000}))(GG)

If you're not familiar with this, it's what happens if you print a pattern that was constructed with the quoted regex operator qr//.
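
As a small illustration of that notation (a sketch, assuming Perl 5.14 or newer, where the default flags are written as ^): a qr// object stringifies to a non-capturing group carrying its flags, while a plain string such as the return value of complement() is inserted unchanged.

```perl
use strict;
use warnings;

my $regex = qr(CTAAA);    # a compiled pattern object
my $plain = 'TTCCT';      # an ordinary string, like the result of complement()

# joining both the same way the answer's map { '(' . ... . ')' } does
my $combined = "($regex)($plain)";

print "$combined\n";      # on Perl 5.14+ this prints ((?^:CTAAA))(TTCCT)
```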

The pattern matches your example output for the third case. The resulting @matches2 looks like this when dumped out using Data::Printer.

[
    [0] "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG",
    [1] "CTAAA",
    [2] "ACACC",
    [3] "TTTGGG",
    [4] "TTCCT",
    [5] "CTT",
    [6] "AAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAG",
    [7] "GG"
]

I cannot say anything about the speed of this implementation, but I believe it will be reasonably fast.

If you want to find other combinations of patterns, all you have to do is replace the sub { ... } entries in those two arrays. If the first match has a different number of sub-patterns than five, you'd also construct that pattern programmatically. I've not done that above to keep things simpler. Here's what it would look like.

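# the outer group keeps the full match at index 0, so $matches[3] is still p3
my $pattern1 = join q{}, map { "($_)" } @first;
my @matches  = ( $file1 =~ m/($pattern1)/ );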
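
A variation on the same idea, sketched under the assumption that you would rather refer to the parts by name than by position: Perl's named capture groups (available since Perl 5.10) fill the %+ hash, so the code that builds the second pattern no longer depends on how many sub-patterns precede the one it needs. The part names p1 to p5 follow the question's numbering, and the helper second_pattern_for is hypothetical.

```perl
use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/;
    return $string;
}

# first regex as ordered name => sub-pattern pairs
my @first = (
    p1 => qr([ACGT]{1,12000}),
    p2 => qr(AAC),
    p3 => qr([AG]{2,5}),
    p4 => qr([ACGT]{2,5}),
    p5 => qr(CTGTGTA),
);

# build (?<p1>...)(?<p2>...)... from the pairs
my $pattern1 = q{};
for ( my $i = 0; $i < @first; $i += 2 ) {
    $pattern1 .= sprintf '(?<%s>%s)', $first[$i], $first[ $i + 1 ];
}

# hypothetical helper: builds the second pattern for one line of file 1
sub second_pattern_for {
    my ($line) = @_;
    return unless $line =~ m/$pattern1/;

    my %part = %+;    # copy the named captures before running another match

    return join q{},
        'CTAAA', '[AC]{5,100}', 'TTTGGG', complement( $part{p3} ),
        'CTT', '[AG]{10,5000}', complement( $part{p4} );
}

print second_pattern_for('TAACAAGGACCCTGTGTA'), "\n";
```

Adding, removing or reordering parts of the first regex then only requires editing the @first list, as long as the names used in second_pattern_for still exist.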

If you want to learn more about this kind of strategy, I suggest you read Mark Jason Dominus' excellent Higher Order Perl, which is available for free as a PDF on the book's website.

answered Sep 17 '22 at 17:09 by simbabque