Creating a new regex based on the returned results and rules of a previous regex | Indexing a regex and seeing how the regex has matched a substring

Tags: regex, shell, r, perl, bioinformatics

I am particularly looking at R, Perl, and shell. But any other programming language would be fine too.

QUESTION

Is there a way to visually or programmatically inspect and index a matched string based on the regex? This is intended for referencing back to the first regex and its results inside of a second regex, so as to be able to modify a part of the matched string and write new rules for that particular part.

https://regex101.com does visualize how a certain string matches the regular expression. But it is far from perfect and is not efficient for my huge dataset.

PROBLEM

I have around 12000 matched strings (DNA sequences) from my first regex. I want to process these strings and, based on some strict rules, find other strings in a second file that pair with those 12000 matches.

SIMPLIFIED EXAMPLE

This is my first regex (a simplified, shorter version of my original regex) that runs through my first text file.

[ACGT]{1,12000}(AAC)[AG]{2,5}[ACGT]{2,5}(CTGTGTA)

Let's suppose that it finds the following three sub-strings in my large text file:

1. AAACCCGTGTAATAACAGACGTACTGTGTA
2. TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
3. TAACAAGGACCCTGTGTA

Now I have a second file that contains one very large string. From this second file, I am only interested in extracting the sub-strings that match a new (second) regex, which itself depends on my first regex in a few sections. Therefore, this second regex has to take into account the substrings matched in the first file and look at how they matched the first regex!

Allow me, for the sake of simplicity, to index my first regex in the following way for better illustration:

first.regex.p1 = [ACGT]{1,12000}
first.regex.p2 = (AAC)
first.regex.p3 = [AG]{2,5}
first.regex.p4 = [ACGT]{2,5}
first.regex.p5 = (CTGTGTA)

Now my second (new) regex, which will search the second text file and depends on the results of the first regex (and on how the substrings returned from the first file matched it), is defined in the following way:

second.regex = (CTAAA)[AC]{5,100}(TTTGGG){rule1}(CTT)[AG]{10,5000}{rule2}

Here, rule1 and rule2 depend on the matches coming from the first regex on the first file. Hence:

rule1 = look at the matched strings from file1 and complement the pattern of first.regex.p3 that is found in the matched substring from file1 (the complement should of course have the same length)
rule2 = look at the matched strings from file1 and complement the pattern of first.regex.p4 that is found in the matched substring from file1 (the complement should of course have the same length)

You can see that second regex has sections that belong to itself (i.e. they are independent of any other file/regex), but it also has sections that are dependent on the results of the first file and the rules of the first regex and how each sub-string in the first file has matched that first regex!

Again for the sake of simplicity, I use the third matched substring from file1 (because it is shorter than the other two) to show what a possible match from the second file looks like and how it satisfies the second regex:

This is what we had from our first regex run through the first file:

3. TAACAAGGACCCTGTGTA

So in this match, we see that:

T has matched first.regex.p1
AAC has matched first.regex.p2
AAGGA has matched first.regex.p3
CC has matched first.regex.p4
CTGTGTA has matched first.regex.p5

When looking for a substring in the second file that matches the second regex, we depend on the results coming from the first file (which match the first regex). In particular, we need to look at the matched substrings and complement the parts that matched first.regex.p3 and first.regex.p4 (rule1 and rule2 from second.regex).

Complement means:
A -> T
T -> A
G -> C
C -> G

So if you have TAAA, the complement will be ATTT.

Therefore, going back to this example:

TAACAAGGACCCTGTGTA

We need to complement the following to satisfy the requirements of the second regex:

AAGGA has matched first.regex.p3
CC has matched first.regex.p4

And complements are:

TTCCT (based on rule1)
GG (based on rule2)

So an example of a substring that matches second.regex is this:

CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG

This is only one example! But in my case I have 12000 matched substrings! I cannot figure out how to even approach this problem. I have tried writing pure regex, but I have completely failed to implement anything that properly follows this logic. Perhaps I shouldn't even be using regex?

Is it possible to do this entirely with regex? Or should I look at another approach? Is it possible to index a regex and in the second regex reference back to the first regex and force the regex to consider the matched substrings as returned by first regex?

asked Sep 11 '17 08:09

l..

1 Answer

This can be done programmatically in Perl, or any other language.

Since you need input from two different files, you cannot do this in pure regex, as regex cannot read files. You cannot even do it in one pattern, as no regex engine remembers what you matched before on a different input string. The bookkeeping has to be done in the program surrounding your matches; the matching itself can very well stay regex, as that's what regex is meant for.

You can build the second pattern up step by step. I've implemented a more advanced version in Perl that can easily be adapted to suit other pattern combinations as well, without changing the actual code that does the work.

Instead of file 1, I will use the DATA section. It holds all three example input strings. Instead of file 2, I use your example output for the third input string.

The main idea behind this is to split up both patterns into sub-patterns. For the first one, we can simply use an array of patterns. For the second one, we create anonymous functions that we will call with the match results from the first pattern to construct the second complete pattern. Most of them just return a fixed string, but two actually take a value from the arguments to build the complements.

use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/; # this is a transliteration, faster than s///
    return $string;
}

# first regex, split into sub-patterns
my @first = ( 
    qr([ACGT]{1,12000}), 
    qr(AAC), 
    qr([AG]{2,5}), 
    qr([ACGT]{2,5}), 
    qr(CTGTGTA), 
);

# second regex, split into sub-patterns as callbacks
my @second = (
    sub { return qr(CTAAA) },
    sub { return qr([AC]{5,100}) },
    sub { return qr(TTTGGG) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p3
        return complement( $matches[3] );
    },
    sub { return qr(CTT) },
    sub { return qr([AG]{10,5000}) },
    sub {
        my (@matches) = @_;

        # complement the pattern of first.regex.p4
        return complement( $matches[4] );
    },
);

my $file2 = "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG";

while ( my $file1 = <DATA> ) {
    chomp $file1;

    # this pattern will match the full thing in $1, and each sub-section in $2, $3, ...
    # @matches will contain ($1, $2, $3, $4, $5, $6), so $matches[3] is first.regex.p3;
    # lines that do not match the first pattern are skipped
    my @matches = ( $file1 =~ m/(($first[0])($first[1])($first[2])($first[3])($first[4]))/ )
        or next;

    # iterate the list of anonymous functions and call each of them,
    # passing in the match results of the first match
    my $pattern2 = join q{}, map { '(' . $_->(@matches) . ')' } @second;
    print "$pattern2\n";

    my @matches2 = ( $file2 =~ m/($pattern2)/ );
}

__DATA__
AAACCCGTGTAATAACAGACGTACTGTGTA
TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
TAACAAGGACCCTGTGTA

These are the generated second patterns for your three input substrings.

((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TCT)((?^:CTT))((?^:[AG]{10,5000}))(GCAT)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(CC)((?^:CTT))((?^:[AG]{10,5000}))(AA)
((?^:CTAAA))((?^:[AC]{5,100}))((?^:TTTGGG))(TTCCT)((?^:CTT))((?^:[AG]{10,5000}))(GG)

If you're not familiar with the (?^:...) wrappers, that is what you get when you print a pattern that was constructed with the quote-like regex operator qr//.
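As a small illustration of that stringification, here is a minimal sketch. It assumes Perl 5.14 or newer, where compiled patterns print with the `^` marker, and the variable names are only for this example.

```perl
use strict;
use warnings;

# A pattern compiled with qr// turns into "(?^:...)" when used as a string.
my $compiled = qr(AAC);
my $as_text  = "$compiled";

# Plain strings, such as the return value of complement(), are used as they are.
my $combined = '(' . $compiled . ')' . '(' . 'TTCCT' . ')';

print "$as_text\n";     # (?^:AAC)
print "$combined\n";    # ((?^:AAC))(TTCCT)
```

That is why the fixed sections of the generated patterns carry the extra wrapper while the complemented sections do not.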

The pattern matches your example output for the third case. The resulting @matches2 looks like this when dumped out using Data::Printer.

[
    [0] "CTAAAACACCTTTGGGTTCCTCTTAAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAGGG",
    [1] "CTAAA",
    [2] "ACACC",
    [3] "TTTGGG",
    [4] "TTCCT",
    [5] "CTT",
    [6] "AAAAAAAAAGGGGGAGAGAGAAGAAAAAAAGAGAG",
    [7] "GG"
]
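If the goal is mainly to inspect how a string from file 1 was split across the sub-patterns, named captures are one option. The following is a minimal sketch and not part of the program above; the group names p1 to p5 are made up here to mirror the labels in the question.

```perl
use strict;
use warnings;

# the first regex, with each section named after the labels in the question
my $first = qr/
    (?<p1> [ACGT]{1,12000} )
    (?<p2> AAC )
    (?<p3> [AG]{2,5} )
    (?<p4> [ACGT]{2,5} )
    (?<p5> CTGTGTA )
/x;

my $line = 'TAACAAGGACCCTGTGTA';
my ( %part, %start );

if ( $line =~ $first ) {
    %part = %+;    # named captures, copied before another match overwrites them

    # named groups are also numbered left to right, so @- holds their start offsets
    $start{"p$_"} = $-[$_] for 1 .. 5;
}

for my $name ( sort keys %part ) {
    printf "%s = %-7s starts at offset %d\n", $name, $part{$name}, $start{$name};
}
```

For the third example string this reports AAGGA for p3 (offset 4) and CC for p4 (offset 9), the two parts that get complemented.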

I cannot say anything about the speed of this implementation, but I believe it will be reasonably fast.
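One thing that might help with 12000 input strings, offered as a suggestion rather than a measurement: different file 1 strings can yield the same pair of complements, and then they produce the same second pattern. Collecting the distinct patterns first avoids scanning the large file 2 string again for an identical pattern. The sketch below hard-codes the sub-patterns for brevity, and the fourth input line is invented to show a repeat.

```perl
use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/;
    return $string;
}

# only the two sections that feed rule1 and rule2 are captured here
my $first = qr/[ACGT]{1,12000}AAC([AG]{2,5})([ACGT]{2,5})CTGTGTA/;

# stand-in for the lines of file 1; the last line is a made-up repeat
my @file1 = qw(
    AAACCCGTGTAATAACAGACGTACTGTGTA
    TTTTTTTGCGACCGAGAAACGGTTCTGTGTA
    TAACAAGGACCCTGTGTA
    GGGTAACAAGGACCCTGTGTA
);

my ( %seen, @patterns );
for my $line (@file1) {
    my ( $p3, $p4 ) = $line =~ $first or next;    # skip lines that do not match

    my $pattern = 'CTAAA[AC]{5,100}TTTGGG' . complement($p3)
                . 'CTT[AG]{10,5000}'       . complement($p4);

    # keep each distinct second pattern only once
    push @patterns, $pattern unless $seen{$pattern}++;
}

print "$_\n" for @patterns;
```

Here four input lines collapse to three patterns. Whether this pays off on real data depends on how often the complemented sections repeat.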

If you want to find other combinations of patterns, all you have to do is replace the entries in those two arrays. If the first pattern has a different number of sub-patterns than five, you'd also construct that pattern programmatically. I've not done that above to keep things simpler. Here's what it would look like.

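# the outer parentheses keep the full match in $matches[0], as in the program above
my $pattern1 = join q{}, map { "($_)" } @first;
my @matches  = ( $file1 =~ m/($pattern1)/ );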
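Both joins can be wrapped into one helper so that neither pattern is spelled out by hand. This is a sketch under the same assumptions as the program above; second_pattern_for is a hypothetical name, and the callbacks read their arguments through @_ directly to keep the listing short.

```perl
use strict;
use warnings;

sub complement {
    my $string = shift;
    $string =~ tr/ATGC/TACG/;
    return $string;
}

# hypothetical helper: returns the second pattern for one line of file 1,
# or nothing if the line does not match the first pattern
sub second_pattern_for {
    my ( $line, $first, $second ) = @_;

    # outer parentheses keep the full match in $matches[0],
    # so sub-pattern N of @$first ends up in $matches[N]
    my $pattern1 = join q{}, map { "($_)" } @$first;
    my @matches  = ( $line =~ m/($pattern1)/ ) or return;

    return join q{}, map { '(' . $_->(@matches) . ')' } @$second;
}

my @first = (
    qr([ACGT]{1,12000}), qr(AAC), qr([AG]{2,5}), qr([ACGT]{2,5}), qr(CTGTGTA),
);
my @second = (
    sub { qr(CTAAA) },
    sub { qr([AC]{5,100}) },
    sub { qr(TTTGGG) },
    sub { complement( $_[3] ) },    # complement of first.regex.p3
    sub { qr(CTT) },
    sub { qr([AG]{10,5000}) },
    sub { complement( $_[4] ) },    # complement of first.regex.p4
);

my $pattern2 = second_pattern_for( 'TAACAAGGACCCTGTGTA', \@first, \@second );
print "$pattern2\n";
```

For the third example string this prints the same pattern as the third generated pattern listed earlier.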

If you want to learn more about this kind of strategy, I suggest you read Mark Jason Dominus' excellent Higher Order Perl, which is available for free as a PDF.

answered Sep 17 '22 17:09

simbabque

