Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regex to find 'good enough' sequences

Tags:

c#

regex

I'm looking to implement some algorithm to help me match imperfect sequences.

Say I have a stored sequence of ABBABABBA and I want to find something that 'looks like' that in a large stream of characters.

If I give my algorithm the allowance to have 2 wildcards (differences), how can I use Regex to match something like: where ( and ) mark the differences:

A(A)BABAB(A)A 
or 
(B)BBA(A)ABBA

My Dilemma is that I am looking to find these potential target matches (with imperfections) in a big string of characters. So in something like:

ABBDBABDBCBDBABDB(A(A)BABAB(A)A)DBDBABDBCBDBAB
ADBDBABDBDBDBCBDBABCBDBABCBDBABCBDBABABBBDBABABBCD
DBABCBDABDBABCBCBDBABABDABDBABCBDBABABDDABCBDBABAB

I must be able to search for these 'near enough' matches.
Where brackets denote: (The Good enough Match with the (Differences))

Edit: To be more formal in this example, A match of Length N can be accepted if N-2 characters are the same as the original (2 Differences)

I've used Regex before, but only to find perfect sequences - not for something that 'looks like' one.

Hope this is clear enough to get some advice on. Thanks for reading and any help!

like image 368
Rodney Avatar asked Jun 27 '13 13:06

Rodney


4 Answers

You could use LINQ to be nice and expressive.

In order to use this make sure you have a using System.Linq at the top of your code.

Assuming that

  • source is the stored target pattern
  • test is the string to test.

Then you can do

public static bool IsValid(string source, string test) 
{
  return test != null  
         && source != null 
         && test.Length == source.Length 
         && test.Where((x,i) => source[i] != x).Count() <=2
}

There is also a shortcut version that exits false the moment it fails, saving iterating the rest of the string.

public static bool IsValid(string source, string test) 
{
   return test != null  
          && source != null 
          && test.Length == source.Length 
          && !test.Where((x,i) => source[i] != x).Skip(2).Any();
}

As requested in comments, a little explanation of how this works

in C# a string can be treated as an array of characters, which means that the Linq methods can be used on it.

test.Where((x,i) => source[i] != x)

This uses the overload of Where that for each character in test, x gets assigned to the character and i gets assigned to the index. If the condition character at position i in source is not equal to x then output into the result.

Skip(2)

this skips the first 2 results.

Any()

this returns true if there any results left or false if not. Because linq defers execution the moment that this is false the function exits rather than evaluating the rest of the string.

The entire test is then negated by prefixing with a '!' to indicate we want to know where there are no more results.

Now in order to match as substring you are going to need to behave similar to a regex backtracking...

public static IEnumerable<int> GetMatches(string source, string test)
{
   return from i in Enumerable.Range(0,test.Length - source.Length)
      where IsValid(source, !test.Skip(i).Take(source.Length))
          select i;
}

public static bool IsValid(string source, IEnumerable<char> test) 
{
   return test.Where((x,i) => source[i] != x).Skip(2).Any();
}

UPDATE Explained

Enumerable.Range(0,test.Length - source.Length)

This creates a sequence of numbers from 0 to test.Length - source.Length, there is no need in checking starting at every char in test because once the length is shorter the answer is invalid.

from i in ....

Basically iterate over the collection assigning i to be the current value each time

where IsValid(source, !test.Skip(i).Take(source.Length))

Filter the results to only include the ones where there is a match in test starting at index i (hence the skip) and going on for source.Length chars (hence the take.

select i

return i

This returns an enumerable over the indexes in test where there is a match, you could extract them with

GetMatches(source,test).Select(i => 
                      new string(test.Skip(i).Take(source.Length).ToArray()));
like image 116
Bob Vale Avatar answered Nov 09 '22 19:11

Bob Vale


I don't think this can be done with regexes (if it can, I'm unfamiliar with the syntax). However, you can use the dynamic programming algorithm for Levenshtein distance.

Edit: If you don't need to handle letters that have switched positions, a much easier approach is to just compare each pair of characters from the two strings, and just count the number of differences.

like image 32
Aasmund Eldhuset Avatar answered Nov 09 '22 20:11

Aasmund Eldhuset


I can't think how you'd do it with regex but it should be pretty simple to code.

I'd probably just split the strings up and compare them character by character. If you get a difference count it and move to the next character. If you exceed 2 differences then move on to the next full string.

like image 2
ydaetskcoR Avatar answered Nov 09 '22 19:11

ydaetskcoR


I don't think there's a good regular expression to handle this case. (Or at least, there isn't one that won't take up a good three lines of text and cause multiple bullets in your feet.) However, that doesn't mean you can't solve this problem.

Depending on how large your strings are (I'm assuming they won't be millions of characters each) I don't see anything stopping you from using a single loop to compare individuals character in order, while keeping a tally of differences:

int differences = 0;    // Count of discrepancies you've detected
int tolerance = 7;    // Limit of discrepancies you'll allow

CheckStrings(int differences, int tolerance) {
    for (i = 0; i < StringA.Length; i++)
    {
        if (StringA[i] != StringB[i]) {
            differences++;
            if (differences > tolerance) {
                return false;
            }
        }
    }
    return true;
}

Most of the time, don't be concerned about your strings being too long to put into a loop. Behind-the-scenes, any code that assesses every character of a string will loop in some form or another. Until you literally have millions of characters to deal with, a loop should do the trick just fine.

like image 2
4444 Avatar answered Nov 09 '22 21:11

4444