I am trying to match rows in a file that contain a string, say ACTGGGTAAACTA. If I do
grep "ACTGGGTAAACTA" file
it gives me only the rows that match exactly. Is there a way to allow a certain number of mismatches (substitutions, insertions or deletions)? For example, I am looking for sequences with:
Up to 3 substitutions, like "AGTGGGTAACCAA".
Insertions/deletions (a partial match like "ACTGGGAAAATAAACTA" or "ACTAAACTA").
There's a Python library called fuzzysearch (that I wrote) which provides precisely the required functionality.
Here's some sample code that should work:
from fuzzysearch import find_near_matches
with open('path/to/file', 'r') as f:
    data = f.read()
# 1. search allowing up to 3 substitutions
matches = find_near_matches("ACTGGGTAAACTA", data, max_substitutions=3)
# 2. also allow insertions and deletions, i.e. allow an edit distance
# a.k.a. Levenshtein distance of up to 3
matches = find_near_matches("ACTGGGTAAACTA", data, max_l_dist=3)
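If installing a library is not an option, the substitution-only case can be sketched with the standard library alone. This is a naive sliding-window comparison, not how fuzzysearch works internally, and the function name find_substitution_matches is made up for illustration:

```python
def find_substitution_matches(pattern, text, max_substitutions):
    """Yield (start, mismatches) for every window of `text` that differs
    from `pattern` in at most `max_substitutions` positions."""
    m = len(pattern)
    for start in range(len(text) - m + 1):
        window = text[start:start + m]
        # Count positions where the pattern and the window disagree.
        mismatches = sum(a != b for a, b in zip(pattern, window))
        if mismatches <= max_substitutions:
            yield start, mismatches


# The example from the question: 3 substitutions relative to the pattern.
print(list(find_substitution_matches("ACTGGGTAAACTA", "AGTGGGTAACCAA", 3)))
# -> [(0, 3)]
```

It runs in O(len(text) * len(pattern)) time and does not handle insertions or deletions, so it only covers the first of the two cases above.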
You can use tre-agrep and specify the maximum edit distance with the -E switch (or with a single digit, such as -9). For example, if you have a file foo:
cat << EOF > foo
ACTGGGAAAATAAACTA
ACTAAACTA
ACTGGGTAAACTA
EOF
You can match every line within an edit distance of up to 9 like this (-s prints the cost of each match, -w restricts matches to whole words):
tre-agrep -s -9 -w ACTGGGTAAACTA foo
Output:
4:ACTGGGAAAATAAACTA
4:ACTAAACTA
0:ACTGGGTAAACTA
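The costs in this output are consistent with the plain Levenshtein distance between the pattern and each whole line, which is what one would expect when every line is a single word and -w is given. A minimal standard-library sketch to double-check them follows. It is the textbook dynamic-programming algorithm, not tre-agrep's actual implementation, and the function name levenshtein is only illustrative:

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or free match
            ))
        previous = current
    return previous[-1]


pattern = "ACTGGGTAAACTA"
for line in ["ACTGGGAAAATAAACTA", "ACTAAACTA", "ACTGGGTAAACTA"]:
    print(f"{levenshtein(pattern, line)}:{line}")
# 4:ACTGGGAAAATAAACTA
# 4:ACTAAACTA
# 0:ACTGGGTAAACTA
```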
There used to be a tool called agrep for fuzzy regex matching, but it was abandoned.
http://en.wikipedia.org/wiki/Agrep has a bit of history and links to related tools.
https://github.com/Wikinaut/agrep looks like a revived open source release, but I have not tested it.
Failing that, see if you can find tre-agrep for your distro.