I'm having some trouble with the re.finditer() method in python. For example:
>>>sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
>>>[[m.start(),m.end()] for m in re.finditer(r'(?=gatttaacg)',sequence)]
out: [[22,22]]
As you can see, the start()
and end()
methods are giving the same value. I've noticed this before and just ended up using m.start()+len(query_sequence)
, instead of m.end()
, but I am very confused why this is happening.
The finditer() function matches a pattern in a string and returns an iterator that yields the Match objects of all non-overlapping matches.
The re. finditer() works exactly the same as the re. findall() method except it returns an iterator yielding match objects matching the regex pattern in a string instead of a list. It scans the string from left to right, and matches are returned in the iterator form.
Use the Python regex findall() function to get a list of matched strings.
But finditer and findall are finding different things. Findall indeed finds all the matches in the given string. But finditer only finds the first one, returning an iterator with only one element.
The regex module supports overlapping with finditer :
import regex
sequence = 'acaca'
print [[m.start(), m.end()] for m in regex.finditer(r'(aca)', sequence, overlapped=1)]
[0, 3], [2, 5]]
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print [[m.start(),m.end()] for m in re.finditer(r'(gatttaacg)',sequence)]
remove the lookahead
.It does not capture only asserts.
Output:[[22, 31]]
if you have to use lookahead
use
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print [[m.start(),m.start()+len("aca")] for m in re.finditer(r'(?=aca)',sequence)]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With