
re.finditer() returning same value for start and end methods

I'm having some trouble with the re.finditer() method in Python. For example:

>>> import re
>>> sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
>>> [[m.start(), m.end()] for m in re.finditer(r'(?=gatttaacg)', sequence)]
[[22, 22]]

As you can see, the start() and end() methods return the same value. I've noticed this before and worked around it by using m.start() + len(query_sequence) instead of m.end(), but I don't understand why this happens.

lstbl asked Jan 13 '16 18:01



2 Answers

The third-party regex module supports overlapping matches with finditer:

import regex
sequence = 'acaca'
# overlapped=True lets a new match start inside the previous one
print([[m.start(), m.end()] for m in regex.finditer(r'(aca)', sequence, overlapped=True)])
# Output: [[0, 3], [2, 5]]
Padraic Cunningham answered Oct 17 '22 19:10


import re
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
print([[m.start(), m.end()] for m in re.finditer(r'(gatttaacg)', sequence)])

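Remove the lookahead. A lookahead does not consume any characters; it only asserts that the pattern matches at that position. The match itself is therefore zero-width, which is why start() and end() are equal.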

Output: [[22, 31]]
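To make the difference concrete, here is a small side-by-side sketch using the sequence from the question. The variable names (query, consuming, zero_width) are made up for illustration; only re.finditer() and Match.span() come from the standard library.

```python
import re

sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
query = 'gatttaacg'  # illustrative name, not from the original post

# Plain pattern: the match consumes the text, so the span covers all 9 characters.
consuming = [m.span() for m in re.finditer(query, sequence)]

# Lookahead: asserts that the text follows but consumes nothing, so the span is empty.
zero_width = [m.span() for m in re.finditer('(?=' + query + ')', sequence)]

print(consuming)   # [(22, 31)]
print(zero_width)  # [(22, 22)]
```

In both cases the match is found at index 22; only the amount of text the match consumes differs.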

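If you have to use a lookahead (for example, to find overlapping matches), compute the end yourself from the length of the query:

import re
sequence = 'atgaggagccccaagcttactcgatttaacgcccgcagcctcgccaaaccaccaaacacacca'
# the lookahead match is zero-width, so add the query length to get the end
print([[m.start(), m.start() + len("aca")] for m in re.finditer(r'(?=aca)', sequence)])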
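A possible alternative, which is not what the snippet above does, is to put a capturing group inside the lookahead and read the span of that group instead of the span of the whole match. This is a minimal sketch that assumes the query is a regular expression whose length may not be known in advance; the short 'acaca' test string and the name overlapping are only for illustration.

```python
import re

sequence = 'acaca'

# The overall match is still zero-width, but group 1 records the text
# that the lookahead examined, so its start and end differ.
overlapping = [[m.start(1), m.end(1)] for m in re.finditer(r'(?=(aca))', sequence)]

print(overlapping)  # [[0, 3], [2, 5]]
```

Because the lookahead consumes nothing, the scan resumes one character after each match position, which is what allows the second, overlapping occurrence at index 2 to be found.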
vks answered Oct 17 '22 18:10