Python regex non-greedy acting like greedy

Question

I am working with transcripts and having trouble matching patterns in a non-greedy fashion. My pattern still grabs far too much and appears to be matching greedily.

A transcript looks like this:

>> John doe: Hello, I am John Doe.

>> Hello, I am Jane Doe.

>> Thank you for coming, we will start in two minutes.

>> Sam Smith: [no audio] Good morning, everyone.

To find the names of the speakers, which appear as >> (WHATEVER NAME):, I wrote

import re

pattern = re.compile(r'>>(.*?):')
transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
re.findall(pattern, transcript)

I expected 'John doe' and 'Sam Smith', but it gives me ' John doe' and ' Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith'.

I am confused because .*? is non-greedy, which (I think) should let it grab just 'Sam Smith'. How should I fix the code so that it only grabs whatever is in >> (WHATEVER NAME):? I am using Python 3.6.

Thanks!

cs95 · Accepted Answer

Do you really need regex? You can split on the >> prompts, keep only the pieces that contain a colon, and take the text before the colon as the name.

>>> [i.split(':')[0].strip() for i in transcript.split('>>') if ':' in i]
['John doe', 'Sam Smith']
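
The same idea can be written as a small function. This is only a sketch: the name speaker_names is made up for illustration, and it assumes that a colon in a turn always follows a speaker name. The last call shows where that assumption breaks, using a hypothetical turn with a clock time in it.

```python
def speaker_names(transcript):
    """Return the text before the first ':' of every '>>' turn that has one."""
    names = []
    for turn in transcript.split('>>'):
        # partition() splits at the first ':' only; colon is '' if none exists.
        name, colon, _rest = turn.partition(':')
        if colon:
            names.append(name.strip())
    return names


transcript = (
    '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. '
    '>> Thank you for coming, we will start in two minutes. '
    '>> Sam Smith: [no audio] Good morning, everyone.'
)

names = speaker_names(transcript)
# names == ['John doe', 'Sam Smith']

# Hypothetical input: an unnamed turn that happens to contain a colon.
false_positive = speaker_names('>> We start at 10:30.')
# false_positive == ['We start at 10']
```

If unnamed turns can contain colons like this, a stricter check on what counts as a name is needed.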

user3483203 · Answer

Your understanding of non-greedy matching is slightly off. A non-greedy quantifier matches as few characters as possible from the position where the match begins. It does not move the starting position forward just because a later starting point would give a shorter match.

For example:

start.*?stop

This matches all of startstartstop: once the match begins at the first start, it keeps going until it finds a stop. Non-greedy only means that for the string startstartstopstop, the match ends at the first stop rather than the last one.
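
A quick way to check this behaviour yourself is to run the lazy and greedy versions side by side (the variable names here are just for illustration):

```python
import re

lazy = re.compile(r'start.*?stop')
greedy = re.compile(r'start.*stop')

# The match always begins at the leftmost "start"; laziness only
# decides how far to the right the match extends.
lazy_short = lazy.search('startstartstop').group()
# lazy_short == 'startstartstop'

lazy_long = lazy.search('startstartstopstop').group()
# lazy_long == 'startstartstop' (ends at the first "stop")

greedy_long = greedy.search('startstartstopstop').group()
# greedy_long == 'startstartstopstop' (ends at the last "stop")
```

In your transcript the same thing happens: the second match begins at the >> before "Hello", and the first colon after that point is the one following "Sam Smith".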

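For your question, this is easy to solve by restricting which characters a name may contain and using a positive lookahead for the colon. Because the character class below allows only ASCII letters and spaces, a match cannot run across punctuation or the next >>.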

You may use >> ([a-zA-Z ]+)(?=:):

>>> import re
>>> transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
>>> re.findall(r'>> ([a-zA-Z ]+)(?=:)', transcript)
['John doe', 'Sam Smith']
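
If names can contain characters outside a-z (hyphens, apostrophes, accented letters), one possible variation is a negated character class that excludes only > and :. This is a sketch of an alternative, not part of the pattern above; the second input string is a made-up example.

```python
import re

transcript = (
    '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. '
    '>> Thank you for coming, we will start in two minutes. '
    '>> Sam Smith: [no audio] Good morning, everyone.'
)

# [^>:]+? cannot cross a '>' or a ':', so a match that begins at one
# '>>' can never extend into the next speaker turn.
speaker_pattern = re.compile(r'>>\s*([^>:]+?)\s*:')

speakers = speaker_pattern.findall(transcript)
# speakers == ['John doe', 'Sam Smith']

# Hypothetical name with a hyphen and an apostrophe.
other = speaker_pattern.findall(">> Mary-Ann O'Neil: Hi. >> Hello there.")
# other == ["Mary-Ann O'Neil"]
```

Like the split approach, this treats any colon inside a turn as the end of a name, so it is looser than the letters-and-spaces pattern.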

Tags:

python

regex

python-3.x

regex-greedy

non-greedy

ybcha204
