
Python regex non-greedy acting like greedy

I am working with transcripts and having trouble matching patterns in a non-greedy fashion. My pattern still grabs far too much, as if it were matching greedily.

A transcript looks like this:

>> John doe: Hello, I am John Doe.

>> Hello, I am Jane Doe.

>> Thank you for coming, we will start in two minutes.

>> Sam Smith: [no audio] Good morning, everyone.

To find the names of the speakers that appear as >> (WHATEVER NAME):, I wrote:

import re

pattern = re.compile(r'>>(.*?):')
transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
re.findall(pattern, transcript)

I expected 'John doe' and 'Sam Smith', but (leading spaces aside) it gives me 'John doe' and 'Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith'.

I am confused because .*? is non-greedy, which (I think) should let it grab just 'Sam Smith'. How should I fix the code so that it only grabs whatever is in >> (WHATEVER NAME):? I am using Python 3.6.

Thanks!

ybcha204 asked Jan 28 '26 00:01


2 Answers

Do you really need regex? You can split the transcript on the >> prompts, keep only the segments that contain a colon, and take the text before the first colon as the name.

>>> [i.split(':')[0].strip() for i in transcript.split('>>') if ':' in i]
['John doe', 'Sam Smith']
cs95 answered Jan 30 '26 13:01


Your understanding of non-greedy matching is slightly off. A non-greedy quantifier produces the shortest possible match from the position where matching begins. It does not move that starting position forward just because a later starting point would give a shorter match.

For example:

start.*?stop

This pattern matches all of startstartstop, because once it begins matching at the first start it keeps going until it finds stop. Non-greedy simply means that for the string startstartstopstop, it matches only up to the first stop rather than the last one.
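To make the start/stop example concrete, here is a small sketch that runs the lazy pattern next to its greedy counterpart on the same string. The variable names are only for illustration.

```python
import re

text = 'startstartstopstop'

# Lazy: the match still begins at the first "start" the engine finds,
# then extends only as far as the first "stop" after it.
lazy_match = re.search(r'start.*?stop', text).group()

# Greedy: same starting position, but extends to the last "stop".
greedy_match = re.search(r'start.*stop', text).group()

print(lazy_match)    # startstartstop
print(greedy_match)  # startstartstopstop
```

Both matches begin at index 0. Laziness only changes where the match ends, which is why the pattern in the question runs from an unlabeled >> all the way to the next colon.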

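For your question, the fix is to restrict what the captured group may contain, so that a match cannot run across a sentence and into the next >>. Allowing only letters and spaces in the name does that, and a positive lookahead checks that a colon follows.

You may use the pattern >> ([a-zA-Z ]+)(?=:) as follows: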

>>> import re
>>> transcript = '>> John doe: Hello, I am John Doe. >> Hello, I am Jane Doe. >> Thank you for coming, we will start in two minutes. >> Sam Smith: [no audio] Good morning, everyone.'
>>> re.findall(r'>> ([a-zA-Z ]+)(?=:)', transcript)
['John doe', 'Sam Smith']
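If speaker names might contain characters other than letters and spaces (apostrophes, hyphens, digits), a negated character class is a possible variation. This is a different pattern from the one above, sketched on the assumption that a name never contains ':' or '>'. The name speaker_pattern is just for illustration.

```python
import re

transcript = ('>> John doe: Hello, I am John Doe. '
              '>> Hello, I am Jane Doe. '
              '>> Thank you for coming, we will start in two minutes. '
              '>> Sam Smith: [no audio] Good morning, everyone.')

# [^:>]+? can never cross a ':' or a '>', so a match that begins at an
# unlabeled ">>" fails when it reaches the next ">>" instead of running on.
# The \s* on both sides keeps surrounding whitespace out of the captured name.
speaker_pattern = re.compile(r'>>\s*([^:>]+?)\s*:')

speakers = speaker_pattern.findall(transcript)
print(speakers)  # ['John doe', 'Sam Smith']
```

Like the letters-only pattern, this reports an unlabeled line as a speaker if that line happens to contain a colon before the next >>, so it is only as reliable as the transcript format.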
user3483203 answered Jan 30 '26 12:01


