What's the difference between findall() and using a for loop with an iterator to find pattern matches?

Tags: python, regex

I'm using this to calculate the number of sentences in a text:

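import codecs
import re

fileObj = codecs.open("someText.txt", "r", "utf-8")
shortText = fileObj.read()
fileObj.close()

pat = '[.]'

nSentences = 0  # the counter has to exist before the loop increments it
for match in re.finditer(pat, shortText, re.UNICODE):
    nSentences = nSentences + 1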

Someone told me this is better:

result = re.findall(pat, shortText)
nSentences = len(result)

Is there a difference? Don't they do the same thing?

Michael Eilers Smith asked Jan 18 '23 14:01


1 Answer

Both give the same count for this pattern. The second will probably be a little faster, since the iteration is done entirely in C. How much faster? About 15% in my tests (matching 'a' in 'a' * 16), though that margin will shrink as the regex gets more complex and takes up a larger share of the running time. It will use more memory, though, since it actually builds a list of every match. Unless you have a huge number of matches, that shouldn't be much extra memory.
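If you want to check the speed difference on your own data, something like the following rough sketch should do. The sample text and the function names are made up for illustration, and the numbers will vary with your machine, your Python version and your pattern, so treat them as a ballpark rather than a benchmark.

```python
import re
import timeit

pat = re.compile('[.]')
# Hypothetical sample text; substitute the contents of your own file.
text = "One. Two. Three. " * 1000


def count_with_finditer():
    # Explicit Python-level loop over the match iterator.
    n = 0
    for _ in pat.finditer(text):
        n += 1
    return n


def count_with_findall():
    # Builds the full list of matches, then takes its length.
    return len(pat.findall(text))


# Both approaches should agree before timing them means anything.
assert count_with_finditer() == count_with_findall()

loop_time = timeit.timeit(count_with_finditer, number=200)
list_time = timeit.timeit(count_with_findall, number=200)
print(f"finditer loop: {loop_time:.4f}s  findall + len: {list_time:.4f}s")
```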

As to which I'd prefer, I do kind of like the second's conciseness, especially when combined into a single statement:

nSentences = len(re.findall(pat, shortText))
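If the list worries you, one middle ground (my suggestion, not something from the question) is to keep the one-liner but feed finditer to sum(), which counts matches without storing them. Here is a small sketch with a stand-in string instead of the file, showing that all three approaches agree:

```python
import re

pat = '[.]'
# Stand-in for the text read from the file.
shortText = "First sentence. Second sentence. Third."

# Explicit loop over the iterator, as in the question.
n_loop = 0
for _ in re.finditer(pat, shortText):
    n_loop += 1

# findall builds a list of every match, then len() counts it.
n_list = len(re.findall(pat, shortText))

# Generator expression: one statement, no intermediate list.
n_gen = sum(1 for _ in re.finditer(pat, shortText))

print(n_loop, n_list, n_gen)  # prints "3 3 3"
```

Whether the generator version is any faster than the plain loop is something I have not measured; its appeal is conciseness without the extra list.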

kindall answered Apr 28 '23 04:04