
How to find all matches with a regex where part of the match overlaps

I have a long .txt file and want to find all matches of a regex in it.

For example:

import re
test_str = 'ali. veli. ahmet.'
src = re.finditer(r'(\w+\.\s){1,2}', test_str, re.MULTILINE)
print(*src)

This code returns:

<re.Match object; span=(0, 11), match='ali. veli. '>

I need:

['ali. veli', 'veli. ahmet.']

How can I do that with a regex?

Esat Mahmut Bayol asked May 16 '20 22:05



1 Answer

The (\w+\.\s){1,2} pattern contains a repeated capturing group. Python's re does not store every capture such a group makes; it keeps only the last one in the group's memory buffer. In any case, you do not need the repeated capturing group here: you want to extract multiple occurrences of a pattern from a string, and re.finditer or re.findall already does that for you.
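To illustrate the point about repeated capturing groups, here is a minimal sketch that runs the question's pattern once and inspects the groups. The variable names whole_match and last_capture are only illustrative.

```python
import re

test_str = 'ali. veli. ahmet.'
m = re.search(r'(\w+\.\s){1,2}', test_str)

whole_match = m.group(0)   # everything the pattern matched
last_capture = m.group(1)  # only the final repetition of the group survives

print(repr(whole_match))   # 'ali. veli. '
print(repr(last_capture))  # 'veli. '
```

The first repetition ('ali. ') is part of the whole match, but it is no longer available as a separate capture.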

Also, the re.MULTILINE flag is not necessary here since there are no ^ or $ anchors in the pattern.
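As a quick check of that claim, the sketch below compares the question's pattern with and without the flag, then shows a case where the flag does matter. The second sample string 'ali\nveli' is made up for the illustration.

```python
import re

test_str = 'ali. veli. ahmet.'
pattern = r'(\w+\.\s){1,2}'

# No ^ or $ in the pattern, so the flag changes nothing.
with_flag = [m.span() for m in re.finditer(pattern, test_str, re.MULTILINE)]
without_flag = [m.span() for m in re.finditer(pattern, test_str)]
print(with_flag == without_flag)  # True

# With an anchor, the flag lets ^ match at the start of every line.
anchored_default = re.findall(r'^\w+', 'ali\nveli')
anchored_multiline = re.findall(r'^\w+', 'ali\nveli', re.MULTILINE)
print(anchored_default)    # ['ali']
print(anchored_multiline)  # ['ali', 'veli']
```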

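Ordinary matches consume the text they match, so two matches can never overlap. To get overlapping results, put the pattern inside a capturing group within a positive lookahead: a lookahead consumes no characters, so the engine can try again at every position in the string. You may get the expected results using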

import re
test_str = 'ali. veli. ahmet.'
src = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
print(src)
# => ['ali. veli', 'veli. ahmet']
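If the positions of the overlapping matches are needed as well, the same pattern should work with re.finditer. This is only a sketch; the pairs name is illustrative. Note that the lookahead match itself is empty, so the text and its span have to be read from group 1.

```python
import re

pattern = re.compile(r'(?=\b(\w+\.\s+\w+))')
test_str = 'ali. veli. ahmet.'

# Each match is zero-width; the useful text and span live in group 1.
pairs = [(m.group(1), m.span(1)) for m in pattern.finditer(test_str)]
print(pairs)
# [('ali. veli', (0, 9)), ('veli. ahmet', (5, 16))]
```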


The pattern means

  • (?= - start of a positive lookahead
    • \b - a word boundary (crucial here: it ensures a capture can only start at the beginning of a word)
    • (\w+\.\s+\w+) - Capturing group 1: one or more word characters, a literal dot, one or more whitespace characters, and one or more word characters
  • ) - end of the lookahead.
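To see why the \b matters, the following sketch runs the same lookahead with and without the word boundary. Without it, the engine also reports a match starting at every later letter of each word.

```python
import re

test_str = 'ali. veli. ahmet.'

with_boundary = re.findall(r'(?=\b(\w+\.\s+\w+))', test_str)
without_boundary = re.findall(r'(?=(\w+\.\s+\w+))', test_str)

print(with_boundary)
# ['ali. veli', 'veli. ahmet']
print(without_boundary)
# ['ali. veli', 'li. veli', 'i. veli', 'veli. ahmet', 'eli. ahmet', 'li. ahmet', 'i. ahmet']
```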
Wiktor Stribiżew answered Oct 17 '22 11:10