
Python Regex Bullet Point and multiple line match

Tags:

python

regex

I want to match the text between two words or phrases in Python. The text contains bullet points and spans multiple lines, and the pattern should work whatever words appear between the start and end markers. I don't know how to refer to the bullet point character in a regex, or how to match everything including line breaks. For example, I'm trying to match:

Hello • World Hello • World Hello • World Hello • World Hello • World Hello • World 

in

 hello_big_old_world = "qweqrqr  Hello • World Hello • World Hello • World Hello • World Hello • World Hello • World fdsfdas"

The real string spans multiple lines. I know it's probably not even in the ballpark, but here's what I have so far; it obviously isn't working.

Answer = re.findall("(?<=qweqrqr)(.*\n?)/s(?=fdsfdas)"), hello_big_old_world)
print(Answer)

Thanks in Advance.

crooose asked May 09 '18 11:05


3 Answers

You may match the string from qweqrqr to fdsfdas with at least 1 bullet point using

import re

hello_big_old_world = "qweqrqr  Hello • World Hello • World Hello • World Hello • World Hello • World Hello • World fdsfdas"
print(re.findall(r'qweqrqr([^\u2022]*\u2022.*?)fdsfdas', hello_big_old_world, re.S))

See the Python 3 demo.

Note that you may use the literal • character instead of the Unicode escape \u2022. You can also strip the whitespace from the captured text by adding \s* (zero or more whitespace characters) on both sides of the capturing group:

re.findall(r'qweqrqr\s*([^•]*•.*?)\s*fdsfdas', hello_big_old_world, re.S)

It should work in both Python 3 and Python 2.

Details

  • qweqrqr - matches the left delimiter
  • ([^\u2022]*\u2022.*?) / ([^•]*•.*?) - captures into Group 1 (the string returned by re.findall)
    • [^\u2022]* / [^•]* - any 0+ chars other than the bullet point
    • \u2022 / • - the bullet point
    • .*? - any 0+ chars (including newlines, due to the re.S (= re.DOTALL) flag), as few as possible (due to the lazy quantifier *?)
  • fdsfdas - matches the right delimiter
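
As a rough check of the breakdown above, the sketch below runs the pattern on a string that really does span several lines. The sample text and variable names are made up for illustration, and it assumes Python 3; only the pattern itself comes from this answer.

```python
import re

# Hypothetical multi-line input; the bullet is U+2022 (BULLET).
text = "qweqrqr\nHello \u2022 World\nHello \u2022 World\nfdsfdas"

# The escaped form and the literal form describe the same character.
pattern_escaped = r'qweqrqr\s*([^\u2022]*\u2022.*?)\s*fdsfdas'
pattern_literal = r'qweqrqr\s*([^•]*•.*?)\s*fdsfdas'

matches_escaped = re.findall(pattern_escaped, text, re.S)
matches_literal = re.findall(pattern_literal, text, re.S)

# A delimited block that contains no bullet point is not matched at all.
no_bullet = re.findall(pattern_escaped, "qweqrqr\nHello World\nfdsfdas", re.S)

print(matches_escaped)  # ['Hello • World\nHello • World']
print(no_bullet)        # []
```

The last call shows what the "at least 1 bullet point" requirement means in practice: without a bullet between the delimiters, re.findall returns an empty list.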
Wiktor Stribiżew answered Oct 22 '22 16:10


To match all characters including newlines, you still use the . character, but pass flags=re.DOTALL to functions such as re.findall.
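
A minimal sketch of that difference, using a made-up two-line string (the delimiters "start" and "end" are placeholders, not from the question): without the flag, . stops at the newline and nothing is found; with re.DOTALL the match spans both lines.

```python
import re

# Hypothetical sample: the two delimiters sit on different lines.
sample = "start line one\nline two end"

# By default, "." matches any character except a newline.
without_flag = re.findall(r"start(.*)end", sample)

# With re.DOTALL, "." also matches the newline.
with_flag = re.findall(r"start(.*)end", sample, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # [' line one\nline two ']
```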

Alex Hall answered Oct 22 '22 17:10


You can use your regex with slight changes:

  • /s should be \s.

  • use re.DOTALL to match cases where you have newlines in-between.

Working code:

import re

hello_big_old_world = 'qweqrqr  Hello • World Hello • World Hello • World Hello • World Hello • World Hello • World fdsfdas'

Answer = re.findall(r"(?<=qweqrqr)(.*\n?)\s(?=fdsfdas)", hello_big_old_world, re.DOTALL)
print(Answer)

# ['  Hello • World Hello • World Hello • World Hello • World Hello • World Hello • World']
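
One caveat, sketched here under the assumption that the text may contain more than one delimited block (the question does not say either way): a greedy .* runs to the last fdsfdas, while a lazy .*? stops at the first one. The input below is made up, and the pattern is a simplified variant of the one above with the optional \n? dropped, since . already matches newlines under re.DOTALL.

```python
import re

# Hypothetical input with two delimited blocks on separate lines.
text = "qweqrqr A \u2022 B fdsfdas\nqweqrqr C \u2022 D fdsfdas"

# Greedy: one match that swallows the first closing delimiter.
greedy = re.findall(r"(?<=qweqrqr)(.*)\s(?=fdsfdas)", text, re.DOTALL)

# Lazy: one match per delimited block.
lazy = re.findall(r"(?<=qweqrqr)(.*?)\s(?=fdsfdas)", text, re.DOTALL)

print(greedy)  # [' A • B fdsfdas\nqweqrqr C • D']
print(lazy)    # [' A • B', ' C • D']
```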
Austin answered Oct 22 '22 16:10