 

How would I match a string that may or may not span multiple lines?

Tags: python, regex

I have a document that, when converted to text, splits the phone number across multiple lines like this:

(xxx)-xxx-
xxxx

For a variety of reasons related to my project I can't simply join the lines.

If I know that phonenumber = "(555)-555-5555", how can I compile a regex so that if I run it over

(555)-555-
5555

it will match?

EDIT

To help clarify my question here it is in a more abstract form.

test_string = "xxxx xx x xxxx"
text = """xxxx xx
x
xxxx"""

I need the test string to be found in the text. Newlines can be anywhere in the text and characters that need to be escaped should be taken into consideration.

FlashBanistan asked Nov 18 '25 09:11


1 Answer

A simple workaround would be to replace all the \n characters in the document text before you search it:

import re

pat = re.compile(r'\(\d{3}\)-\d{3}-\d{4}')
numbers = pat.findall(text.replace('\n', ''))

# ['(555)-555-5555']
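
If the lines cannot be joined because you need positions in the original text, one option might be to search a newline-free copy and map the match offsets back. This is only a minimal sketch under that assumption; the function name find_with_original_offsets is hypothetical and not part of any library:

```python
import re

def find_with_original_offsets(pattern, text):
    """Search text with newlines ignored; report spans in the original text."""
    # Indices (in the original text) of every character that is not a newline.
    keep = [i for i, ch in enumerate(text) if ch != '\n']
    stripped = ''.join(text[i] for i in keep)
    results = []
    for m in re.finditer(pattern, stripped):
        if m.end() == m.start():
            continue  # skip empty matches, they have no last character
        start = keep[m.start()]        # first matched character in the original
        end = keep[m.end() - 1] + 1    # one past the last matched character
        results.append((start, end, text[start:end]))
    return results

text = "call (555)-555-\n5555 today"
print(find_with_original_offsets(r'\(\d{3}\)-\d{3}-\d{4}', text))
# [(5, 20, '(555)-555-\n5555')]
```

The original text is never modified; only a temporary copy is searched, and each result carries the span and the exact substring (newline included) from the original.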

If this cannot be done for any reason, the obvious answer, though unsightly, is to allow optional newline characters between every pair of characters in the search string:

pat = re.compile(r'\(\n*5\n*5\n*5\n*\)\n*-\n*5\n*5\n*5\n*-\n*5\n*5\n*5\n*5')

If you need to handle any format, you can build the padded pattern from the number itself:

phonenumber = '(555)-555-5555'
# Escape each character, then allow optional newlines between characters.
pat = re.compile(r'\n*'.join(re.escape(c) for c in phonenumber))

# pat.pattern
# \(\n*5\n*5\n*5\n*\)\n*\-\n*5\n*5\n*5\n*\-\n*5\n*5\n*5\n*5
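
The abstract form in the question's edit has newlines replacing spaces, not just inserted between characters, so the padded pattern above would miss it. A possible extension, sketched here as an assumption about what is wanted, treats each whitespace character in the search string as "one or more whitespace characters" in the text. The helper name loose_pattern is hypothetical:

```python
import re

def loose_pattern(literal):
    """Compile a pattern matching `literal` even when newlines are inserted
    between characters or appear in place of its spaces."""
    parts = []
    for ch in literal:
        if ch.isspace():
            parts.append(r'\s+')         # a space may show up as a newline
        else:
            parts.append(re.escape(ch))  # escape regex metacharacters
    # Optional newlines are allowed between every pair of characters.
    return re.compile(r'\n*'.join(parts))

test_string = "xxxx xx x xxxx"
text = "xxxx xx\nx\nxxxx"
print(loose_pattern(test_string).search(text) is not None)
# True

print(loose_pattern('(555)-555-5555').findall('(555)-555-\n5555'))
# ['(555)-555-\n5555']
```

Using \s+ also lets tabs or repeated spaces stand in for a single space; if that is too permissive, [ \n]+ would restrict it to spaces and newlines.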

Test case:

import random
def rndinsert(s):
    i = random.randrange(len(s)-1)
    return s[:i] + '\n' + s[i:]

for i in range(10):
    print(pat.findall(rndinsert('abc (555)-555-5555 def')))

# ['(555)-555-5555']
# ['(555)-5\n55-5555']
# ['(555)-5\n55-5555']
# ['(555)-555-5555']
# ['(555\n)-555-5555']
# ['(5\n55)-555-5555']
# ['(555)\n-555-5555']
# ['(555)-\n555-5555']
# ['(\n555)-555-5555']
# ['(555)-555-555\n5']
r.ook answered Nov 21 '25 00:11