
matching any character including newlines in a Python regex subexpression, not globally

Tags: python, regex

I want to use re.MULTILINE but NOT re.DOTALL, so that I can have a regex that includes both an "any character" wildcard and the normal . wildcard that doesn't match newlines.

Is there a way to do this? What should I use to match any character in those instances that I want to include newlines?

Jason S asked Oct 23 '15 22:10



2 Answers

To match a newline, or "any symbol" without re.S/re.DOTALL, you may use any of the following:

  1. (?s:.) - an inline modifier group with the s flag turned on; it sets a scope in which every . matches any character, including line break characters (scoped inline flags require Python 3.6 or newer)

  2. Any of the following work-arounds:

[\s\S] [\w\W] [\d\D] 

The main idea is that two complementary shorthand classes placed inside one character class together match every character that can appear in the input string.
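As a quick sanity check of the options above, the sketch below compiles each candidate with re.MULTILINE only (no re.DOTALL) and tries it against a handful of characters. The sample list is my own choice for illustration, not something taken from the question.

```python
import re

# Sample characters, including several line-break characters (illustrative only).
samples = ["a", " ", "\t", "\n", "\r", "\u2028", "é"]

# Candidate "any character" subpatterns from the list above.
candidates = [r"(?s:.)", r"[\s\S]", r"[\w\W]", r"[\d\D]"]

for pattern in candidates:
    # Compiled without re.DOTALL on purpose.
    regex = re.compile(pattern, re.MULTILINE)
    assert all(regex.fullmatch(ch) for ch in samples), pattern

# A bare dot, compiled the same way, still refuses the newline.
plain_dot = re.compile(r".", re.MULTILINE)
rejected = [ch for ch in samples if not plain_dot.fullmatch(ch)]
print(rejected)  # ['\n']
```

So the normal . keeps its usual meaning everywhere else in the pattern, and only the chosen subexpression is allowed to cross line breaks.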

Compared with (.|\s) and other alternation-based variants, the character class solution is much more efficient because it involves far less backtracking (when used with a * or + quantifier). In a small example, (?:.|\n)+ takes 45 steps to complete, while [\s\S]+ takes just 2.
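Python's re module does not report step counts the way a regex debugger does, so a rough way to see the difference locally is to time both patterns. This is only a sketch: the test string and the repetition count are arbitrary assumptions, and the absolute numbers will vary by machine and Python version.

```python
import re
import timeit

# Arbitrary multi-line input, used only for this comparison.
text = "line of text\n" * 500

alternation = re.compile(r"(?:.|\n)+")
char_class = re.compile(r"[\s\S]+")

# Both patterns consume the whole string, so they are interchangeable here.
alt_match = alternation.match(text).group()
cls_match = char_class.match(text).group()
assert alt_match == cls_match == text

# Wall-clock time as a rough proxy for the amount of work done.
t_alt = timeit.timeit(lambda: alternation.match(text), number=50)
t_cls = timeit.timeit(lambda: char_class.match(text), number=50)
print(f"alternation: {t_alt:.4f}s  character class: {t_cls:.4f}s")
```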

See a Python demo where I am matching a line starting with 123 and up to the first occurrence of 3 at the start of a line and including the rest of that line:

    import re

    text = "abc\n123\ndef\n356\nmore text..."
    print(re.findall(r"^123(?s:.*?)^3.*", text, re.M))
    # => ['123\ndef\n356']
    print(re.findall(r"^123[\w\W]*?^3.*", text, re.M))
    # => ['123\ndef\n356']

Wiktor Stribiżew answered Oct 06 '22 13:10


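Match any character (including newline):

Regular expression (note that a literal space ' ' is also part of the class):

[\S\n\t\v ]

This class does not cover every whitespace character: a carriage return (\r) or a form feed (\f), for example, is not matched. Use [\s\S] when truly any character must match.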

Example:

    import re

    text = 'abc def ###A quick brown fox.\nIt jumps over the lazy dog### ghi jkl'
    # We want to extract "A quick brown fox.\nIt jumps over the lazy dog"
    # Raw string avoids invalid-escape warnings; the group drops the ### delimiters
    matches = re.findall(r'###([\S\n ]+)###', text)
    print(matches[0])

'matches[0]' will contain:
'A quick brown fox.\nIt jumps over the lazy dog'

Description of '\S' from the Python docs:

\S Matches any character which is not a whitespace character.

( See: https://docs.python.org/3/library/re.html#regular-expression-syntax )
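To see which characters a hand-built class like this leaves out, it can be probed next to [\s\S]. The probe characters below are picked for illustration only.

```python
import re

custom = re.compile(r"[\S\n\t\v ]")   # the class from this answer
catch_all = re.compile(r"[\s\S]")     # complementary-class alternative

# Illustrative whitespace characters to probe with.
probes = {
    "newline": "\n",
    "tab": "\t",
    "carriage return": "\r",
    "form feed": "\f",
    "no-break space": "\u00a0",
}

# Names of the probe characters the custom class fails to match.
missed = [name for name, ch in probes.items() if not custom.fullmatch(ch)]
print(missed)  # ['carriage return', 'form feed', 'no-break space']

# The complementary class matches every probe.
assert all(catch_all.fullmatch(ch) for ch in probes.values())
```

If the input can contain Windows line endings (\r\n), that gap matters, and [\s\S] or (?s:.) is the safer choice.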

Ali Sajjad answered Oct 06 '22 13:10