
Regex: don't match string ending with newline (\n) with end-of-line anchor ($)

I can't figure out how to match a string only when it has no trailing newline character (\n). The $ anchor seems to ignore a trailing newline:

import re

print(re.match(r'^foobar$', 'foobar'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>

print(re.match(r'^foobar$', 'foobar\n'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>

print(re.match(r'^foobar$', 'foobar\n\n'))
# None

I expected the second case to return None as well.
When a pattern ends with $, as in ^foobar$, I would expect it to match only foobar, not foobar\n.

What am I missing?

Arthur White asked Feb 11 '18 10:02


3 Answers

You most likely want \Z rather than $:

>>> print(re.match(r'^foobar\Z', 'foobar\n'))
None
  • \Z matches only at the end of the string.
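A related option, assuming Python 3.4 or later, is re.fullmatch, which succeeds only when the pattern covers the whole string. The following is a minimal sketch comparing it with \Z; the names samples, z_results and full_results are purely illustrative.

```python
import re

samples = ['foobar', 'foobar\n', 'foobar\n\n']

# \Z anchors at the very end of the string, so a trailing newline is rejected.
z_results = [re.match(r'^foobar\Z', s) is not None for s in samples]

# re.fullmatch requires the pattern to cover the entire string,
# so no explicit anchors are needed.
full_results = [re.fullmatch(r'foobar', s) is not None for s in samples]

print(z_results)     # [True, False, False]
print(full_results)  # [True, False, False]
```

Both approaches accept only the bare 'foobar'. Which one reads better is a matter of taste; \Z keeps working with re.search, while re.fullmatch avoids anchors altogether.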
revo answered Oct 16 '22 23:10


This is the defined behavior of $, as can be read in the docs that @zvone linked to or even on https://regex101.com:

$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

You can use an explicit negative lookahead to counter this behavior:

import re

print(re.match(r'^foobar(?!\n)$', 'foobar'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>

print(re.match(r'^foobar(?!\n)$', 'foobar\n'))
# None

print(re.match(r'^foobar(?!\n)$', 'foobar\n\n'))
# None
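The same lookahead should carry over to patterns that are less fixed than foobar. As a rough sketch, here is a hypothetical validator (the function names and the identifier pattern are assumptions made up for illustration, not part of the question) showing how a trailing newline slips past a plain $ but is rejected once (?!\n) is added:

```python
import re

# Hypothetical patterns: a letter or underscore followed by word characters.
LOOSE = re.compile(r'^[A-Za-z_]\w*$')
STRICT = re.compile(r'^[A-Za-z_]\w*(?!\n)$')

def is_identifier_loose(text):
    """Plain $: still accepts one trailing newline."""
    return LOOSE.match(text) is not None

def is_identifier_strict(text):
    """(?!\\n)$: refuses to match when a newline follows."""
    return STRICT.match(text) is not None

print(is_identifier_loose('abc\n'))   # True
print(is_identifier_strict('abc\n'))  # False
print(is_identifier_strict('abc'))    # True
```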
DeepSpace answered Oct 17 '22 00:10


The documentation says this about the $ character:

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.

So, without the MULTILINE option, it matches the first two strings you tried, 'foobar' and 'foobar\n', but not 'foobar\n\n': there, the newline that follows foobar is not the last character of the string, so $ cannot match at that position.
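One way to see this rule in action is to list every index at which a bare $ matches. The helper below, dollar_positions, is a made-up name used only for this sketch:

```python
import re

def dollar_positions(text, flags=0):
    """Return every index at which a bare `$` matches in `text`."""
    return [m.start() for m in re.finditer(r'$', text, flags)]

print(dollar_positions('foobar'))      # [6]
print(dollar_positions('foobar\n'))    # [6, 7]
print(dollar_positions('foobar\n\n'))  # [7, 8]
```

For 'foobar\n\n', index 6 (right after foobar) is not in the list, which is why ^foobar$ fails on that string.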

On the other hand, with the MULTILINE option, $ matches at the end of every line:

>>> re.match(r'^foobar$', 'foobar\n\n', re.MULTILINE)
<_sre.SRE_Match object; span=(0, 6), match='foobar'>

Of course, this will also match in the following case, which may or may not be what you want:

>>> re.match(r'^foobar$', 'foobar\nanother line\n', re.MULTILINE)
<_sre.SRE_Match object; span=(0, 6), match='foobar'>
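To make the effect of the flag concrete, the sketch below lists where $ can match with and without re.MULTILINE, and checks that \Z is unaffected by it. The helper name end_anchor_positions is an assumption introduced for this example only.

```python
import re

def end_anchor_positions(text, flags=0):
    """Return every index at which a bare `$` matches in `text`."""
    return [m.start() for m in re.finditer(r'$', text, flags)]

text = 'foobar\nanother line\n'

# Default mode: only before the final newline and at the very end.
print(end_anchor_positions(text))                # [19, 20]

# MULTILINE: additionally before every newline inside the string.
print(end_anchor_positions(text, re.MULTILINE))  # [6, 19, 20]

# \Z still means "end of string", even with MULTILINE enabled.
print(re.match(r'^foobar\Z', 'foobar\n', re.MULTILINE))  # None
```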

To reject a trailing newline, use the negative lookahead as DeepSpace wrote.

zvone answered Oct 17 '22 00:10