Regex: don't match string ending with newline (\n) with end-of-line anchor ($)

Tags:

python

regex

python-3.x

newline

match

I can't figure out how to match a string only when it has no trailing newline character (\n). The newline seems to be stripped automatically:

import re

print(re.match(r'^foobar$', 'foobar'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>

print(re.match(r'^foobar$', 'foobar\n'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>

print(re.match(r'^foobar$', 'foobar\n\n'))
# None

I expected the second case to return None as well.
When a pattern ends with $, as in ^foobar$, it should only match a string like foobar, not foobar\n.

What am I missing?

asked Feb 11 '18 10:02

Arthur White

3 Answers

You most likely want \Z rather than $:

>>> print(re.match(r'^foobar\Z', 'foobar\n'))
None

\Z matches only at the end of the string.
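To make the difference concrete, here is a small sketch that runs the three strings from the question through $, \Z, and re.fullmatch. The helper name accepted and the candidates list are only for illustration, and re.fullmatch is an alternative not mentioned above that should behave like \Z here because it requires the whole string to be consumed:

```python
import re

# The three inputs from the question.
candidates = ['foobar', 'foobar\n', 'foobar\n\n']

def accepted(matcher):
    """Return the candidates for which matcher(text) finds a match."""
    return [text for text in candidates if matcher(text) is not None]

# $ tolerates one trailing newline; \Z and fullmatch do not.
dollar = accepted(lambda text: re.match(r'^foobar$', text))
end_of_string = accepted(lambda text: re.match(r'^foobar\Z', text))
full = accepted(lambda text: re.fullmatch(r'foobar', text))

print(dollar)         # ['foobar', 'foobar\n']
print(end_of_string)  # ['foobar']
print(full)           # ['foobar']
```

With re.fullmatch the anchors can be dropped from the pattern entirely, which may be the simplest option if you always want to match the whole string.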

answered Oct 16 '22 23:10

revo

This is the documented behavior of $, as described in the docs that @zvone linked to and on https://regex101.com:

$ asserts position at the end of the string, or before the line terminator right at the end of the string (if any)

You can use an explicit negative lookahead to counter this behavior:

import re

print(re.match(r'^foobar(?!\n)$', 'foobar'))
# <_sre.SRE_Match object; span=(0, 6), match='foobar'>

print(re.match(r'^foobar(?!\n)$', 'foobar\n'))
# None

print(re.match(r'^foobar(?!\n)$', 'foobar\n\n'))
# None

answered Oct 17 '22 00:10

DeepSpace

The documentation says this about the $ character:

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.

So, without the MULTILINE option, it matches exactly the first two strings you tried, 'foobar' and 'foobar\n', but not 'foobar\n\n', because there the newline right after foobar is not the last character of the string.
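If it helps to see this directly, the following sketch lists every index at which a bare $ can match in each of the three strings. The helper dollar_positions is a made-up name for this illustration, not something from the re module:

```python
import re

def dollar_positions(text, flags=0):
    """Return every index at which a bare $ matches in text."""
    return [m.start() for m in re.finditer(r'$', text, flags)]

# 'foobar' is 6 characters long, so ^foobar$ needs $ to match at index 6.
print(dollar_positions('foobar'))      # [6]
print(dollar_positions('foobar\n'))    # [6, 7]
print(dollar_positions('foobar\n\n'))  # [7, 8]
```

Index 6 is available in the first two strings but not in the third, which is why only the third one fails to match.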

On the other hand, if you use the MULTILINE option, $ will match at the end of any line:

>>> re.match(r'^foobar$', 'foobar\n\n', re.MULTILINE)
<_sre.SRE_Match object; span=(0, 6), match='foobar'>

Of course, this will also match in the following case, which may or may not be what you want:

>>> re.match(r'^foobar$', 'foobar\nanother line\n', re.MULTILINE)
<_sre.SRE_Match object; span=(0, 6), match='foobar'>

To reject a string that ends with a newline, use the negative lookahead from DeepSpace's answer.
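As a quick check, here is a sketch of how the stricter patterns from the other answers behave once MULTILINE is switched on. The variable names are only for illustration, and the input is the multi-line string used above:

```python
import re

text = 'foobar\nanother line\n'

# Plain $ matches at the end of the first line in MULTILINE mode.
multiline_dollar = re.match(r'^foobar$', text, re.MULTILINE)

# \Z is not affected by MULTILINE: it still means "end of the string".
multiline_z = re.match(r'^foobar\Z', text, re.MULTILINE)

# The lookahead rejects any position that is followed by a newline.
multiline_lookahead = re.match(r'^foobar(?!\n)$', text, re.MULTILINE)

print(multiline_dollar is not None)  # True
print(multiline_z)                   # None
print(multiline_lookahead)           # None
```

So both the lookahead and \Z stay strict about a newline after foobar even with MULTILINE enabled.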