
Regular Expression Matching First Non-Repeated Character

Tags: python, regex, regex-lookarounds

TL;DR

re.search(r"(.)(?!.*\1)", text).group() doesn't match the first non-repeating character contained in text. It always returns a character at or before the first non-repeated character, or a character before the end of the string if there are no non-repeated characters, whereas my understanding is that re.search() should return None if there were no matches. I'm only interested in understanding why this regex does not work as intended with the Python re module, not in any other method of solving the problem.

Full Background

The problem description comes from https://www.codeeval.com/open_challenges/12/. I've already solved this problem using a non-regex method, but revisited it to expand my understanding of Python's re module. The regular expressions I thought would work (named vs unnamed backreferences) are:

(?P<letter>.)(?!.*(?P=letter)) and (.)(?!.*\1) (same results in python2 and python3)

My entire program looks like this:

    import re
    import sys

    with open(sys.argv[1], 'r') as test_cases:
        for test in test_cases:
            print(re.search("(?P<letter>.)(?!.*(?P=letter))",
                            test.strip()
                           ).group()
                 )

and some input/output pairs are:

    rain | r
    teething | e
    cardiff | c
    kangaroo | k
    god | g
    newtown | e
    taxation | x
    refurbished | f
    substantially | u

According to what I've read at https://docs.python.org/2/library/re.html:

(.) creates a capturing group (numbered, not named) that matches any character and allows later backreferences to it as \1.
(?!...) is a negative lookahead which restricts matches to cases where ... does not match.
.*\1 means any number (including zero) of characters followed by whatever was matched by (.) earlier.
re.search(pattern, string) returns only the first location where the regex pattern produces a match (and would return None if no match could be found).
.group() is equivalent to .group(0), which returns the entire match.
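As a quick check of the pieces listed above, here is a minimal sketch using only the standard re module. The sample strings are made up for illustration. Note the raw-string prefix on the patterns: in a non-raw Python string literal, \1 is an octal escape for the character U+0001 rather than a backreference.

```python
import re

# Backreference: \1 must repeat whatever group 1 captured.
doubled = re.search(r"(.)\1", "tooth")

# Negative lookahead: "a" matches only when it is not followed by "b".
blocked = re.search(r"a(?!b)", "ab")
allowed = re.search(r"a(?!b)", "cab ac")

# Without the r prefix, "\1" is the single control character U+0001.
non_raw_is_control_char = "\1" == "\x01"

print(doubled.group(), doubled.group(0), doubled.group(1))
print(blocked)
print(allowed.start())
print(non_raw_is_control_char)
```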

I think these pieces together should solve the stated problem, and the regex does work as I expect for most inputs, but it fails on teething. Throwing similar inputs at it reveals that it seems to ignore repeated characters if they are consecutive:

    tooth | o      # fails on consecutive repeated characters
    aardvark | d   # but does ok if it sees them later
    aah | a        # verified last one didn't work just because it was at start
    heh | e        # but it works for this one
    hehe | h       # What? It thinks h matches (lookahead maybe doesn't find "heh"?)
    heho | e       # but it definitely finds "heh" and stops "h" from matching here
    hahah | a      # so now it won't match h but will match a
    hahxyz | a     # but it realizes there are 2 h characters here...
    hahxyza | h    # ... Ok time for StackOverflow

I know lookbehind and negative lookbehind are limited to fixed-length patterns, and (according to the documentation linked above) cannot contain backreferences even if they evaluate to a fixed-length string, but I didn't see the documentation specify any restrictions on negative lookahead.


asked Sep 30 '16 17:09

stevenjackson121

2 Answers

Well, let's take your tooth example. Here is what the regex engine does (greatly simplified for clarity):

Start with t, then look ahead in the string and fail the lookahead, as there is another t.

    tooth
    ^  °

Next take o, look ahead in the string, and fail, as there is another o.

    tooth
     ^°

Next take the second o and look ahead in the string: no other o is present, so match it, return it, work done.

    tooth
      ^

So your regex doesn't match the first unrepeated character, but the first one that has no further repetitions towards the end of the string.
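To make that concrete, here is a small sketch (the helper name first_without_later_repeat is made up for illustration) that restates the lookahead in plain Python and compares it with the regex on some inputs from the question. It also suggests why re.search does not return None for a non-empty single-line string: the last character can never be followed by a copy of itself, so it always satisfies the lookahead.

```python
import re

PATTERN = re.compile(r"(.)(?!.*\1)")

def first_without_later_repeat(text):
    """Return the first character that does not occur again later in text."""
    for index, char in enumerate(text):
        if char not in text[index + 1:]:
            return char
    return None  # only reached for an empty string

words = ["tooth", "teething", "hehe", "hahxyza", "aardvark"]
results = {word: PATTERN.search(word).group() for word in words}

for word, found in results.items():
    # The regex and the plain-Python restatement agree on every word.
    print(word, "|", found, "|", first_without_later_repeat(word))
```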


answered Sep 21 '22 22:09

Sebastian Proske

Sebastian's answer already explains pretty well why your current attempt doesn't work.

.NET

Since ~~you're~~ revo is interested in a .NET flavor workaround, the solution becomes trivial:

(?<letter>.)(?!.*?\k<letter>)(?<!\k<letter>.+?)


This works because .NET supports variable-length lookbehinds. You can also get that result with Python (see below).

So for each letter (?<letter>.) we check:

if it's repeated further in the input: (?!.*?\k<letter>)
if it was already encountered before: (?<!\k<letter>.+?) (we have to skip the letter we're testing when going backwards, hence the +).

Python

The regex module for Python (a third-party package, not the built-in re) also supports variable-length lookbehinds, so the regex above will work with a small syntactical change: you need to replace \k with \g (which is quite unfortunate, as with this module \g is a group backreference, whereas with PCRE it's a recursion).

The regex is:

(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)

And here's an example:

    $ python
    Python 2.7.10 (default, Jun  1 2015, 18:05:38)
    [GCC 4.9.2] on cygwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import regex
    >>> regex.search(r'(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)', 'tooth')
    <regex.Match object; span=(4, 5), match='h'>
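If installing the third-party regex module is not an option, the two lookaround checks can be written out by hand. The following is only a sketch of the same logic in plain Python rather than a regex solution, and the function name is hypothetical.

```python
def first_non_repeated(text):
    """Return the first character that occurs exactly once in text, or None."""
    for index, char in enumerate(text):
        repeated_later = char in text[index + 1:]  # mirrors (?!.*?\g<letter>)
        seen_before = char in text[:index]         # mirrors (?<!\g<letter>.+?)
        if not repeated_later and not seen_before:
            return char
    return None

print(first_non_repeated("tooth"))
```

For tooth this prints h, the same character the regex module session above reports.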

PCRE

Ok, now things start to get dirty: since PCRE doesn't support variable-length lookbehinds, we need to somehow remember whether a given letter was already encountered in the input or not.

Unfortunately, the regex engine doesn't provide random access memory support. The best we can get in terms of generic memory is a stack - but that's not sufficient for this purpose, as a stack only lets us access its topmost element.

If we accept restricting ourselves to a given alphabet, we can abuse capturing groups to store flags. Let's see this on a limited alphabet of the three letters abc:

    # Anchor the pattern
    \A

    # For each letter, test to see if it's duplicated in the input string
    (?(?=[^a]*+a[^a]*a)(?<da>))
    (?(?=[^b]*+b[^b]*b)(?<db>))
    (?(?=[^c]*+c[^c]*c)(?<dc>))

    # Skip any duplicated letter and throw it away
    [a-c]*?\K

    # Check if the next letter is a duplicate
    (?:
      (?(da)(*FAIL)|a)
    | (?(db)(*FAIL)|b)
    | (?(dc)(*FAIL)|c)
    )

Here's how that works:

First, the \A anchor ensures we'll process the input string only once.
Then, for each letter X of our alphabet, we set up an "is duplicate" flag dX using the conditional pattern (?(cond)then|else):
- The condition is (?=[^X]*+X[^X]*X), which is true if the input string contains the letter X twice.
- If the condition is true, the then clause is (?<dX>), which is an empty capturing group that will match the empty string.
- If the condition is false, the dX group won't be matched.
Next, we lazily skip valid letters from our alphabet: [a-c]*?
And we throw them out of the final match with \K.
Now, we try to match one letter whose dX flag is not set. For this purpose, we use a conditional branch: (?(dX)(*FAIL)|X)
- If dX was matched (meaning that X is a duplicated character), we (*FAIL), forcing the engine to backtrack and try a different letter.
- If dX was not matched, we try to match X. At this point, if this succeeds, we know that X is the first non-duplicated letter.
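The bookkeeping described above can be sketched outside of regex. This is an illustrative restatement in plain Python, not the PCRE pattern itself; the function name and the dictionary of flags are assumptions made for the example.

```python
import string

def first_unflagged_letter(text, alphabet=string.ascii_lowercase):
    """Emulate the flag-based pattern: one "is duplicate" flag per letter."""
    # Counterpart of the dX groups: True when the letter occurs at least twice.
    duplicated = {letter: text.count(letter) >= 2 for letter in alphabet}
    for char in text:
        if char not in duplicated:
            # The lazy skip only accepts letters of the alphabet, so the
            # anchored pattern cannot get past any other character.
            return None
        if not duplicated[char]:
            return char  # first letter whose flag is not set
    return None

print(first_unflagged_letter("tooth"))
```

The dictionary gives direct access to any flag, which is exactly what the regex engine lacks and what the empty capturing groups stand in for.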

That last part of the pattern could also be replaced with:

    (?:
      a (*THEN) (?(da)(*FAIL))
    | b (*THEN) (?(db)(*FAIL))
    | c (*THEN) (?(dc)(*FAIL))
    )

This is somewhat more optimized: it matches the current letter first and only then checks if it's a duplicate.

The full pattern for the lowercase letters a-z looks like this:

    # Anchor the pattern
    \A

    # For each letter, test to see if it's duplicated in the input string
    (?(?=[^a]*+a[^a]*a)(?<da>))
    (?(?=[^b]*+b[^b]*b)(?<db>))
    (?(?=[^c]*+c[^c]*c)(?<dc>))
    (?(?=[^d]*+d[^d]*d)(?<dd>))
    (?(?=[^e]*+e[^e]*e)(?<de>))
    (?(?=[^f]*+f[^f]*f)(?<df>))
    (?(?=[^g]*+g[^g]*g)(?<dg>))
    (?(?=[^h]*+h[^h]*h)(?<dh>))
    (?(?=[^i]*+i[^i]*i)(?<di>))
    (?(?=[^j]*+j[^j]*j)(?<dj>))
    (?(?=[^k]*+k[^k]*k)(?<dk>))
    (?(?=[^l]*+l[^l]*l)(?<dl>))
    (?(?=[^m]*+m[^m]*m)(?<dm>))
    (?(?=[^n]*+n[^n]*n)(?<dn>))
    (?(?=[^o]*+o[^o]*o)(?<do>))
    (?(?=[^p]*+p[^p]*p)(?<dp>))
    (?(?=[^q]*+q[^q]*q)(?<dq>))
    (?(?=[^r]*+r[^r]*r)(?<dr>))
    (?(?=[^s]*+s[^s]*s)(?<ds>))
    (?(?=[^t]*+t[^t]*t)(?<dt>))
    (?(?=[^u]*+u[^u]*u)(?<du>))
    (?(?=[^v]*+v[^v]*v)(?<dv>))
    (?(?=[^w]*+w[^w]*w)(?<dw>))
    (?(?=[^x]*+x[^x]*x)(?<dx>))
    (?(?=[^y]*+y[^y]*y)(?<dy>))
    (?(?=[^z]*+z[^z]*z)(?<dz>))

    # Skip any duplicated letter and throw it away
    [a-z]*?\K

    # Check if the next letter is a duplicate
    (?:
      a (*THEN) (?(da)(*FAIL))
    | b (*THEN) (?(db)(*FAIL))
    | c (*THEN) (?(dc)(*FAIL))
    | d (*THEN) (?(dd)(*FAIL))
    | e (*THEN) (?(de)(*FAIL))
    | f (*THEN) (?(df)(*FAIL))
    | g (*THEN) (?(dg)(*FAIL))
    | h (*THEN) (?(dh)(*FAIL))
    | i (*THEN) (?(di)(*FAIL))
    | j (*THEN) (?(dj)(*FAIL))
    | k (*THEN) (?(dk)(*FAIL))
    | l (*THEN) (?(dl)(*FAIL))
    | m (*THEN) (?(dm)(*FAIL))
    | n (*THEN) (?(dn)(*FAIL))
    | o (*THEN) (?(do)(*FAIL))
    | p (*THEN) (?(dp)(*FAIL))
    | q (*THEN) (?(dq)(*FAIL))
    | r (*THEN) (?(dr)(*FAIL))
    | s (*THEN) (?(ds)(*FAIL))
    | t (*THEN) (?(dt)(*FAIL))
    | u (*THEN) (?(du)(*FAIL))
    | v (*THEN) (?(dv)(*FAIL))
    | w (*THEN) (?(dw)(*FAIL))
    | x (*THEN) (?(dx)(*FAIL))
    | y (*THEN) (?(dy)(*FAIL))
    | z (*THEN) (?(dz)(*FAIL))
    )

And here's the demo on regex101, complete with unit tests.

You can expand on this pattern if you need a larger alphabet, but obviously this is not a general-purpose solution. It's primarily of educational interest and should not be used for any serious application.

For other flavors, you may try to tweak the pattern to replace PCRE features with simpler equivalents:

\A becomes ^
X (*THEN) (?(dX)(*FAIL)) can be replaced with (?(dX)(?!)|X)
You may throw away the \K and replace the last non-capturing group (?:...) with a named group like (?<letter>...) and treat its content as the result.

The only required but somewhat unusual construct is the conditional group (?(cond)then|else).
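As an illustration of such a port, here is a hedged sketch for Python's built-in re module. It applies the tweaks listed above, plus two substitutions that are my own assumptions rather than part of the original pattern: re only accepts a group name or number as the condition, so each flag is set with an alternation of a lookahead and its negation instead of (?(?=...)...), and the possessive *+ is written as a plain * because older versions of re lack possessive quantifiers. The names build_pattern, LOWERCASE and first_non_repeated are hypothetical.

```python
import re
import string

def build_pattern(alphabet):
    """Build the flag-based pattern for re; alphabet must be plain letters."""
    flags = []
    branches = []
    for letter in alphabet:
        twice = "[^{0}]*{0}[^{0}]*{0}".format(letter)
        # Set the empty group d<letter> only when the letter occurs twice;
        # the second alternative keeps the two cases mutually exclusive.
        flags.append("(?:(?={0})(?P<d{1}>)|(?!{0}))".format(twice, letter))
        # Fail if the flag is set, otherwise try to match the letter itself.
        branches.append("(?(d{0})(?!)|{0})".format(letter))
    return re.compile(
        r"\A" + "".join(flags)
        + "[" + alphabet + "]*?"
        + "(?P<letter>" + "|".join(branches) + ")"
    )

LOWERCASE = build_pattern(string.ascii_lowercase)

def first_non_repeated(text):
    """Return the first lowercase letter occurring once in text, or None."""
    match = LOWERCASE.search(text)
    return match.group("letter") if match else None

print(first_non_repeated("tooth"))
```

Like the PCRE version, this is tied to a fixed alphabet and is mainly of educational interest.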

answered Sep 22 '22 22:09

Lucas Trzesniewski
