TL;DR
re.search("(.)(?!.*\1)", text).group()
doesn't match the first non-repeating character contained in text (it always returns a character at or before the first non-repeated character, or before the end of the string if there are no non-repeated characters. My understanding is that re.search() should return None if there were no matches). I'm only interested in understanding why this regex is not working as intended using the Python re
module, not in any other method of solving the problem
Full Background
The problem description comes from https://www.codeeval.com/open_challenges/12/. I've already solved this problem using a non-regex method, but revisited it to expand my understanding of Python's re
module. The regular expressions i thought would work (named vs unnamed backreferences) are:
(?P<letter>.)(?!.*(?P=letter))
and (.)(?!.*\1)
(same results in python2 and python3)
My entire program looks like this
import re import sys with open(sys.argv[1], 'r') as test_cases: for test in test_cases: print(re.search("(?P<letter>.)(?!.*(?P=letter))", test.strip() ).group() )
and some input/output pairs are:
rain | r teetthing | e cardiff | c kangaroo | k god | g newtown | e taxation | x refurbished | f substantially | u
According to what I've read at https://docs.python.org/2/library/re.html:
(.)
creates a named group that matches any character and allows later backreferences to it as \1
. (?!...)
is a negative lookahead which restricts matches to cases where ...
does not match..*\1
means any number (including zero) of characters followed by whatever was matched by (.)
earlierre.search(pattern, string)
returns only the first location where the regex pattern produces a match (and would return None if no match could be found).group()
is equivalent to .group(0)
which returns the entire matchI think these pieces together should solve the stated problem, and it does work like I think it should for most inputs, but failed on teething
. Throwing similar problems at it reveals that it seems to ignore repeated characters if they are consecutive:
tooth | o # fails on consecutive repeated characters aardvark | d # but does ok if it sees them later aah | a # verified last one didn't work just because it was at start heh | e # but it works for this one hehe | h # What? It thinks h matches (lookahead maybe doesn't find "heh"?) heho | e # but it definitely finds "heh" and stops "h" from matching here hahah | a # so now it won't match h but will match a hahxyz | a # but it realizes there are 2 h characters here... hahxyza | h # ... Ok time for StackOverflow
I know lookbehind and negative lookbehind are limited to 3-character-max fixed length strings, and cannot contain backreferences even if they evaluate to a fixed length string, but I didn't see the documentation specify any restrictions on negative lookahead.
Using the indexOf() and lastIndexOf() method, we can find the first non-repeating character in a string in Java. The method indexOf() returns the position of the first occurrence of a given character in a string whereas method lastIndexOf() returns the position of the last occurrence of a given character in a string.
Well let's take your tooth
example - here is what the regex-engine does (a lot simplified for better understanding)
Start with t
then look ahead in the string - and fail the lookahead, as there is another t
.
tooth ^ °
Next take o
, look ahead in the string - and fail, as there is another o
.
tooth ^°
Next take the second o
, look ahead in the string - no other o
present - match it, return it, work done.
tooth ^
So your regex doesn't match the first unrepeated character, but the first one, that has no further repetitions towards the end of the string.
Sebastian's answer already explains pretty well why your current attempt doesn't work.
Since you're revo is interested in a .NET flavor workaround, the solution becomes trivial:
(?<letter>.)(?!.*?\k<letter>)(?<!\k<letter>.+?)
Demo link
This works because .NET supports variable-length lookbehinds. You can also get that result with Python (see below).
So for each letter (?<letter>.)
we check:
(?!.*?\k<letter>)
(?<!\k<letter>.+?)
+
).The Python regex module also supports variable-length lookbehinds, so the regex above will work with a small syntactical change: you need to replace \k
with \g
(which is quite unfortunate as with this module \g
is a group backreference, whereas with PCRE it's a recursion).
The regex is:
(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)
And here's an example:
$ python Python 2.7.10 (default, Jun 1 2015, 18:05:38) [GCC 4.9.2] on cygwin Type "help", "copyright", "credits" or "license" for more information. >>> import regex >>> regex.search(r'(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)', 'tooth') <regex.Match object; span=(4, 5), match='h'>
Ok, now things start to get dirty: since PCRE doesn't support variable-length lookbehinds, we need to somehow remember whether a given letter was already encountered in the input or not.
Unfortunately, the regex engine doesn't provide random access memory support. The best we can get in terms of generic memory is a stack - but that's not sufficient for this purpose, as a stack only lets us access its topmost element.
If we accept to restrain ourselves to a given alphabet, we can abuse capturing groups for the purpose of storing flags. Let's see this on a limited alphabet of the three letters abc
:
# Anchor the pattern \A # For each letter, test to see if it's duplicated in the input string (?(?=[^a]*+a[^a]*a)(?<da>)) (?(?=[^b]*+b[^b]*b)(?<db>)) (?(?=[^c]*+c[^c]*c)(?<dc>)) # Skip any duplicated letter and throw it away [a-c]*?\K # Check if the next letter is a duplicate (?: (?(da)(*FAIL)|a) | (?(db)(*FAIL)|b) | (?(dc)(*FAIL)|c) )
Here's how that works:
\A
anchor ensures we'll process the input string only onceX
of our alphabet, we'll set up a is duplicate flag dX
: (?(cond)then|else)
is used there: (?=[^X]*+X[^X]*X)
which is true if the input string contains the letter X
twice.(?<dX>)
, which is an empty capturing group that will match the empty string.dX
group won't be matched[a-c]*?
\K
dX
flag is not set. For this purpose, we'll do a conditional branch: (?(dX)(*FAIL)|X)
dX
was matched (meaning that X
is a duplicated character), we (*FAIL)
, forcing the engine to backtrack and try a different letter.dX
was not matched, we try to match X
. At this point, if this succeeds, we know that X
is the first non-duplicated letter.That last part of the pattern could also be replaced with:
(?: a (*THEN) (?(da)(*FAIL)) | b (*THEN) (?(db)(*FAIL)) | c (*THEN) (?(dc)(*FAIL)) )
Which is somewhat more optimized. It matches the current letter first and only then checks if it's a duplicate.
The full pattern for the lowercase letters a-z
looks like this:
# Anchor the pattern \A # For each letter, test to see if it's duplicated in the input string (?(?=[^a]*+a[^a]*a)(?<da>)) (?(?=[^b]*+b[^b]*b)(?<db>)) (?(?=[^c]*+c[^c]*c)(?<dc>)) (?(?=[^d]*+d[^d]*d)(?<dd>)) (?(?=[^e]*+e[^e]*e)(?<de>)) (?(?=[^f]*+f[^f]*f)(?<df>)) (?(?=[^g]*+g[^g]*g)(?<dg>)) (?(?=[^h]*+h[^h]*h)(?<dh>)) (?(?=[^i]*+i[^i]*i)(?<di>)) (?(?=[^j]*+j[^j]*j)(?<dj>)) (?(?=[^k]*+k[^k]*k)(?<dk>)) (?(?=[^l]*+l[^l]*l)(?<dl>)) (?(?=[^m]*+m[^m]*m)(?<dm>)) (?(?=[^n]*+n[^n]*n)(?<dn>)) (?(?=[^o]*+o[^o]*o)(?<do>)) (?(?=[^p]*+p[^p]*p)(?<dp>)) (?(?=[^q]*+q[^q]*q)(?<dq>)) (?(?=[^r]*+r[^r]*r)(?<dr>)) (?(?=[^s]*+s[^s]*s)(?<ds>)) (?(?=[^t]*+t[^t]*t)(?<dt>)) (?(?=[^u]*+u[^u]*u)(?<du>)) (?(?=[^v]*+v[^v]*v)(?<dv>)) (?(?=[^w]*+w[^w]*w)(?<dw>)) (?(?=[^x]*+x[^x]*x)(?<dx>)) (?(?=[^y]*+y[^y]*y)(?<dy>)) (?(?=[^z]*+z[^z]*z)(?<dz>)) # Skip any duplicated letter and throw it away [a-z]*?\K # Check if the next letter is a duplicate (?: a (*THEN) (?(da)(*FAIL)) | b (*THEN) (?(db)(*FAIL)) | c (*THEN) (?(dc)(*FAIL)) | d (*THEN) (?(dd)(*FAIL)) | e (*THEN) (?(de)(*FAIL)) | f (*THEN) (?(df)(*FAIL)) | g (*THEN) (?(dg)(*FAIL)) | h (*THEN) (?(dh)(*FAIL)) | i (*THEN) (?(di)(*FAIL)) | j (*THEN) (?(dj)(*FAIL)) | k (*THEN) (?(dk)(*FAIL)) | l (*THEN) (?(dl)(*FAIL)) | m (*THEN) (?(dm)(*FAIL)) | n (*THEN) (?(dn)(*FAIL)) | o (*THEN) (?(do)(*FAIL)) | p (*THEN) (?(dp)(*FAIL)) | q (*THEN) (?(dq)(*FAIL)) | r (*THEN) (?(dr)(*FAIL)) | s (*THEN) (?(ds)(*FAIL)) | t (*THEN) (?(dt)(*FAIL)) | u (*THEN) (?(du)(*FAIL)) | v (*THEN) (?(dv)(*FAIL)) | w (*THEN) (?(dw)(*FAIL)) | x (*THEN) (?(dx)(*FAIL)) | y (*THEN) (?(dy)(*FAIL)) | z (*THEN) (?(dz)(*FAIL)) )
And here's the demo on regex101, complete with unit tests.
You can expand on this pattern if you need a larger alphabet, but obviously this is not a general-purpose solution. It's primarily of educational interest and should not be used for any serious application.
For other flavors, you may try to tweak the pattern to replace PCRE features with simpler equivalents:
\A
becomes ^
X (*THEN) (?(dX)(*FAIL))
can be replaced with (?(dX)(?!)|X)
\K
and replace the last noncapturnig group (?:
...)
with a named group like (?<letter>
...)
and treat its content as the result.The only required but somewhat unusual construct is the conditional group (?(cond)then|else)
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With