Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular Expression Matching First Non-Repeated Character

TL;DR

re.search("(.)(?!.*\1)", text).group() doesn't match the first non-repeating character contained in text (it always returns a character at or before the first non-repeated character, or before the end of the string if there are no non-repeated characters. My understanding is that re.search() should return None if there were no matches). I'm only interested in understanding why this regex is not working as intended using the Python re module, not in any other method of solving the problem

Full Background

The problem description comes from https://www.codeeval.com/open_challenges/12/. I've already solved this problem using a non-regex method, but revisited it to expand my understanding of Python's re module. The regular expressions i thought would work (named vs unnamed backreferences) are:

(?P<letter>.)(?!.*(?P=letter)) and (.)(?!.*\1) (same results in python2 and python3)

My entire program looks like this

import re import sys with open(sys.argv[1], 'r') as test_cases:     for test in test_cases:         print(re.search("(?P<letter>.)(?!.*(?P=letter))",                         test.strip()                        ).group()              ) 

and some input/output pairs are:

rain | r teetthing | e cardiff | c kangaroo | k god | g newtown | e taxation | x refurbished | f substantially | u 

According to what I've read at https://docs.python.org/2/library/re.html:

  • (.) creates a named group that matches any character and allows later backreferences to it as \1.
  • (?!...) is a negative lookahead which restricts matches to cases where ... does not match.
  • .*\1 means any number (including zero) of characters followed by whatever was matched by (.) earlier
  • re.search(pattern, string) returns only the first location where the regex pattern produces a match (and would return None if no match could be found)
  • .group() is equivalent to .group(0) which returns the entire match

I think these pieces together should solve the stated problem, and it does work like I think it should for most inputs, but failed on teething. Throwing similar problems at it reveals that it seems to ignore repeated characters if they are consecutive:

tooth | o      # fails on consecutive repeated characters aardvark | d   # but does ok if it sees them later aah | a        # verified last one didn't work just because it was at start heh | e        # but it works for this one hehe | h       # What? It thinks h matches (lookahead maybe doesn't find "heh"?) heho | e       # but it definitely finds "heh" and stops "h" from matching here hahah | a      # so now it won't match h but will match a hahxyz | a     # but it realizes there are 2 h characters here... hahxyza | h    # ... Ok time for StackOverflow 

I know lookbehind and negative lookbehind are limited to 3-character-max fixed length strings, and cannot contain backreferences even if they evaluate to a fixed length string, but I didn't see the documentation specify any restrictions on negative lookahead.

like image 590
stevenjackson121 Avatar asked Sep 30 '16 17:09

stevenjackson121


People also ask

How do you find the first non repeated character in the string?

Using the indexOf() and lastIndexOf() method, we can find the first non-repeating character in a string in Java. The method indexOf() returns the position of the first occurrence of a given character in a string whereas method lastIndexOf() returns the position of the last occurrence of a given character in a string.


2 Answers

Well let's take your tooth example - here is what the regex-engine does (a lot simplified for better understanding)

Start with t then look ahead in the string - and fail the lookahead, as there is another t.

tooth ^  ° 

Next take o, look ahead in the string - and fail, as there is another o.

tooth  ^° 

Next take the second o, look ahead in the string - no other o present - match it, return it, work done.

tooth   ^ 

So your regex doesn't match the first unrepeated character, but the first one, that has no further repetitions towards the end of the string.

like image 179
Sebastian Proske Avatar answered Sep 21 '22 22:09

Sebastian Proske


Sebastian's answer already explains pretty well why your current attempt doesn't work.

.NET

Since you're revo is interested in a .NET flavor workaround, the solution becomes trivial:

(?<letter>.)(?!.*?\k<letter>)(?<!\k<letter>.+?) 

Demo link

This works because .NET supports variable-length lookbehinds. You can also get that result with Python (see below).

So for each letter (?<letter>.) we check:

  • if it's repeated further in the input (?!.*?\k<letter>)
  • if it was already encountered before (?<!\k<letter>.+?)
    (we have to skip the letter we're testing when going backwards, hence the +).

Python

The Python regex module also supports variable-length lookbehinds, so the regex above will work with a small syntactical change: you need to replace \k with \g (which is quite unfortunate as with this module \g is a group backreference, whereas with PCRE it's a recursion).

The regex is:

(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?) 

And here's an example:

$ python Python 2.7.10 (default, Jun  1 2015, 18:05:38) [GCC 4.9.2] on cygwin Type "help", "copyright", "credits" or "license" for more information. >>> import regex >>> regex.search(r'(?<letter>.)(?!.*?\g<letter>)(?<!\g<letter>.+?)', 'tooth') <regex.Match object; span=(4, 5), match='h'> 

PCRE

Ok, now things start to get dirty: since PCRE doesn't support variable-length lookbehinds, we need to somehow remember whether a given letter was already encountered in the input or not.

Unfortunately, the regex engine doesn't provide random access memory support. The best we can get in terms of generic memory is a stack - but that's not sufficient for this purpose, as a stack only lets us access its topmost element.

If we accept to restrain ourselves to a given alphabet, we can abuse capturing groups for the purpose of storing flags. Let's see this on a limited alphabet of the three letters abc:

# Anchor the pattern \A  # For each letter, test to see if it's duplicated in the input string (?(?=[^a]*+a[^a]*a)(?<da>)) (?(?=[^b]*+b[^b]*b)(?<db>)) (?(?=[^c]*+c[^c]*c)(?<dc>))  # Skip any duplicated letter and throw it away [a-c]*?\K  # Check if the next letter is a duplicate (?:   (?(da)(*FAIL)|a) | (?(db)(*FAIL)|b) | (?(dc)(*FAIL)|c) ) 

Here's how that works:

  • First, the \A anchor ensures we'll process the input string only once
  • Then, for each letter X of our alphabet, we'll set up a is duplicate flag dX:
    • The conditional pattern (?(cond)then|else) is used there:
      • The condition is (?=[^X]*+X[^X]*X) which is true if the input string contains the letter X twice.
      • If the condition is true, the then clause is (?<dX>), which is an empty capturing group that will match the empty string.
      • If the condition is false, the dX group won't be matched
    • Next, we lazily skip valid letters from our alphabet: [a-c]*?
    • And we throw them out in the final match with \K
    • Now, we're trying to match one letter whose dX flag is not set. For this purpose, we'll do a conditional branch: (?(dX)(*FAIL)|X)
      • If dX was matched (meaning that X is a duplicated character), we (*FAIL), forcing the engine to backtrack and try a different letter.
      • If dX was not matched, we try to match X. At this point, if this succeeds, we know that X is the first non-duplicated letter.

That last part of the pattern could also be replaced with:

(?:   a (*THEN) (?(da)(*FAIL)) | b (*THEN) (?(db)(*FAIL)) | c (*THEN) (?(dc)(*FAIL)) ) 

Which is somewhat more optimized. It matches the current letter first and only then checks if it's a duplicate.

The full pattern for the lowercase letters a-z looks like this:

# Anchor the pattern \A  # For each letter, test to see if it's duplicated in the input string (?(?=[^a]*+a[^a]*a)(?<da>)) (?(?=[^b]*+b[^b]*b)(?<db>)) (?(?=[^c]*+c[^c]*c)(?<dc>)) (?(?=[^d]*+d[^d]*d)(?<dd>)) (?(?=[^e]*+e[^e]*e)(?<de>)) (?(?=[^f]*+f[^f]*f)(?<df>)) (?(?=[^g]*+g[^g]*g)(?<dg>)) (?(?=[^h]*+h[^h]*h)(?<dh>)) (?(?=[^i]*+i[^i]*i)(?<di>)) (?(?=[^j]*+j[^j]*j)(?<dj>)) (?(?=[^k]*+k[^k]*k)(?<dk>)) (?(?=[^l]*+l[^l]*l)(?<dl>)) (?(?=[^m]*+m[^m]*m)(?<dm>)) (?(?=[^n]*+n[^n]*n)(?<dn>)) (?(?=[^o]*+o[^o]*o)(?<do>)) (?(?=[^p]*+p[^p]*p)(?<dp>)) (?(?=[^q]*+q[^q]*q)(?<dq>)) (?(?=[^r]*+r[^r]*r)(?<dr>)) (?(?=[^s]*+s[^s]*s)(?<ds>)) (?(?=[^t]*+t[^t]*t)(?<dt>)) (?(?=[^u]*+u[^u]*u)(?<du>)) (?(?=[^v]*+v[^v]*v)(?<dv>)) (?(?=[^w]*+w[^w]*w)(?<dw>)) (?(?=[^x]*+x[^x]*x)(?<dx>)) (?(?=[^y]*+y[^y]*y)(?<dy>)) (?(?=[^z]*+z[^z]*z)(?<dz>))  # Skip any duplicated letter and throw it away [a-z]*?\K  # Check if the next letter is a duplicate (?:   a (*THEN) (?(da)(*FAIL)) | b (*THEN) (?(db)(*FAIL)) | c (*THEN) (?(dc)(*FAIL)) | d (*THEN) (?(dd)(*FAIL)) | e (*THEN) (?(de)(*FAIL)) | f (*THEN) (?(df)(*FAIL)) | g (*THEN) (?(dg)(*FAIL)) | h (*THEN) (?(dh)(*FAIL)) | i (*THEN) (?(di)(*FAIL)) | j (*THEN) (?(dj)(*FAIL)) | k (*THEN) (?(dk)(*FAIL)) | l (*THEN) (?(dl)(*FAIL)) | m (*THEN) (?(dm)(*FAIL)) | n (*THEN) (?(dn)(*FAIL)) | o (*THEN) (?(do)(*FAIL)) | p (*THEN) (?(dp)(*FAIL)) | q (*THEN) (?(dq)(*FAIL)) | r (*THEN) (?(dr)(*FAIL)) | s (*THEN) (?(ds)(*FAIL)) | t (*THEN) (?(dt)(*FAIL)) | u (*THEN) (?(du)(*FAIL)) | v (*THEN) (?(dv)(*FAIL)) | w (*THEN) (?(dw)(*FAIL)) | x (*THEN) (?(dx)(*FAIL)) | y (*THEN) (?(dy)(*FAIL)) | z (*THEN) (?(dz)(*FAIL)) ) 

And here's the demo on regex101, complete with unit tests.

You can expand on this pattern if you need a larger alphabet, but obviously this is not a general-purpose solution. It's primarily of educational interest and should not be used for any serious application.


For other flavors, you may try to tweak the pattern to replace PCRE features with simpler equivalents:

  • \A becomes ^
  • X (*THEN) (?(dX)(*FAIL)) can be replaced with (?(dX)(?!)|X)
  • You may throw away the \K and replace the last noncapturnig group (?:...) with a named group like (?<letter>...) and treat its content as the result.

The only required but somewhat unusual construct is the conditional group (?(cond)then|else).

like image 20
Lucas Trzesniewski Avatar answered Sep 22 '22 22:09

Lucas Trzesniewski