Get a match when there are duplicate letters in a string

Question

I have a list of inputs in google sheets,

Input	Desired Output	"To demonstrate only not an input" The repeated letters
Outdoors	Match	o
dog	No Match
step	No Match
bee	Match	e
Chessboard	Match	s
Cookbooks	Match	o, k

How do I verify if all letters are unique in a string without splitting it?

In other words if the string has one letter or more occurred twice or more, return TRUE

My process so far

I tried this solution in addition to splitting the string and dividing the length of the string on the COUNTA of unique letters of the string, if = 1 "Match", else "No match"

Or using regex
I found a method to match a letter is occure in a string 2 times this demonstration with REGEXEXTRACT But wait what needed is get TRUE when the letters are not unique in the string

=REGEXEXTRACT(A1,"o{2}?")

Returns:

oo

Something like this would do

=REGEXMATCH(Input,"(anyletter){2}?")

OR like this

=REGEXMATCH(lower(A6),"[a-zA-Z]{2}?")

Notes

The third column, "Column C," is only for demonstration and not for input.
The match is case insensitive
The string doesn't need to be splitted to aviod heavy calculation "I have long lists"
Avoid using lambda and its helper functions see why?
Its ok to return TRUE or FALSE instead of Match or No Match to keep it simple.

More examples

Input	Desired Output
Professionally	Match
Attractiveness	Match
Uncontrollably	Match
disreputably	No Match
Recommendation	Match
Interrogations	Match
Aggressiveness	Match
doublethinks	No Match

JvdV · Accepted Answer

You are explicitly asking for an answer using a single regular expression. Unfortunately there is no such thing as a backreference to a former capture group using RE2. So if you'd spell out the answer to your problem it would look like:

=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)")))

Since you are looking for case-insensitive matching (?i) modifier will help to cut down the options to just the 26 letters of the alphabet. I suppose the above can be written a bit neater like:

=INDEX(IF(A2:A="","",REGEXMATCH(A2:A,"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")")))

EDIT 1:

The only other reasonable way to do this (untill I learned about the PREG supported syntax of the matches clause in QUERY() by @DoubleUnary) with a single regex other than the above is to create your own UDF in GAS (AFAIK). It's going to be JavaScript based thus supporting a backreferences. GAS is not my forte, but a simple example could be:

function REGEXMATCH_JS(s) {
  if (s.map) {
    return s.map(REGEXMATCH_JS);
  } else {
    return /([a-z]).*?\1/gi.test(s);
  }
}

The pattern ([a-z]).*?\1 means:

([a-z]) - Capture a single character in range a-z;
.*?\1 - Look for 0+ (lazy) characters up to a copy of this 1st captured character with a backreference.

The match is global and case-insensitive. You can now call:

=INDEX(IF(A2:A="","",REGEXMATCH_JS(A2:A)))

EDIT 2:

For those that are benchmarking speed, I am not testing this myself but maybe this would speed things up:

=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:a.*a|b.*b|c.*c|d.*d|e.*e|f.*f|g.*g|h.*h|i.*i|j.*j|k.*k|l.*l|m.*m|n.*n|o.*o|p.*p|q.*q|r.*r|s.*s|t.*t|u.*u|v.*v|w.*w|x.*x|y.*y|z.*z)"))

Or:

=INDEX(REGEXMATCH(A2:INDEX(A:A,COUNTA(A:A)),"(?i)(?:"&TEXTJOIN("|",1,REPLACE(REPT(CHAR(SEQUENCE(26,1,65)),2),2,0,".*"))&")"))

Or:

=REGEXMATCH_JS(A2:INDEX(A:A,COUNTA(A:A)))

Respectively. Knowing there is a header in 1st row.

TheMaster · Answer

Benchmark:

Created a benchmark here.

Methodology:

Use NOW() to create a timestamp, when checkbox is clicked.
Use NOW() to create another timestamp, when the last row is filled and the checkbox is on.
The difference between those two timestamps gives time taken for the formula to complete.
The sample is a random data created from Math.random between [A-Za-z] with 10 characters per word.

Results:

Formula	Round1	Round2	Avg	% Slower than best
Sample size	10006
[re2](a.a\|b.b)JvDv	0:00:19	0:00:19	0:00:19	-15.15%
[re2+recursion]MASTERMATCH_RE2	0:00:27	0:00:24	0:00:26	-54.55%
[Find+recursion]MASTERMATCH	0:00:17	0:00:16	0:00:17	0.00%
[PREG]Doubleunary	0:00:57	0:00:53	0:00:55	-233.33%

Conclusion:

This varies greatly based on browser/device/mobile app and on non-randomized sample data. But I found PREG to be consistently slower than re2

Use recursion.

This seems extremely faster than the regex based approach. Create a named function:

Name:

MASTERMATCH

Arguments(in this order):

word

The word to check

start

Starting at

Function:

=IF(
  MID(word,start,1)="",
  FALSE,
  IF(
    ISERROR(FIND(MID(word,start,1),word,start+1)),
    MASTERMATCH(word,start+1),
    TRUE
  )
)

Usage:

=ARRAYFORMULA(MASTERMATCH(A2:INDEX(A2:A,COUNTA(A2:A)),1))

Or without case sensitivity

=ARRAYFORMULA(MASTERMATCH(lower(A2:A),1))

Explanation:

It recurses through each character using MID and checks whether the same character is available after this position using FIND. If so, returns true and doesn't check anymore. If not, keeps checking until the last character using recursion.

Or with regex, Create a named function:

Name:

MASTERMATCH_RE2

Arguments(in this order):

word

The word to check

start

Starting at

Function:

IF(
  MID(word,start,1)="",
  FALSE,
  IF(
    REGEXMATCH(word,MID(word, start, 1)&"(?i).*"&MID(word,start,1)),
    TRUE,
    MASTERMATCH_RE2(word,start+1)
  )
)

Usage:

=ARRAYFORMULA(MASTERMATCH_RE2(A2:A,1))

Or

=ARRAYFORMULA(MASTERMATCH_RE2(lower(A2:A),1))

Explanation:

It recurses through each character and creates a regex for that character. Instead of a.*a, b.*b,..., it takes the first character(using MID), eg: o in outdoor and creates a regex o.*o. If regex is positive for that regex (using REGEXMATCH), returns true and doesn't check for other letters or create other regexes.

Uses lambda, but it's efficient. Loop through each row and every character with MAP and REDUCE. REPLACE each character in the word and find the difference in length. If more than 1, don't check length anymore and return Match

=MAP(
  A2:INDEX(A2:A,COUNTA(A2:A)),
  LAMBDA(_,
    REDUCE(
      "No Match",
      SEQUENCE(LEN(_)),
      LAMBDA(a,c,
        IF(a="Match",a,
          IF(
            LEN(_)-LEN(
              REGEXREPLACE(_,"(?i)"&MID(_,c,1),)
            )>1,
            "Match",a
          )
        )
      )
    )
  )
)

If you do run into lambda limitations, remove the MAP and drag fill the REDUCE formula.

=REDUCE("No Match",SEQUENCE(LEN(A2)),LAMBDA(a,c,IF(a="Match",a,IF(LEN(A2)-LEN(REGEXREPLACE(A2, "(?i)"&MID(A2,c,1),))>1,"Match",a))))

The latter is preferred for conditional formatting as well.

Get a match when there are duplicate letters in a string

Tags:

regex

google-sheets

google-sheets-formula

re2

Osm

2 Answers

JvdV

Benchmark:

Methodology:

Results:

Conclusion:

Name:

Arguments(in this order):

Function:

Usage:

Explanation:

Name:

Arguments(in this order):

Function:

Usage:

Explanation:

TheMaster

Recent Activity

Donate For Us

Get a match when there are duplicate letters in a string

Tags:

regex

google-sheets

google-sheets-formula

re2

Osm

2 Answers

JvdV

Benchmark:

Methodology:

Results:

Conclusion:

Name:

Arguments(in this order):

Function:

Usage:

Explanation:

Name:

Arguments(in this order):

Function:

Usage:

Explanation:

TheMaster

Related questions

Recent Activity

Donate For Us