REGEX

Question

How to match any character which repeats n times?

Example:

for input: abcdbcdcdd
for n=1:   ..........
for n=2:    .........
for n=3:     .. .....
for n=4:      .  . ..
for n=5:   no matches

After several hours my best is this expression

(\w)(?=(?:.*\1){n-1,}) //where n is variable

which uses lookahead. However the problem with this expression is this:

for input: abcdbcdcdd
for n=1    .......... 
for n=2     ... .. .
for n=3      ..  .
for n=4       .
for n=5    no matches

As you can see, when lookahead matches for a character, let's look for n=4 line, d's lookahead assertion satisfied and first d matched by regex. But remaining d's are not matched because they don't have 3 more d's ahead of them.

I hope I stated the problem clearly. Hoping for your solutions, thanks in advance.

Mariano · Accepted Answer

let's look for n=4 line, d's lookahead assertion satisfied and first d matched by regex. But remaining d's are not matched because they don't have 3 more d's ahead of them.

And obviously, without regex, this is a very simple string manipulation problem. I'm trying to do this with and only with regex.

As with any regex implementation, the answer depends on the regex flavour. You could create a solution with .net regex engine, because it allows variable width lookbehinds.

Also, I'll provide a more generalized solution below for perl-compatible/like regex flavours.

.net Solution

As @PetSerAl pointed out in his answer, with variable width lookbehinds, we can assert back to the beggining of the string, and check there are n occurrences.
ideone demo

regex module in Python
You can implement this solution in python, using the regex module by Matthew Barnett, which also allows variable-width lookbehinds.

>>> import regex
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){2})\A.*)', 'abcdbcdcdd')
['b', 'c', 'd', 'b', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){3})\A.*)', 'abcdbcdcdd')
['c', 'd', 'c', 'd', 'c', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){4})\A.*)', 'abcdbcdcdd')
['d', 'd', 'd', 'd']
>>> regex.findall( r'(\w)(?<=(?=(?>.*?\1){5})\A.*)', 'abcdbcdcdd')
[]

Generalized Solution

In pcre or any of the "perl-like" flavours, there is no solution that would actually return a match for every repeated character, but we could create one, and only one, capture for each character.

Strategy

For any given n, the logic involves:

Early matches: Match and capture every character followed by at least n more occurences.
Final captures:
- Match and capture a character followed by exactly n-1 occurences, and
- also capture every one of the following occurrences.

Example

for n = 3
input = abcdbcdcdd

The character c is Matched only once (as final), and the following 2 occurrences are also Captured in the same match:

abcdbcdcdd
  M  C C

and the character d is (early) Matched once:

abcdbcdcdd
   M

and (finally) Matched one more time, Capturing the rest:

abcdbcdcdd
      M CC

/(\w)                        # match 1 character
(?:
    (?=(?:.*?\1){≪N≫})     # [1] followed by other ≪N≫ occurrences
  |                          #   OR
    (?=                      # [2] followed by:
        (?:(?!\1).)*(\1)     #      2nd occurence <captured>
        (?:(?!\1).)*(\1)     #      3rd occurence <captured>
        ≪repeat previous≫  #      repeat subpattern (n-1) times
                             #     *exactly (n-1) times*
        (?!.*?\1)            #     not followed by another occurence
    )
)/xg

For n =

/(\w)(?:(?=(?:.*?\1){2})|(?=(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){3})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
/(\w)(?:(?=(?:.*?\1){4})|(?=(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?:(?!\1).)*(\1)(?!.*?\1)))/g
demo
... etc.

Pseudocode to generate the pattern

// Variables: N (int)

character = "(\w)"
early_match = "(?=(?:.*?\1){" + N + "})"

final_match = "(?="
for i = 1; i < N; i++
    final_match += "(?:(?!\1).)*(\1)"
final_match += "(?!.*?\1))"

pattern = character + "(?:" + early_match + "|" + final_match + ")"

JavaScript Code

I'll show an implementation using javascript because we can check the result here (and if it works in javascript, it works in any perl-compatible regex flavour, including .net, java, python, ruby, perl, and all languages that implemented pcre, among others).

var str = 'abcdbcdcdd';
var pattern, re, match, N, i;
var output = "";

// We'll show the results for N = 2, 3 and 4
for (N = 2; N <= 4; N++) {
    // Generate pattern
    pattern = "(\w)(?:(?=(?:.*?\1){" + N + "})|(?=";
    for (i = 1; i < N; i++) {
        pattern += "(?:(?!\1).)*(\1)";
    }
    pattern += "(?!.*?\1)))";
    re = new RegExp(pattern, "g");
    
    output += "<h3>N = " + N + "</h3><pre>Pattern: " + pattern + "
Text: " + str;
    
    // Loop all matches
    while ((match = re.exec(str)) !== null) {
        output += "
Pos: " + match.index + "	Match:";
        // Loop all captures
        x = 1;
        while (match[x] != null) {
            output += " " + match[x];
            x++;
        }
    }
    
    output += "</pre>";
}

document.write(output);

Python3 code

As requested by the OP, I'm linking to a Python3 implementation in ideone.com

REGEX - Matching any character which repeats n times

Tags:

regex

lookahead

repeat

ferit

1 Answers

.net Solution

Generalized Solution

Strategy

Example

Regex

Pseudocode to generate the pattern

JavaScript Code

Python3 code

Mariano

Recent Activity

Donate For Us

REGEX - Matching any character which repeats n times

Tags:

regex

lookahead

repeat

ferit

1 Answers

.net Solution

Generalized Solution

Strategy

Example

Regex

Pseudocode to generate the pattern

JavaScript Code

Python3 code

Mariano

Related questions

Recent Activity

Donate For Us