
RegEx: words with two letters repeated twice (eg. ABpoiuyAB, xnvXYlsdjsdXYmsd)

Tags:

regex

I had two regex tasks to do today. I did one properly and failed with the other. The first task was to find, in a long, long text, all the words beginning with "F" and ending with a vowel:

(\bf)\w*([euioay]\b)

and it worked perfectly.

The second one is way too difficult for a philology student ;-) I have to find all the words in which some two-letter sequence occurs at least twice, for example:

  • tatarak is TATArak, "TA" twice;
  • brzozowski is brZOZOwski, "ZO" twice;
  • loremipsrecdks is loREmipsREcdks, "RE" twice;

Can I have some help, please? Thanks in advance ;-)

user2204488 asked Mar 24 '13 15:03



2 Answers

Let's see:

(\w{2}) matches two letters (or digits/underscore, but let's ignore that) and captures them in group number 1. Then \1 matches whatever was matched by that group. So

\b\w*(\w{2})\w*\1

is what you're looking for. You don't need {2,}: if three letters are repeated, two letters are repeated as well. Not checking for more than two makes the regex much more efficient, and you can stop matching as soon as the \1 backreference has succeeded.
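As an illustration (not part of the original answer), here is a minimal Java sketch of what "stop matching after the backreference" means in practice. The pattern as posted ends at the second occurrence of the pair, so the reported match is only the front part of the word. Appending a trailing \w* is an assumption of this sketch, added to make the match cover the whole word. The class name RepeatedPairDemo and the helper findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedPairDemo {
    // The pattern exactly as posted: the match ends right after the backreference.
    static final Pattern AS_POSTED = Pattern.compile("\\b\\w*(\\w{2})\\w*\\1");

    // Assumption of this sketch: a trailing \w* extends the match to the end of the word.
    static final Pattern WHOLE_WORD = Pattern.compile("\\b\\w*(\\w{2})\\w*\\1\\w*");

    // Collects the full text of every match of the given pattern.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "tatarak brzozowski loremipsrecdks a word that does not match";
        System.out.println(findAll(AS_POSTED, text));  // [tata, brzozo, loremipsre]
        System.out.println(findAll(WHOLE_WORD, text)); // [tatarak, brzozowski, loremipsrecdks]
    }
}
```

If you only need to know whether a word qualifies, the shorter pattern is enough. The trailing \w* matters only when you want the matched text itself to be the complete word.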

Tim Pietzcker answered Sep 22 '22 10:09


This pattern ought to do the trick

\b\w*?(\w{2})\w*?\1\w*?\b
  • \b is a word boundary
  • \w*? any number of word characters (letters, digits, underscore), matched lazily
  • (\w{2}) exactly two word characters, matched and captured
  • \w*? same as above
  • \1 the content of our two-character capture group
  • \w*? same as above
  • \b another word boundary

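A quick test in Java:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedPairTest {
   public static void main(String[] args) {
      // Same pattern as above; backslashes are doubled inside the Java string literal.
      final Pattern pattern = Pattern.compile("\\b\\w*?(\\w{2})\\w*?\\1\\w*?\\b");
      final String string = "tatarak brzozowski loremipsrecdks a word that does not match";
      final Matcher matcher = pattern.matcher(string);
      // group(1) is the repeated pair, group() is the whole matched word.
      while (matcher.find()) {
         System.out.println("Found group " + matcher.group(1) + " in word " + matcher.group());
      }
   }
}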

Output

Found group ta in word tatarak
Found group zo in word brzozowski
Found group re in word loremipsrecdks
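
Since \w also accepts digits and underscores, a stricter variant may be useful when only real letters should count. The following is a sketch under stated assumptions, not part of the original answer: it swaps \w for [a-z], so it covers only unaccented Latin letters, and it adds Pattern.CASE_INSENSITIVE so that a capitalised "Ta" can be repeated as "ta". The class name LettersOnlyRepeatedPair and the helper find are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LettersOnlyRepeatedPair {
    // Same shape as the pattern above, with [a-z] in place of \w.
    // CASE_INSENSITIVE also applies to the \1 backreference.
    static final Pattern LETTERS_ONLY = Pattern.compile(
            "\\b[a-z]*?([a-z]{2})[a-z]*?\\1[a-z]*\\b", Pattern.CASE_INSENSITIVE);

    // Returns every word that consists of letters only and repeats a two-letter pair.
    static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = LETTERS_ONLY.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // "ab1ab" is skipped because the digit interrupts the run of letters.
        System.out.println(find("Tatarak ab1ab brzozowski word")); // [Tatarak, brzozowski]
    }
}
```

Words with accented or non-Latin letters would need a different character class; that is outside what this sketch covers.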

Boris the Spider answered Sep 24 '22 10:09