I had two regex tasks to do today: I did one properly and failed with the other. The first task was to find, in a long, long text, all the words beginning with "F" and ending with a vowel:
(\bf)\w*([euioay]\b)
And it worked perfectly.
The second one is way too difficult for a philology student ;-) I have to find all the words in which some two-letter sequence occurs at least twice, for example:
Can I have some help, please? Thanks in advance ;-)
A repeat is an expression that is repeated an arbitrary number of times. An expression followed by '*' can be repeated any number of times, including zero. An expression followed by '+' can be repeated any number of times, but at least once.
For example, the regular expression "[A-Za-z]" matches any single uppercase or lowercase letter. In a character set, a hyphen indicates a range of characters; for example, [A-Z] will match any one capital letter.
The plus ( + ) is a quantifier that matches one or more occurrences of the preceding element. The plus is similar to the asterisk ( * ) in that many occurrences are acceptable, but unlike the asterisk in that at least one occurrence is required.
The asterisk ( * ): The asterisk is known as a repeater symbol, meaning the preceding character can be found 0 or more times. For example, the regular expression ca*t will match the strings ct, cat, caat, caaat, etc.
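As a rough illustration of the quantifiers and the character set described above, here is a minimal Java sketch. The class name QuantifierDemo and the helper fullMatch are made up for this example; Pattern.matches is the standard java.util.regex call that tests whether the entire input matches the expression.

```java
import java.util.regex.Pattern;

public class QuantifierDemo {
    // Hypothetical helper: true when the WHOLE input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("ca*t", "ct"));     // true:  '*' allows zero 'a's
        System.out.println(fullMatch("ca*t", "caaat"));  // true:  '*' allows many 'a's
        System.out.println(fullMatch("ca+t", "ct"));     // false: '+' needs at least one 'a'
        System.out.println(fullMatch("ca+t", "cat"));    // true
        System.out.println(fullMatch("[A-Za-z]", "q"));  // true:  one letter from the ranges
        System.out.println(fullMatch("[A-Za-z]", "7"));  // false: a digit is outside both ranges
    }
}
```

The only difference between the first and third calls is the quantifier, which is why "ct" is accepted by ca*t but rejected by ca+t.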
Let's see:
(\w{2})
matches two letters (or digits/underscores, but let's ignore that) and captures them in group number 1. Then \1 matches whatever was matched by that group. So
\b\w*(\w{2})\w*\1
is what you're looking for. You don't need {2,} because if three letters are repeated, two letters are also repeated, and not checking for more than two makes the regex much more efficient. You can also stop matching after the \1 backreference has succeeded.
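If you want to try this pattern out, here is a small Java sketch. The class name RepeatedPairDemo, the method repeatedPairs, and the sample text are assumptions made for illustration only. Note that each match ends at the backreference, so matcher.group() would give you only the part of the word up to the second occurrence, not the whole word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedPairDemo {
    // The pattern from the answer: group 1 is the two-character
    // sequence that shows up again later in the same word.
    private static final Pattern REPEATED_PAIR =
            Pattern.compile("\\b\\w*(\\w{2})\\w*\\1");

    // Hypothetical helper: returns the captured pair for every word that has one.
    static List<String> repeatedPairs(String text) {
        List<String> pairs = new ArrayList<>();
        Matcher matcher = REPEATED_PAIR.matcher(text);
        while (matcher.find()) {
            pairs.add(matcher.group(1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // "word" has no repeated pair, so only two entries are collected.
        System.out.println(repeatedPairs("tatarak brzozowski word")); // [ta, zo]
    }
}
```

Because \w* is greedy here, a word with several different repeated pairs may report a different pair than a lazy variant of the pattern would; for words with a single repeated pair the result is the same.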
This pattern ought to do the trick
\b\w*?(\w{2})\w*?\1\w*?\b
\b is a word boundary
\w*? is some number of letters (lazily)
(\w{2}) is exactly two letters, match and capture
\w*? same as above
\1 is the content of our two-letter capture group
\w*? same as above
\b is another word boundary

A quick test in Java:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RepeatedPairTest {
    public static void main(String[] args) {
        // Lazy quantifiers; group 1 captures the repeated two-letter sequence
        final Pattern pattern = Pattern.compile("\\b\\w*?(\\w{2})\\w*?\\1\\w*?\\b");
        final String string = "tatarak brzozowski loremipsrecdks a word that does not match";
        final Matcher matcher = pattern.matcher(string);
        while (matcher.find()) {
            System.out.println("Found group " + matcher.group(1) + " in word " + matcher.group());
        }
    }
}
Output
Found group ta in word tatarak
Found group zo in word brzozowski
Found group re in word loremipsrecdks