
Regex: finding words that end with the same letter the next word begins with

Tags: c#, regex

I tried to get a regex to work but couldn't (probably because I'm fairly new to regex).

Here's what I want to do:

Consider this text: One word, duel. Limes said bye.

Wanted matches: word, duel, Limes, said

As mentioned in the title, I want to match consecutive words where one ends with (for example) "t" and the next one starts with "t" as well, case insensitive.

The closest I got is this expression: [^a-z][a-z]*([a-z])[^a-z]+\1[a-z]*([a-z])[^a-z]+\2[a-z]*[^a-z]

nisser asked Dec 11 '25 15:12


2 Answers

You may use

(?i)\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b

The matched words are in the capture collection of Group "w".

Details

  • \b - a word boundary
  • (?<w>\p{L}+) - Group "w" (word): 1 or more BMP Unicode letters
  • (?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+ - 1 or more repetitions of
    • \P{L}+ - 1 or more chars other than BMP Unicode letters
    • (?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*) - Group "w":
      • (\p{L}) - a letter captured into Group 1
      • (?<=\1\P{L}+\1) - immediately to the left of the current position, there must be the same letter as captured in Group 1, 1+ chars other than letters, and the letter in Group 1
      • \p{L}* - 0 or more letters
  • \b - a word boundary.


C# code demo:

using System;
using System.Linq;
using System.Text.RegularExpressions;

var text = "One word, duel. Limes said bye.";
var pattern = @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";
// Regex.Match never returns null, so the null-conditional operator is not needed.
var result = Regex.Match(text, pattern, RegexOptions.IgnoreCase).Groups["w"].Captures
        .Cast<Capture>()
        .Select(x => x.Value);
Console.WriteLine(string.Join(", ", result)); // => word, duel, Limes, said

A C# demo version without using LINQ:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

string text = "One word, duel. Limes said bye.";
string pattern = @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";
Match result = Regex.Match(text, pattern, RegexOptions.IgnoreCase);
List<string> output = new List<string>();
if (result.Success)
{
    // Each word of the chain is one capture of group "w".
    foreach (Capture c in result.Groups["w"].Captures)
        output.Add(c.Value);
}
Console.WriteLine(string.Join(", ", output)); // => word, duel, Limes, said
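
Both demos above call Regex.Match, which returns only the first chain of words. If the input may contain several separate chains, one possible approach is to loop over Regex.Matches instead. The following is only a sketch under that assumption: the class name WordChains, the method name FindChains, and the extra sentence "Red dog." are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordChains
{
    // Same pattern as in the demos above.
    private const string Pattern =
        @"\b(?<w>\p{L}+)(?:\P{L}+(?<w>(\p{L})(?<=\1\P{L}+\1)\p{L}*))+\b";

    // Hypothetical helper: returns one list of words per chain found in the text.
    public static List<List<string>> FindChains(string text)
    {
        var chains = new List<List<string>>();
        foreach (Match m in Regex.Matches(text, Pattern, RegexOptions.IgnoreCase))
        {
            // Every word of a chain is one capture of group "w".
            chains.Add(m.Groups["w"].Captures
                .Cast<Capture>()
                .Select(c => c.Value)
                .ToList());
        }
        return chains;
    }

    public static void Main()
    {
        // "Red dog." is an extra sentence added here to produce a second chain.
        foreach (var chain in FindChains("One word, duel. Limes said bye. Red dog."))
            Console.WriteLine(string.Join(", ", chain));
    }
}
```

Each match is one chain, so the words of unrelated chains stay in separate lists rather than being merged into a single sequence.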

Wiktor Stribiżew answered Dec 13 '25 06:12


If a word consists of at least 2 characters a-z, you might use 2 capturing groups followed by an alternation of a positive lookahead and a positive lookbehind. The lookahead checks whether the next word starts with the last char of the current word; the lookbehind checks whether the previous word ended with the first char of the current word.

With case insensitive match enabled:

\b([a-z])[a-z]*([a-z])\b(?:(?=[,.]? \2)|(?<=\1 \1[a-z]+))
  • \b - Word boundary
  • ([a-z]) - Capture group 1, match a-z (the first char of the word)
  • [a-z]* - Match 0+ times a-z in between
  • ([a-z]) - Capture group 2, match a-z (the last char of the word)
  • \b - Word boundary
  • (?: - Non capturing group
    • (?= - Positive lookahead, assert what is on the right is
      • [,.]? \2 - an optional . or , followed by a space and what is captured in group 2
    • ) - Close lookahead
    • | - Or
    • (?<= - Positive lookbehind, assert what is on the left is
      • \1 \1[a-z]+ - what is captured in group 1, a space, what is captured in group 1 again (the start of the current word), and 1+ times a char a-z
    • ) - Close lookbehind
  • ) - Close non capturing group
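
As a rough illustration of how this pattern could be used from C#, here is a minimal sketch. The class name AdjacentWords and the method name FindWords are hypothetical, and the sample text is the one from the question. Unlike the capture-collection approach, every word here is a separate match.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AdjacentWords
{
    // The pattern from this answer; case insensitivity is enabled via RegexOptions.
    private const string Pattern =
        @"\b([a-z])[a-z]*([a-z])\b(?:(?=[,.]? \2)|(?<=\1 \1[a-z]+))";

    // Hypothetical helper: returns every word that takes part in a chain.
    public static string[] FindWords(string text)
    {
        return Regex.Matches(text, Pattern, RegexOptions.IgnoreCase)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", FindWords("One word, duel. Limes said bye.")));
        // => word, duel, Limes, said
    }
}
```

Note that the lookbehind as written only allows a single space between the two words, so the last word of a chain is not matched when it is preceded by a comma or a dot. For example, in "word, duel bye" only "word" is matched.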


Note that [a-z], even with case insensitive matching, is a narrow range for a word. You might use \w (which also matches digits and underscores) or \p{L} instead.


The fourth bird answered Dec 13 '25 06:12


