
Regex to find all possible occurrences of text starting and ending with ~

Tags: java, regex
I would like to find all possible occurrences of text enclosed between two ~s.

For example, for the text ~*_abc~xyz~ ~123~, I want the following matches:

  1. ~*_abc~
  2. ~xyz~
  3. ~123~

Note that the enclosed text can consist of letters or digits.

I tried the regex ~[\w]+?~, but it does not give me ~xyz~. I want the closing ~ of one match to be reconsidered as the opening ~ of the next. However, I don't want a bare ~~ to count as a match.

AbhishekAsh asked Mar 31 '16 10:03


1 Answer

Use a capturing group inside a positive lookahead. The general technique works like this:

Sometimes, you need several matches within the same word. For instance, suppose that from a string such as ABCD you want to extract ABCD, BCD, CD and D. You can do it with this single regex:

(?=(\w+))

At the first position in the string (before the A), the engine starts the first match attempt. The lookahead asserts that what immediately follows the current position is one or more word characters, and captures these characters to Group 1. The lookahead succeeds, and so does the match attempt. Since the pattern didn't match any actual characters (the lookahead only looks), the engine returns a zero-width match (the empty string). It also returns what was captured by Group 1: ABCD

The engine then moves to the next position in the string and starts the next match attempt. Again, the lookahead asserts that what immediately follows that position is word characters, and captures these characters to Group 1. The match succeeds, and Group 1 contains BCD.

The engine moves to the next position in the string, and the process repeats itself for CD then D.
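
The walkthrough above can be reproduced with a short Java sketch. The class name LookaheadSuffixes and the helper suffixes are hypothetical names chosen for illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadSuffixes {
    // Hypothetical helper: collects what Group 1 captures at each position
    // where the zero-width lookahead succeeds.
    static List<String> suffixes(String input) {
        Matcher m = Pattern.compile("(?=(\\w+))").matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            // The overall match is empty; the useful text sits in Group 1.
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(suffixes("ABCD")); // [ABCD, BCD, CD, D]
    }
}
```

Each call to find() yields an empty match, so the matcher advances by a single character before the next attempt. That is why every suffix gets a turn.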

So, use

(?=(~[^\s~]+~))


The pattern (?=(~[^\s~]+~)) checks each position in the string and looks for a ~ followed by one or more characters other than whitespace and ~, then another ~. The + quantifier requires at least one character between the tildes, so a bare ~~ never matches. Because the lookahead consumes no characters, the regex index advances by only one position after each check rather than past the captured text, so overlapping substrings get extracted.

Java demo:

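import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The statements below belong inside a method body, e.g. main
String text = " ~*_abc~xyz~ ~123~";
// The lookahead consumes nothing, so the closing ~ of one match can open the next
Pattern p = Pattern.compile("(?=(~[^\\s~]+~))");
Matcher m = p.matcher(text);
List<String> res = new ArrayList<>();
while (m.find()) {
    res.add(m.group(1)); // Group 1 holds the text captured inside the lookahead
}
System.out.println(res); // => [~*_abc~, ~xyz~, ~123~]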
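
To see what the lookahead changes, here is a minimal sketch that runs the same inner pattern with and without the lookahead wrapper. The class TildeOverlap and the helper findAll are hypothetical names introduced only for this comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TildeOverlap {
    // Hypothetical helper: returns the given group from every match of regex in text.
    static List<String> findAll(String regex, int group, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(group));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "~*_abc~xyz~ ~123~";

        // Consuming pattern: the ~ that closes ~*_abc~ is used up,
        // so it cannot also open ~xyz~.
        System.out.println(findAll("~[^\\s~]+~", 0, text));
        // [~*_abc~, ~123~]

        // Lookahead pattern: nothing is consumed, so the shared ~ is re-examined.
        System.out.println(findAll("(?=(~[^\\s~]+~))", 1, text));
        // [~*_abc~, ~xyz~, ~123~]

        // A bare ~~ has no character between the tildes and is not matched.
        System.out.println(findAll("(?=(~[^\\s~]+~))", 1, "~~"));
        // []
    }
}
```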

Just in case someone needs a Python demo:

import re
p = re.compile(r'(?=(~[^\s~]+~))')
test_str = " ~*_abc~xyz~ ~123~"
print(p.findall(test_str))
# => ['~*_abc~', '~xyz~', '~123~']
Wiktor Stribiżew answered Oct 31 '22 23:10