How to get all substrings matching a given regex?

Tags: java, string, regex

I need to get all substrings that match a regex. I know I could probably build an automaton for it, but I am looking for a simpler solution.
The problem is that Matcher.find() doesn't return all results.

String str = "abaca";
Matcher matcher = Pattern.compile("a.a").matcher(str);
while (matcher.find()) {
   System.out.println(str.substring(matcher.start(),matcher.end()));
}

The result is aba, not aba, aca as I want.
Any ideas?
EDIT: another example: for string=abaa and regex=a.*a, I expect to get aba, abaa, aa.
P.S. If this cannot be achieved using regular expressions, that is also an answer. I just want to know I'm not reinventing the wheel for something the language already provides.

asked Apr 18 '11 15:04 by amit

2 Answers

You could do something like this:

import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static List<String> getAllMatches(String text, String regex) {
        List<String> matches = new ArrayList<String>();
        Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(text);
        while(m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(getAllMatches("abaca", "a.a"));
        System.out.println(getAllMatches("abaa", "a.*a"));
    }
}

which prints:

[aba, aca]
[abaa, aa]

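The only thing is that aba is missing from the second list. The lookahead is attempted once per start position, so it yields at most one match per position; with the greedy .* in a.*a that is the longest one, abaa. A reluctant .*? would give aba instead, but then abaa would be lost, so a single regex pass cannot return every match that begins at the same position. You can get them all by iterating over every possible substring and calling .matches(regex) on each one: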

public static List<String> getAllMatches(String text, String regex) {
    List<String> matches = new ArrayList<String>();
    for(int length = 1; length <= text.length(); length++) {
        for(int index = 0; index <= text.length()-length; index++) {
            String sub = text.substring(index, index + length);
            if(sub.matches(regex)) {
                matches.add(sub);
            }
        }
    }
    return matches;
}

If your text stays relatively small, this will work. A string of length n has n(n+1)/2 non-empty substrings, though, and each one is matched separately, so for larger strings this may become too computationally expensive.
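A possible refinement of the brute-force approach, offered as a sketch rather than a definitive implementation: String.matches(regex) recompiles the pattern on every call, so compiling it once and restricting a single Matcher with region(start, end) avoids that repeated work. The class name AllSubstringMatches is made up for illustration. The number of candidate substrings is still quadratic, so this only lowers the constant cost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name, used only for this illustration.
public class AllSubstringMatches {

    public static List<String> getAllMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        // Compile once instead of once per substring.
        Matcher m = Pattern.compile(regex).matcher(text);
        for (int start = 0; start < text.length(); start++) {
            for (int end = start + 1; end <= text.length(); end++) {
                // Limit the matcher to text[start, end) and require a full match of that region.
                m.region(start, end);
                if (m.matches()) {
                    matches.add(text.substring(start, end));
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(getAllMatches("abaca", "a.a"));  // [aba, aca]
        System.out.println(getAllMatches("abaa", "a.*a"));  // [aba, abaa, aa]
    }
}
```

Results come out ordered by start position and then by length, which differs from the length-first order of the loop above.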

answered Sep 28 '22 11:09 by Bart Kiers

By default, each new match starts at the end of the previous one. If your matches can overlap, you need to specify the start point manually:

int start = 0;
while (matcher.find(start)) { 
    ...
    start = matcher.start() + 1;
}
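
For reference, a self-contained version of the loop above might look like the following sketch. The class and method names (OverlappingFind, findOverlapping) are invented for illustration, and the bounds check is an added assumption: Matcher.find(int) throws IndexOutOfBoundsException when the start index exceeds the input length, which can happen after an empty match at the very end.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, used only for this illustration.
public class OverlappingFind {

    public static List<String> findOverlapping(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(text);
        int start = 0;
        // find(int) resets the matcher and searches from the given index.
        while (start <= text.length() && matcher.find(start)) {
            matches.add(matcher.group());
            // Resume one character after the start of this match so overlaps are found.
            start = matcher.start() + 1;
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(findOverlapping("abaca", "a.a"));  // [aba, aca]
        System.out.println(findOverlapping("abaa", "a.*a"));  // [abaa, aa]
    }
}
```

Like the lookahead approach, this reports at most one match per start position, so aba is still missing for a.*a on abaa.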
answered Sep 28 '22 13:09 by axtavt