
Extract every complete word that contains a certain substring

Tags: java, regex

I'm trying to write a function that extracts every word containing a certain substring from a sentence. For example, looking for 'Po' in 'Porky Pork Chop' should return Porky Pork.

I've tested my regex on regexpal but the Java code doesn't seem to work. What am I doing wrong?

private static String foo()
    {

        String searchTerm = "Pizza";
        String text = "Cheese Pizza";

        String sPattern =  "(?i)\b("+searchTerm+"(.+?)?)\b";
        Pattern pattern = Pattern.compile ( sPattern );
        Matcher matcher = pattern.matcher ( text );
        if(matcher.find ())
        {
            String result = "-";
            for(int i=0;i < matcher.groupCount ();i++)
            {
                result+= matcher.group ( i ) + " ";
            }
            return result.trim ();
        }else
        {
            System.out.println("No  Luck");
        }
    }
W.K.S asked Jul 27 '13 20:07


2 Answers

  1. In a Java string literal, \b is the backspace character. To pass the word boundary \b to the regex engine, you need to write it as \\b.

  2. Judging by your example, you want to return every word that contains your substring. To do this, don't use for(int i=0;i < matcher.groupCount ();i++) but while(matcher.find()): the group loop iterates over the groups of a single match, not over all matches.

  3. If your search term can contain special regex characters, you should probably wrap it in Pattern.quote(searchTerm).

  4. In your code you are trying to find "Pizza" in "Cheese Pizza", so I assume you also want to find words that are identical to the searched substring. Your regex will work for that, but you can change its last part, (.+?)?, to \\w*, and also add \\w* at the start if the substring should be matched in the middle of a word too (not only at its start).
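Points 1 and 3 of the list above can be checked with a small sketch. The class and method names here (EscapeDemo, findWithSingleBackslash, and so on) are hypothetical and only serve the illustration; the calls themselves are the standard java.util.regex ones.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative, not from the original post.
final class EscapeDemo {

    // "\b" in a Java string literal is the single backspace character (U+0008),
    // so this pattern looks for literal backspaces around "Pizza".
    static boolean findWithSingleBackslash(String text) {
        return Pattern.compile("\bPizza\b").matcher(text).find();
    }

    // "\\b" reaches the regex engine as \b, the word-boundary assertion.
    static boolean findWithDoubleBackslash(String text) {
        return Pattern.compile("\\bPizza\\b").matcher(text).find();
    }

    // Without quoting, metacharacters such as '.' keep their regex meaning.
    static boolean findUnquoted(String term, String text) {
        return Pattern.compile(term).matcher(text).find();
    }

    // Pattern.quote makes every character of the term literal.
    static boolean findQuoted(String term, String text) {
        return Pattern.compile(Pattern.quote(term)).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(findWithSingleBackslash("Cheese Pizza")); // false
        System.out.println(findWithDoubleBackslash("Cheese Pizza")); // true
        System.out.println(findUnquoted("a.c", "abc"));              // true
        System.out.println(findQuoted("a.c", "abc"));                // false
        System.out.println(findQuoted("a.c", "a.c"));                // true
    }
}
```

The first call fails because the text contains no backspace characters, which matches the symptom described in the question.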

So your code can look like this:

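import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static String foo() {

    String searchTerm = "Pizza";
    String text = "Cheese Pizza, Other Pizzas";

    // \\w* on both sides lets the term match anywhere inside a word;
    // Pattern.quote treats the term as literal text.
    String sPattern = "(?i)\\b\\w*" + Pattern.quote(searchTerm) + "\\w*\\b";
    StringBuilder result = new StringBuilder("-").append(searchTerm).append(": ");

    Pattern pattern = Pattern.compile(sPattern);
    Matcher matcher = pattern.matcher(text);
    // Each call to find() advances to the next matching word.
    while (matcher.find()) {
        result.append(matcher.group()).append(' ');
    }
    return result.toString().trim();
}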
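As a usage sketch, the same pattern can be wrapped in a method that takes the text and the term as parameters and returns the matching words as a list. The class name WordFinder and the method wordsContaining are assumptions made for this example, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and signature are illustrative only.
final class WordFinder {

    // Returns every word in text that contains term, ignoring case.
    static List<String> wordsContaining(String text, String term) {
        Pattern pattern = Pattern.compile("(?i)\\b\\w*" + Pattern.quote(term) + "\\w*\\b");
        Matcher matcher = pattern.matcher(text);
        List<String> words = new ArrayList<>();
        // One find() call per matching word; group() is the whole word.
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsContaining("Porky Pork Chop", "Po"));               // [Porky, Pork]
        System.out.println(wordsContaining("Cheese Pizza, Other Pizzas", "pizza")); // [Pizza, Pizzas]
    }
}
```

Returning a list instead of a concatenated string leaves the formatting to the caller; whether that suits the original use case is a judgment call.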
Pshemo answered Oct 13 '22 02:10


While the regex approach is certainly a valid method, I find it easier to think through when you split the words up by whitespace. This can be done with String's split method.

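import java.util.ArrayList;
import java.util.List;

public List<String> doIt(final String inputString, final String term) {
    final List<String> output = new ArrayList<String>();
    // Split on runs of whitespace; punctuation stays attached to the words.
    final String[] parts = inputString.split("\\s+");
    for(final String part : parts) {
        // indexOf returns 0 for a match at the start of a word, so test >= 0.
        if(part.indexOf(term) >= 0) {
            output.add(part);
        }
    }
    return output;
}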

Of course, it is worth noting that doing this will effectively make two passes through your input String: the first pass to find the whitespace characters to split on, and the second pass to look through each split word for your substring.

If one pass is necessary though, the regex path is better.
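A short sketch of how the split approach behaves on a few inputs may help when choosing between the two. It assumes the check is meant as "the word contains the term" and uses String.contains for that; the class name SplitDemo and the method wordsContainingBySplit are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; names are illustrative only.
final class SplitDemo {

    // Splits on whitespace and keeps the parts that contain term (case-sensitive).
    static List<String> wordsContainingBySplit(String input, String term) {
        List<String> output = new ArrayList<>();
        for (String part : input.split("\\s+")) {
            if (part.contains(term)) {
                output.add(part);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // A match at the start of a word is found.
        System.out.println(wordsContainingBySplit("Porky Pork Chop", "Po"));            // [Porky, Pork]
        // Punctuation stays attached, because only whitespace separates the parts.
        System.out.println(wordsContainingBySplit("Cheese Pizza, Other Pizzas", "Pizza")); // [Pizza,, Pizzas]
        // The comparison is case-sensitive.
        System.out.println(wordsContainingBySplit("Cheese Pizza", "pizza"));            // []
    }
}
```

If punctuation or case matters for the input at hand, the parts would need extra cleanup (for example, lowercasing both sides before comparing), which is one more reason the regex route can end up simpler.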

nicholas.hauschild answered Oct 13 '22 01:10