
Using Java regex, how do I check whether a string contains any of the words in a set?

I have a set of words, say: apple, orange, pear, banana, kiwi.

I want to check whether a sentence contains any of the words listed above and, if it does, find out which word matched. How can I accomplish this with a regex?

I am currently calling String.indexOf() for each word in my set. I am assuming this is not as efficient as regex matching?

user193116 asked Mar 01 '12 11:03



2 Answers

TL;DR: For simple substrings contains() is best, but for matching only whole words Regular Expressions are probably better.

The best way to see which method is more efficient is to test it.

You can use String.contains() instead of String.indexOf() to simplify your non-regexp code.

To search for any one of several words, the Regular Expression looks like this:

apple|orange|pear|banana|kiwi 

The | works as an OR in Regular Expressions.
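Since the question starts from a set of words rather than a fixed pattern, here is a minimal sketch of building that alternation from a Set<String>. The class and method names (WordSetPattern, buildPattern, firstMatch) are made up for illustration and are not part of the original answer. Pattern.quote() is used on the assumption that the words should be matched literally, even if they contain regex metacharacters.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class WordSetPattern {

    // Joins the words with '|' so the pattern matches any one of them.
    // Pattern.quote() makes each word a literal, so "a.b" matches only "a.b".
    // Note: an empty set produces an empty pattern, which matches everywhere.
    static Pattern buildPattern(Set<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the leftmost match in the sentence, or null if no word occurs.
    static String firstMatch(Pattern pattern, String sentence) {
        Matcher m = pattern.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Set<String> words = new LinkedHashSet<>();
        words.add("apple");
        words.add("orange");
        words.add("pear");
        words.add("banana");
        words.add("kiwi");

        // Compile once and reuse; compiling inside a loop would be wasteful.
        Pattern p = buildPattern(words);

        System.out.println(firstMatch(p, "An apple is nice"));     // apple
        System.out.println(firstMatch(p, "The quick brown fox"));  // null
    }
}
```

Compiling the pattern once and reusing it keeps the cost of Pattern.compile() out of the per-sentence work.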

My very simple test code looks like this:

import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {

    // Returns the first word in the set that occurs in the sentence, or null.
    private static String containsWord(Set<String> words, String sentence) {
        for (String word : words) {
            if (sentence.contains(word)) {
                return word;
            }
        }
        return null;
    }

    // Returns the first text matched by the pattern, or null.
    private static String matchesPattern(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        if (m.find()) {
            return m.group();
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<String>();
        words.add("apple");
        words.add("orange");
        words.add("pear");
        words.add("banana");
        words.add("kiwi");

        Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

        String noMatch = "The quick brown fox jumps over the lazy dog.";
        String startMatch = "An apple is nice";
        String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

        long start = System.currentTimeMillis();
        int iterations = 10000000;

        for (int i = 0; i < iterations; i++) {
            containsWord(words, noMatch);
            containsWord(words, startMatch);
            containsWord(words, endMatch);
        }

        System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");
        start = System.currentTimeMillis();

        for (int i = 0; i < iterations; i++) {
            matchesPattern(p, noMatch);
            matchesPattern(p, startMatch);
            matchesPattern(p, endMatch);
        }

        System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
    }
}

The results I got were as follows:

Contains took 5962ms
Regular Expression took 63475ms

Obviously timings will vary depending on the number of words being searched for and the Strings being searched, but contains() does seem to be ~10 times faster than regular expressions for a simple search like this.

By using Regular Expressions to search for Strings inside another String you're using a sledgehammer to crack a nut so I guess we shouldn't be surprised that it's slower. Save Regular Expressions for when the patterns you want to find are more complex.

One case where you may want to use Regular Expressions is if indexOf() and contains() won't do the job because you only want to match whole words and not just substrings, e.g. you want to match pear but not spears. Regular Expressions handle this case well as they have the concept of word boundaries.

In this case we'd change our pattern to:

\b(apple|orange|pear|banana|kiwi)\b 

The \b matches only at a word boundary, i.e. the beginning or end of a word, and the parentheses group the OR expressions together so that both \b anchors apply to every word in the list.

Note, when defining this pattern in your code you need to escape the backslashes with another backslash:

 Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b"); 
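To make the difference concrete, here is a small sketch comparing the plain alternation with the word-boundary version on the pear/spears example mentioned above. The class name WholeWordMatch, the helper find() and the sample sentences are assumptions made for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WholeWordMatch {

    // Plain alternation: matches the words anywhere, even inside longer words.
    static final Pattern SUBSTRING =
            Pattern.compile("apple|orange|pear|banana|kiwi");

    // Same alternation wrapped in \b anchors: matches whole words only.
    static final Pattern WHOLE_WORD =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    // Returns the leftmost match in the sentence, or null if there is none.
    static String find(Pattern pattern, String sentence) {
        Matcher m = pattern.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String sentence = "The guards carried spears";

        System.out.println(find(SUBSTRING, sentence));   // pear (found inside "spears")
        System.out.println(find(WHOLE_WORD, sentence));  // null
        System.out.println(find(WHOLE_WORD, "A ripe pear, please"));  // pear
    }
}
```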
Dave Webb answered Sep 17 '22 16:09


I don't think a regexp will do a better job in terms of performance, but you can use it as follows:

Pattern p = Pattern.compile("(apple|orange|pear)");
Matcher m = p.matcher(inputString);
while (m.find()) {
    String matched = m.group(1);
    // Do something
}
Guillaume Polet answered Sep 20 '22 16:09