I have a set of words, say: apple, orange, pear, banana, kiwi.
I want to check whether a sentence contains any of the words listed above and, if it does, find which word matched. How can I accomplish this with a regex?
I am currently calling String.indexOf() for each word in my set. I am assuming this is not as efficient as regex matching?
In PHP, you can use the strpos() function to check whether a string contains a specific word. strpos() returns the position of the first occurrence of a substring in a string, or false if the substring is not found. Note that string positions start at 0, not 1.
The contains() method checks whether a string contains a sequence of characters. Returns true if the characters exist and false if not.
The Java String contains() method checks whether a specific sequence of characters is part of the given string. It returns true if the specified characters are a substring of the given string and false otherwise, so it can be used directly inside an if statement.
It doesn't work with regex: it checks whether the exact String specified appears in the current String. Note that String.contains() does not check for word boundaries; it simply checks for a substring.
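To make the substring behaviour concrete, here is a minimal sketch. The class and method names (SubstringDemo, containsSubstring) are made up for illustration; only String.contains() itself comes from the text above.

```java
class SubstringDemo {
    // Hypothetical helper: a thin wrapper around String.contains().
    static boolean containsSubstring(String sentence, String word) {
        return sentence.contains(word);
    }

    public static void main(String[] args) {
        // "spears" contains the characters "pear", so this is a match
        // even though "pear" is not a separate word in the sentence.
        System.out.println(containsSubstring("He threw two spears", "pear"));
        // A genuine whole-word occurrence matches as well.
        System.out.println(containsSubstring("He ate a pear", "pear"));
        // contains() is case-sensitive, so "Pear" does not match "pear".
        System.out.println(containsSubstring("He ate a Pear", "pear"));
    }
}
```

The first call returning true is the case to watch for if you only want whole words.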
TL;DR: For simple substrings, contains() is best, but for matching only whole words, regular expressions are probably better.
The best way to see which method is more efficient is to test it.
You can use String.contains() instead of String.indexOf() to simplify your non-regexp code.
To search for several different words, the regular expression looks like this:
apple|orange|pear|banana|kiwi
The | works as an OR in regular expressions.
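If the word list is only known at run time, the alternation can be assembled from a collection instead of being typed by hand. The following is a sketch under that assumption; AlternationDemo, buildPattern and firstMatch are illustrative names, and Pattern.quote() is used so that a word containing regex metacharacters would still be matched literally.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class AlternationDemo {
    // Hypothetical helper: joins the words with "|" to form word1|word2|...
    // Each word is quoted so it is treated as a literal, not as regex syntax.
    static Pattern buildPattern(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(alternation);
    }

    // Returns the first word found in the sentence, or null if none matches.
    static String firstMatch(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Pattern p = buildPattern(List.of("apple", "orange", "pear", "banana", "kiwi"));
        System.out.println(firstMatch(p, "An apple is nice"));
        System.out.println(firstMatch(p, "The quick brown fox jumps over the lazy dog."));
    }
}
```

Compile the pattern once and reuse it; recompiling it for every sentence would add avoidable cost.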
My very simple test code looks like this:
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TestContains {

    // Returns the first word from the set that occurs in the sentence, or null.
    private static String containsWord(Set<String> words, String sentence) {
        for (String word : words) {
            if (sentence.contains(word)) {
                return word;
            }
        }
        return null;
    }

    // Returns the text matched by the pattern, or null if there is no match.
    private static String matchesPattern(Pattern p, String sentence) {
        Matcher m = p.matcher(sentence);
        if (m.find()) {
            return m.group();
        }
        return null;
    }

    public static void main(String[] args) {
        Set<String> words = new HashSet<String>();
        words.add("apple");
        words.add("orange");
        words.add("pear");
        words.add("banana");
        words.add("kiwi");

        Pattern p = Pattern.compile("apple|orange|pear|banana|kiwi");

        String noMatch = "The quick brown fox jumps over the lazy dog.";
        String startMatch = "An apple is nice";
        String endMatch = "This is a longer sentence with the match for our fruit at the end: kiwi";

        // Time the contains() approach.
        long start = System.currentTimeMillis();
        int iterations = 10000000;
        for (int i = 0; i < iterations; i++) {
            containsWord(words, noMatch);
            containsWord(words, startMatch);
            containsWord(words, endMatch);
        }
        System.out.println("Contains took " + (System.currentTimeMillis() - start) + "ms");

        // Time the regular expression approach.
        start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            matchesPattern(p, noMatch);
            matchesPattern(p, startMatch);
            matchesPattern(p, endMatch);
        }
        System.out.println("Regular Expression took " + (System.currentTimeMillis() - start) + "ms");
    }
}
The results I got were as follows:
Contains took 5962ms
Regular Expression took 63475ms
Obviously, timings will vary depending on the number of words being searched for and the Strings being searched, but contains() does seem to be ~10 times faster than regular expressions for a simple search like this.
By using Regular Expressions to search for Strings inside another String you're using a sledgehammer to crack a nut so I guess we shouldn't be surprised that it's slower. Save Regular Expressions for when the patterns you want to find are more complex.
One case where you may want to use Regular Expressions is if indexOf() and contains() won't do the job because you only want to match whole words and not just substrings, e.g. you want to match pear but not spears. Regular Expressions handle this case well as they have the concept of word boundaries.
In this case we'd change our pattern to:
\b(apple|orange|pear|banana|kiwi)\b
The \b matches only at a word boundary (the beginning or end of a word), and the parentheses group the OR expressions together.
Note, when defining this pattern in your code you need to escape the backslashes with another backslash:
Pattern p = Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");
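As a rough illustration of the difference the word boundaries make, the sketch below wraps that pattern in a small helper. The class and method names (WholeWordDemo, findWholeWord) and the sample sentences are assumptions for the example, not part of the original test code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordDemo {
    // Same pattern as above: \b on both sides restricts matches to whole words.
    private static final Pattern FRUIT =
            Pattern.compile("\\b(apple|orange|pear|banana|kiwi)\\b");

    // Returns the first whole-word match, or null if there is none.
    static String findWholeWord(String sentence) {
        Matcher m = FRUIT.matcher(sentence);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // "spears" contains "pear" but is not the word "pear", so no match.
        System.out.println(findWholeWord("He threw two spears"));
        // Punctuation next to the word still counts as a word boundary.
        System.out.println(findWholeWord("A pear, please"));
        // The substring inside "spears" is skipped and "kiwi" is found.
        System.out.println(findWholeWord("Two spears and a kiwi"));
    }
}
```

Note that plurals such as "apples" are also rejected by this pattern, because the trailing \b requires the word to end right after "apple".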
I don't think a regexp will do a better job in terms of performance, but you can use one as follows:
Pattern p = Pattern.compile("(apple|orange|pear)");
Matcher m = p.matcher(inputString);
while (m.find()) {
    String matched = m.group(1);
    // Do something
}
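One possible way to fill in the "Do something" step is to collect every matched word into a list. This is only a sketch: AllMatchesDemo and findAll are hypothetical names, and the pattern is the three-word one from the snippet above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllMatchesDemo {
    private static final Pattern FRUIT = Pattern.compile("(apple|orange|pear)");

    // Collects every occurrence of a listed word, in order of appearance.
    static List<String> findAll(String inputString) {
        List<String> found = new ArrayList<>();
        Matcher m = FRUIT.matcher(inputString);
        while (m.find()) {
            found.add(m.group(1)); // group(1) is the word inside the parentheses
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("An apple, a pear and another apple"));
        System.out.println(findAll("No fruit here"));
    }
}
```

Because find() resumes after the previous match, repeated words appear in the list once per occurrence.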