Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

what is the fastest substring search method in Java

I need to implement a way to search substring (needles) in a list of string (haystack) using Java.

More specifically, my app has a list of user profiles. If I type some letters, for example, "Ja", and then search, then all the users whose name contains "ja" should show up. For instance, the result could be "Jack", "Jackson", "Jason", "Dijafu".

In Java, as I know, there are 3 build-in method to see search substring in a string.

  1. string.contains()

  2. string.indexOf()

  3. regular expression. it is something like string.matches("ja"))

My question is: What are the runtimes of each method above? which one is the fastest or most efficient or most popular way to check if the list of string contains a given substring.

I know there exists some algorithms that do the same thing, such as Boyer–Moore string search algorithm, Knuth–Morris–Pratt algorithm and so on. I do not want to use them because I just have a small list of strings, and I think using them is kind of overkill for me right now. Also I have to type a lot of extra coding for such a non-build-in algorithm. If you think my thoughts is not correct, please feel free to correct me.

like image 566
Joey Avatar asked Aug 20 '13 16:08

Joey


People also ask

What is the fastest string search algorithm?

The Aho-Corasick string searching algorithm simultaneously finds all occurrences of multiple patterns in one pass through the text. On the other hand, the Boyer-Moore algorithm is understood to be the fastest algorithm for a single pattern.

Is substring fast Java?

substring() is blazingly fast. The only thing that would be faster still is to avoid object creation completely by keeping track of the substring start and end positions in variables.

Which method is used to search for a substring in Java?

To locate a substring in a string, use the indexOf() method.

Does string contain slow Java?

String. contains is not always faster than regex. For normal cases, contains is faster.


1 Answers

The accepted answer is not correct and not complete.

  • indexOf() does a naive string search using backtracking on mismatches. This is quite fast on small patterns/texts but shows very poor performance on large texts
  • contains("ja") should be comparable to indexOf (because it delegates to it)
  • matches("ja") will not deliver the correct result, because it searches for an exact match (only the string "ja" will match exactly)
  • Pattern p = Pattern.compile("ja"); Matcher m = p.matcher("jack"); m.find(); would be the correct way to find texts with regular expressions. In practice (using large texts) it will be the most efficient way using only the java api. This is because a constant pattern (like "ja") will not be processed by the regex engine (which is slow) but by an Boyer-Moore-Algorithm (which is fast)
like image 153
CoronA Avatar answered Sep 21 '22 12:09

CoronA