I read the source code of java.lang.String
and I was surprised to find that String.indexof()
does not use the Knuth–Morris–Pratt algorithm? As we know, KMP is more effective. So why isn't it used in String.indexOf()
?
Someone around me told me that for short string KMP is good enough, but if you need performance and you intend to use with large strings then is not a good choice. However he didn't tell me the details.
So, here are my questions:
String.indexOf()
?The complexity of Java's implementation of indexOf is O(m*n) where n and m are the length of the search string and pattern respectively. What you can do to improve complexity is to use e.g., the Boyer-More algorithm to intelligently skip comparing logical parts of the string which cannot match the pattern.
indexOf() – also runs in linear time. It iterates through the internal array and checks each element one by one, so the time complexity for this operation always requires O(n) time.
Java String indexOf() Method The indexOf() method returns the position of the first occurrence of specified character(s) in a string. Tip: Use the lastIndexOf method to return the position of the last occurrence of specified character(s) in a string.
The indexOf method accepts two parameters. The first is the element to check if given array contains it or not. The other parameter, starting index, is optional.
KMP has better worst-case performance, but actually requires a little bit of up-front computation (to generate the table of offsets). It also requires an initial memory allocation, which could also impact performance.
For (presumably) common use-cases of searching in relatively short strings, this might actually end up slower than the primitive implementation.
This, bundled with the fact that for really huge data sets you will probably be using more specialized data structures than a simple String
means that the increased implementation (and possibly runtime) cost is not worth investing.
Note that this might change in future Java versions, as the actual algorithm is not specified.
KMP and several other asymptotically efficient string search methods like Boyer-Moore and Boyer-Moore-Horspool require extra memory -- in the case of KMP, O(m) memory, where m is the size of the substring being searched for. Although this is often acceptable, library designers have to make tradeoffs so that their code performs acceptably well in many different situations. Probably the main reason is that due to both the preprocessing required by KMP, and its more complex inner loop in the search phase, the constant factor slowdown may make it several times slower than the naive O(mn) substring search in many common cases (e.g. searching for a substring of < 10 characters in a long string). Also, someone searching for a large substring might be perplexed to find the runtime library running out of memory as it tries to allocate a large memory buffer for the KMP fallback function table.
Perhaps a better question is why O(m+n)-time, O(1)-space algorithms like the Two-Way Algorithm have not yet been adopted by mainstream language runtime libraries. Again, the answer is likely to be the constant-factor slowdown in common cases. Nevertheless in at least one C runtime library implementation, the corresponding strstr()
function has been updated to use this algorithm.
Someone around me told me that for short string KMP is good enough, but if you need performance and you intend to use with large string then is not a good choice.
Well, that's exactly backwards from my understanding, which is that the naive O(mn) substring search is good enough (and probably the best) for short strings, but will eventually lose out to asymptotically faster O(m+n) algorithms like KMP as the strings become longer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With