Say I have the string "Torcellite" and another string "Tor" - the similarity length of these two strings is 3 since both of them begin with "Tor". Now another string "christmas" and "mas" would have a similarity of 0 since they do not begin with the same set of characters. In both cases, the second string is a suffix of the first string. A clearer example: Length of string: 1 to 10^5 String: <code>abaabc</code> Suffixes: <code>abaabc</code>, <code>baabc</code>, <code>aabc</code>, <code>abc</code>, <code>bc</code>, <code>c</code> Similarity: <code>abaabc</code>, none, <code>a</code>, <code>ab</code>, none, none Similarity Length: 6, 0, 1, 2, 0, 0 Answer: 6+0+1+2+0+0 = 9 I have an inefficient logic to find these partial suffixes matches using regex. Algorithm: <ul> <li>Find all the substrings of the given string.</li> <li> Make a pattern from the substrings of the suffixes. <pre class="prettyprint"><code>for(int i=1; i<substrings[i].length; i++) { Pattern p = Pattern.compile("^"+substrings[i].substring(0, i)); Matcher m = p.find(string); //the given string for which similarities need to be calculated if(m.find()) similaryLengths += i; } </code></pre> </li> <li>The complexity for this becomes roughly O(n^2) since I need to run through the string for suffixes and then the substrings for patterns.</li> <li>I've thought of using grouping in the pattern to find the groups but I'm unsure what the regex would look like. What I have in mind is for the very first substring is: <code>((((((a)b)a)a)b)c)</code> and then find the longest group match.</li> </ul> Is there a more efficient algorithm that can achieve his?

Simliar algorithm as already posted by Valdar Moridin, but without the need to create substrings (every call to <code>substring</code> will create a new <code>String</code> object that contains a copy of the specified range of the <code>char[]</code> of its source). This won't improve the time complexity, but probably reduces the total runtime for a constant factor: <pre class="prettyprint"><code>public static int partialSuffixMatch(CharSequence input) { int count = input.length(); for (int i = 1; i < input.length(); i++) { for (int a = 0, b = i; b < input.length(); a++, b++) { if (input.charAt(a) != input.charAt(b)) break; count++; } } return count; } </code></pre> After a short warmup, this algorithm processes a <code>String</code> with 10,000 equal characters in about 40 ms on my computer, and with 100,000 equal characters in about 4 seconds.

Regular Expression for partial match length - String similarity

Tags:

java

string

regex

pattern-matching

Say I have the string "Torcellite" and another string "Tor" - the similarity length of these two strings is 3 since both of them begin with "Tor". Now another string "christmas" and "mas" would have a similarity of 0 since they do not begin with the same set of characters.

In both cases, the second string is a suffix of the first string.

A clearer example:

Length of string: 1 to 10^5

String: abaabc

Suffixes: abaabc, baabc, aabc, abc, bc, c

Similarity: abaabc, none, a, ab, none, none

Similarity Length: 6, 0, 1, 2, 0, 0

Answer: 6+0+1+2+0+0 = 9

I have an inefficient logic to find these partial suffixes matches using regex.

Algorithm:

Find all the substrings of the given string.

Make a pattern from the substrings of the suffixes.

for(int i=1; i<substrings[i].length; i++) {
    Pattern p = Pattern.compile("^"+substrings[i].substring(0, i));
    Matcher m = p.find(string); //the given string for which similarities need to be  calculated
    if(m.find())
        similaryLengths +=  i;
}

The complexity for this becomes roughly O(n^2) since I need to run through the string for suffixes and then the substrings for patterns.
I've thought of using grouping in the pattern to find the groups but I'm unsure what the regex would look like. What I have in mind is for the very first substring is: ((((((a)b)a)a)b)c) and then find the longest group match.

Is there a more efficient algorithm that can achieve his?

268

asked Aug 11 '14 03:08

Karthik Balakrishnan

2 Answers

The best approach, by far, would be to build a suffix tree on the input string. Building suffix trees takes only O(n) time where n is the length of the string. A suffix tree consists logically of a tree in which all the suffixes of the string can be found by walking from the root to each leaf. You can read Wikipedia for more details on how these trees work.

Essentially, a suffix tree will allow you to simply recast your current problem as one of "finding" the original string in the suffix tree. As you walk down the tree, you count the number of suffixes in each subtree, and multiply by your current match length to determine your score. This "search" takes O(n) time too.

So the final result is that you can solve the problem in guaranteed O(n) time and O(n) space, with O(n) preprocessing time. This is pretty efficient! And, there are no "worst cases" that produce quadratic behaviour. With that you can probably handle strings up to 10^7 in length easily.

The only difficulty in implementation will be building the suffix tree, but you can find freely available code for that.

answered Oct 21 '22 23:10

nneonneo

Simliar algorithm as already posted by Valdar Moridin, but without the need to create substrings (every call to substring will create a new String object that contains a copy of the specified range of the char[] of its source). This won't improve the time complexity, but probably reduces the total runtime for a constant factor:

public static int partialSuffixMatch(CharSequence input) {
    int count = input.length();
    for (int i = 1; i < input.length(); i++) {
        for (int a = 0, b = i; b < input.length(); a++, b++) {
            if (input.charAt(a) != input.charAt(b))
                break;
            count++;
        }
    }
    return count;
}

After a short warmup, this algorithm processes a String with 10,000 equal characters in about 40 ms on my computer, and with 100,000 equal characters in about 4 seconds.

answered Oct 21 '22 22:10

isnot2bad

Related questions
                            
                                Java: socket read time out exception
                            
                                jmx/jstatd access to remote machine through an ssh tunnel
                            
                                Call method in JPA
                            
                                Programmatically configure PPTP/ L2TP VPN in Android
                            
                                Intercept Log Messages slf4j
                            
                                What does the exported attribute in .classpath do? [duplicate]
                            
                                Hungarian Algorithm: How to cover 0 elements with minimum lines?
                            
                                Is String.intern() thread safe
                            
                                Remove a single rule check from PMD in maven plugin
                            
                                System clear Property doesn't work. How can it be?
                            
                                How to manage project dependencies using Maven?
                            
                                Deep clone of Hibernate entity
                            
                                How to unpack an array into different arguments on method call
                            
                                Can I get a notification whenever the user interacts with an Android device?
                            
                                MessageBodyReader not found for media type=application/octet-stream
                            
                                How to create .docx files and .xlsx files on Android
                            
                                Spring Data JPA Specification to Select Specific Columns
                            
                                Joins in Java 8 Collection API
                            
                                sun.reflect.Reflection.getCallerClass alternative
                            
                                Spring MVC with hibernate Validator to validate single basic type

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With