Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What can I do to speed up this code (String Similarity)?

Tags:

c++

string

This is a code written in C++ using standard libraries to find the String Similarity of a string S with each of it's suffixes.

Though, it gives the correct output, it takes a lot of time in doing so for large strings. Here's the code:

#include <iostream>
#include <string>
using namespace std;

int sim(string a, string b){
    int count=0;
    int sa=a.size();
    int sb=b.size();
    int iter;
    if(sa>sb) iter=sb;
    else iter=sa;
    for(int i=0; i<iter; i++){
        if (a[i]!=b[i]) break;
        else count++;
    }
    return count;
}

int strsim(string a){
    int sum=0;
    int s=a.size();
    for(int i=0; i<s; i++){
        sum=sum+sim(a,a.substr(i));
    }
    return sum;
}

int main(){
    int n;
    cin >> n;
    string a[n];
    for(int i=0; i<n; i++){
        cin >> a[i];
    }
    for(int i=0; i<n; i++){
        cout << strsim(a[i]) << '\n';
    }
}

Constraints: The length of each string is at most 100000 and contains only lower case characters and the number of testcases, 'n' cannot exceed 10.

Sample I/O:

Input:

1 ababaa

Output:

11

i.e, 6 + 0 + 3 + 0 + 1 + 1 = 11

like image 674
Ranveer Avatar asked Jul 25 '13 18:07

Ranveer


2 Answers

Your current code computes for a single string for length L, in O(L^3) (substr takes linear running time). Not to mention high constant costs to the above complexity due to inefficient passing around of strings.

Your algorithm can be simply reduced to finding longest common prefixes of a string with all its suffixes. This can be easily done using Suffix Aray. The concept cannot be explained as an answer, so I will highly recommend you to read this.

A sub optimal and simple to code Suffix Array solution will have O(Llg^2(L)) (L = string length) construction time and O(1) time to query Longest Common Prefix of 2 suffixes using Range Minimum Query. Note that the whole string itself is its own suffix. In your case, you need to make L queries for each string. So total complexity for one string will be O(Llg^2(L)) + O(L).

If you want to improve further, You may reduce the construction time to O(Llg(L)) by using radix sort, or to O(L) (Read)

like image 118
nims Avatar answered Nov 12 '22 07:11

nims


The biggest cost you have here is that you are passing the strings by value - that means that each call to "sim" creates two brand new strings and copies over the data to them. You should eliminate this and replace it with pass-by-reference.

#include <iostream>
#include <string>
#include <vector>
using namespace std;

size_t compareSubstring(const std::string& str, size_t offset)
{
    size_t count = 0;
    std::string::const_iterator lhsIt = str.begin();
    std::string::const_iterator rhsIt = str.begin() + offset;
    for ( ; rhsIt != str.end() ; ++lhsIt, ++rhsIt ) {
        if (*lhsIt != *rhsIt)
            break;
        ++count;
    }
    return count;
}

int compareString(const string& str)
{
    size_t count = 0;
    const size_t length = str.size();
    for(size_t i = 0; i < length; ++i) {
        count += compareSubstring(str, i);
    }
    return count;
}

int main()
{
    size_t numStrings = 0;
    std::cin >> numStrings;

    std::vector<std::string> strings;
    strings.resize(numStrings);

    for(size_t i = 0; i < numStrings; ++i) {
        std::cin >> strings[i];
    }

    for(size_t i = 0; i < numStrings; ++i) {
        std::cout << compareString(strings[i]) << std::endl;
    }
}
like image 3
kfsone Avatar answered Nov 12 '22 07:11

kfsone