Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster Algorithm for string comparing in c#

I have two sentences that needed to be compared to each-other. The final result is how much percent one sentence contains in the other, my problem is that I have 100.000 records that need to be compared with lets say another 10. That is 1.000.000 loops, which in my algorithm is very slow.

This is the algorithm that I am using:

private double BreakStringsAndCheck(string s1, string s2)
{
    if (s1 == null || s2 == null || s1.Length == 0 || s2.Length == 0)
        return (double)0;
    string[] firstArray = s1.Split(' ');
    string[] secondArray = s2.Split(' ');
    if (firstArray.Length > secondArray.Length)
    {
        string[] tempArray = firstArray;
        firstArray = secondArray;
        secondArray = tempArray;
    }
    double value = 0;
    for (int i = 0; i < firstArray.Length; i++)
        for (int j = 0; j < secondArray.Length; j++)
            value += firstArray[i] == secondArray[j] ? (double)100 : (double)0;
    return findLongest ? value : value / firstArray.Length;
}

It's a small method but it is not very fast. From my testing I can do 40-60 comparisons in 1 second, that is almost 5 hours for 1.000.000 loops.

Can some one think of another method or logic that is much faster than this?

Update:

I will try to explain the problem with more details. I have database with more than 100.000 records, and every day I insert, and compare 10-20 new record in this database. This records are sentences from 2 to 10 words and I need to write fast method that will compare this new records with those in database, the result should be percentage of how much one sentence contains the words from the other.

I need the records that has more than 70% word match.

I hope that I'm clear now.

like image 294
Pece Avatar asked Nov 23 '10 12:11

Pece


People also ask

Is strncmp faster than strcmp?

If n is sufficiently large that strncmp will compare the whole strings (so that the behavior becomes effectively the same as strcmp ) then strncmp is likely to be moderately slower because it also has to keep track of a counter, but the difference might or might not be measurable or even present in a given ...

What is the best way to compare strings?

The right way of comparing String in Java is to either use equals(), equalsIgnoreCase(), or compareTo() method. You should use equals() method to check if two String contains exactly same characters in same order. It returns true if two String are equal or false if unequal.

How do I properly compare strings in C?

The strcmp() function is a C library function used to compare two strings in a lexicographical manner. Syntax: int strcmp ( const char * str1, const char * str2 ); The function returns 0 if both the strings are equal or the same.

Is strcmp fast?

While comparing the two function, in a loop repeted 500'000'000 times, strcmp execute too much fast, about x750 times faster. The execution time of that function is 3.058s , while strcmp only 0.004s .


2 Answers

I'm not a C# programmer, but here are a few general tips:

  1. Move the floating point arithmetic out of the loop. You should be able to count the characters that match and do the division later.
  2. You should be able to run each "long" loop in a separate thread of execution since the data is static. I would spawn a separate thread for each of your "10" sentences and run them in parallel.
  3. You might want to remove the call to split if you can. Basically, remove any extra memory allocations.

The final thought is to grab an algorithms book or google for text processing algorithms. This problem sounds like something that has been solved over and over again. There is probably something in AOCP v3 that solves this problem. You could also profile the code (not sure what types of profilers are available), but that probably won't yield substantial improvements.

like image 189
D.Shawley Avatar answered Oct 07 '22 15:10

D.Shawley


Have you looked at the Intersect method as an alternative. I have no idea about its performance but it looks like it may work

like image 37
Ahmad Avatar answered Oct 07 '22 16:10

Ahmad