Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

.NET library for text algorithms?

Do you know any .NET library for text algorithms??
Especially I'm interested in strings match, and full-text-search algorithms like

  • Bitap algorithm
  • Levenshtein distance
  • Damerau–Levenshtein distance

I know the one I have mentioned are pretty simple to code, but there are hundreds of text algorithms, i don't want to code them all by myself.
If there is no such .NET library known, you can mention C, C++ library, coding wrapper will be easer than coding from zero.

like image 509
Alex Burtsev Avatar asked Dec 22 '10 10:12

Alex Burtsev


5 Answers

You may be interested in checking out the google-diff-match-patch library on Google Code. They have an implementation of Myer's diff algorithm and it claims to also implement a Bitap algorithm "at the heart".

It has the C# source that you're looking for as well as implementations in Java, C++, Lua & Python. Although I don't have the best understanding of how to use Bitap in practice (there are demos in the Google Code project) I think you'll be most interested in the match functions starting around line 1476 of the current version.

UPDATE:

A little digging found an implementation of Levenshtein in C# on CodeProject.

Also, this C# class file contains an implementation of Levenshtein on SourceForge. The implementation is part of the Corsis (aka Tenka Text) project. Author claims that the YetiLevenshtein method (around line 741) is 2x to 10x faster than the implementation used in the CodeProject version of the algorithm referenced above.

UPDATE #2:

I just discovered the wikibook Algorithm implementation with it's C# version of Levenshtein Distance and had to include it because it looks pretty straight and to the point. This wikibook looks like a great reference to keep on hand in general.

Levenshtein Distance in C# (courtesy of Wikibooks)

    private Int32 levenshtein(String a, String b)
    {

        if (string.IsNullOrEmpty(a))
        {
            if (!string.IsNullOrEmpty(b))
            {
                return b.Length;
            }
            return 0;
        }

        if (string.IsNullOrEmpty(b))
        {
            if (!string.IsNullOrEmpty(a))
            {
                return a.Length;
            }
            return 0;
        }

        Int32 cost;
        Int32[,] d = new int[a.Length + 1, b.Length + 1];
        Int32 min1;
        Int32 min2;
        Int32 min3;

        for (Int32 i = 0; i <= d.GetUpperBound(0); i += 1)
        {
            d[i, 0] = i;
        }

        for (Int32 i = 0; i <= d.GetUpperBound(1); i += 1)
        {
            d[0, i] = i;
        }

        for (Int32 i = 1; i <= d.GetUpperBound(0); i += 1)
        {
            for (Int32 j = 1; j <= d.GetUpperBound(1); j += 1)
            {
                cost = Convert.ToInt32(!(a[i-1] == b[j - 1]));

                min1 = d[i - 1, j] + 1;
                min2 = d[i, j - 1] + 1;
                min3 = d[i - 1, j - 1] + cost;
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }

        return d[d.GetUpperBound(0), d.GetUpperBound(1)];

    }
like image 173
Saul Dolgin Avatar answered Nov 04 '22 09:11

Saul Dolgin


I managed to find implementations of most algorithms i need using combination of WikiPedia + Google Code search.

http://en.wikipedia.org/wiki/Category:Algorithms_on_strings
http://www.google.com/codesearch

Though it's strange that no one has created project on this subject, where interested people could collaborate on this.

like image 24
Alex Burtsev Avatar answered Nov 04 '22 10:11

Alex Burtsev


If you're doing string matching, Lucene.Net is worth a look.

However, I know this is not exactly what you're after, and while you can find most of those algorithms in some C# form around, I know of no library incorporating them (I've tended to keep a couple of these in my personal library).

Out of interest, why would you ever need more than one of these full-match algorithms with a couple of threshold parameters?

like image 3
Matt Mitchell Avatar answered Nov 04 '22 09:11

Matt Mitchell


I suggest SimMetrics library, it has many different algorithms for string matching. Available also on NuGet.

Short description:

SimMetrics is a Similarity Metric Library, e.g. from edit distance's (Levenshtein, Gotoh, Jaro etc) to other metrics, (e.g Soundex, Chapman).

GPLv2 license.

like image 2
watbywbarif Avatar answered Nov 04 '22 08:11

watbywbarif


here is one I implemented for Levenshtein / Damerau–Levenshtein distance:

    public static int GetDistance(string left, string right, bool isDamerauDistanceApplied)
    {
        if (left.Length == 0) return right.Length;
        if (right.Length == 0) return left.Length;

        var lenLeft = left.Length;
        var lenRight = right.Length;

        var matrix = new int[lenLeft + 1, lenRight + 1];

        for (var i = 0; i <= lenLeft; i++)
            matrix[i, 0] = i;

        for (var i = 0; i <= lenRight; i++)
            matrix[0, i] = i;

        for (var i = 1; i <= lenLeft; i++)
        {
            for (var j = 1; j <= lenRight; j++)
            {
                var cost = (left[i - 1] == right[j - 1]) ? 0 : 1;

                matrix[i, j] = Math.Min(Math.Min(matrix[i - 1, j] + 1, matrix[i, j - 1] + 1), matrix[i - 1, j - 1] + cost);

                if (isDamerauDistanceApplied)
                {
                    // Fixed for string base 0 index.
                    if (i > 1 && j > 1 && left[i - 1] == right[j - 2] && left[i - 2] == right[j - 1])
                    {
                        matrix[i, j] = Math.Min(matrix[i, j], matrix[i - 2, j - 2] + cost);
                    }
                }
            }
        }

        return matrix[lenLeft, lenRight];
    }
like image 1
mathsRuinedme Avatar answered Nov 04 '22 10:11

mathsRuinedme