
C# Finding relevant document snippets for search result display

In developing search for a site I am building, I decided to go the cheap and quick way and use Microsoft SQL Server's Full-Text Search engine instead of something more robust like Lucene.Net.

One of the features I would like to have, though, is Google-esque relevant document snippets. I quickly found that determining "relevant" snippets is more difficult than I had realized.

I want to choose snippets based on search term density in the found text. So, essentially, I need to find the most search-term-dense passage in the text, where a passage is some arbitrary number of characters (say 200, but it really doesn't matter).

My first thought is to use .IndexOf() in a loop and build an array of term distances (subtract the index of the previously found term from the index of the current one), then ... what? Add up any two, any three, any four, any five sequential array elements and use the run with the smallest sum (hence, the smallest distance between search terms).

That seems messy.

Is there an established, better, or more obvious way to do this than what I have come up with?

CleverPatrick asked Nov 11 '08 20:11


2 Answers

Although it is implemented in Java, you can see one approach to this problem here: http://rcrezende.blogspot.com/2010/08/smallest-relevant-text-snippet-for.html

Rodes answered Sep 19 '22 23:09


I know this thread is way old, but I gave this a try last week and it was a pain in the backside. This is far from perfect, but this is what I came up with.

The snippet generator:

// Requires: using System; using System.Collections.Generic;
public static string SelectKeywordSnippets(string StringToSnip, string[] Keywords, int SnippetLength)
    {
        string snippedString = "";
        List<int> keywordLocations = new List<int>();

        //Get the locations of all keywords (Length avoids a System.Linq dependency)
        for (int i = 0; i < Keywords.Length; i++)
            keywordLocations.AddRange(SharedTools.IndexOfAll(StringToSnip, Keywords[i], StringComparison.CurrentCultureIgnoreCase));

        //Sort locations
        keywordLocations.Sort();

        //Merge locations that are closer together than SnippetLength into their midpoint,
        //repeating until no such pair remains. Comparing against SnippetLength / 2 here
        //would still let two SnippetLength-wide snippets overlap and repeat text.
        if (keywordLocations.Count > 1)
        {
            bool found = true;
            while (found)
            {
                found = false;
                for (int i = keywordLocations.Count - 1; i > 0; i--)
                    if (keywordLocations[i] - keywordLocations[i - 1] < SnippetLength)
                    {
                        keywordLocations[i - 1] = (keywordLocations[i] + keywordLocations[i - 1]) / 2;

                        keywordLocations.RemoveAt(i);

                        found = true;
                    }
            }
        }

        //Make the snippets; a leading ellipsis marks text skipped before the first one
        if (keywordLocations.Count > 0 && keywordLocations[0] - SnippetLength / 2 > 0)
            snippedString = "... ";
        foreach (int i in keywordLocations)
        {
            //Center a SnippetLength-wide window on the location, clamped to the string bounds
            int stringStart = Math.Max(0, i - SnippetLength / 2);
            int stringEnd = Math.Min(i + SnippetLength / 2, StringToSnip.Length);
            int stringLength = Math.Min(stringEnd - stringStart, StringToSnip.Length - stringStart);
            snippedString += StringToSnip.Substring(stringStart, stringLength);
            if (stringEnd < StringToSnip.Length) snippedString += " ... ";
            //Stop once the combined output passes 200 characters (hard-coded cap)
            if (snippedString.Length > 200) break;
        }

        return snippedString;

    }

The function that finds the index of every occurrence of a keyword in the sample text:

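 // Assumed to live in SharedTools; public so callers in other classes can reach it
 public static List<int> IndexOfAll(string haystack, string needle, StringComparison Comparison)
    {
        List<int> positions = new List<int>();

        // An empty needle matches at every offset without advancing, so the loop below would never end
        if (string.IsNullOrEmpty(needle))
            return positions;

        int pos;
        int offset = 0;
        int length = needle.Length;
        while ((pos = haystack.IndexOf(needle, offset, Comparison)) != -1)
        {
            positions.Add(pos);
            // Resume after this match so overlapping occurrences are not counted twice
            offset = pos + length;
        }
        return positions;
    }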

It's a bit clumsy in its execution. It works by finding the position of every keyword occurrence in the string, then checking that no two positions are closer to each other than the desired snippet length, so that snippets won't overlap (that's where it's a bit iffy...). It then grabs substrings of the desired length centered on the remaining positions and stitches the whole thing together.
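If only the single densest passage is wanted, as the question asks, one alternative to merging nearby positions is a sliding window over the sorted match positions. The following is a minimal sketch of that idea, not a drop-in replacement for the code above: the class name DensestWindow, the method FindDensestStart and the parameter windowLength are made up for illustration, and it counts a match as "inside" the window when the match starts there, so a keyword that begins near the end of the window may be cut off.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, named here for illustration only.
public static class DensestWindow
{
    // Returns the start index of the windowLength-character passage that
    // contains the most keyword match starts, or -1 if nothing matches.
    public static int FindDensestStart(string text, string[] keywords, int windowLength)
    {
        // Collect the start position of every match (ordinal, case-insensitive).
        var positions = new List<int>();
        foreach (string keyword in keywords)
        {
            if (string.IsNullOrEmpty(keyword)) continue; // an empty needle would never advance
            int pos = 0;
            while ((pos = text.IndexOf(keyword, pos, StringComparison.OrdinalIgnoreCase)) != -1)
            {
                positions.Add(pos);
                pos += keyword.Length;
            }
        }
        if (positions.Count == 0) return -1;
        positions.Sort();

        // Two-pointer sweep: for each right-hand match, advance the left-hand
        // match until both start within windowLength characters of each other.
        int bestStart = positions[0], bestCount = 0, left = 0;
        for (int right = 0; right < positions.Count; right++)
        {
            while (positions[right] - positions[left] >= windowLength) left++;
            int count = right - left + 1;
            if (count > bestCount)
            {
                bestCount = count;
                bestStart = positions[left];
            }
        }
        return bestStart;
    }
}
```

The snippet itself can then be cut with text.Substring(start, Math.Min(windowLength, text.Length - start)) whenever the returned start is not -1. After the sort, the sweep visits each position at most twice, so the cost is dominated by finding and sorting the matches.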

I know this is years late, but posting just in case it might help somebody coming across this question.

yu_ominae answered Sep 21 '22 23:09