Writing an Inverted Index in C# for an information retrieval application

Tags: c#, search, data-structures, full-text-search

I am writing an in-house application that holds several pieces of text along with a number of pieces of data about each piece of text. This data will be held in a database (SQL Server, although this could change) in order of entry.

I'd like to be able to search these pieces of text and have the most relevant results appear at the top. I originally looked into using SQL Server Full-Text Search, but it's not as flexible for my other needs as I had hoped, so it seems that I'll need to develop my own solution.

From what I understand, what is needed is an inverted index, with the contents of that index then restored and modified based on the additional information held. For now that second step can be left for a later date, as I just want the inverted index to index the main text from the database table/strings provided.

I've had a crack at writing this code in Java, using a Hashtable with the words as keys and a list of each word's occurrences as the value, but in all honesty I'm still rather new at C# and have only really used things like DataSets and DataTables when handling information. If requested, I'll upload the Java code once I've cleared this laptop of viruses.

If given a set of entries from a table or from a List of Strings, how could one create an inverted index in C# that will preferably save into a DataSet/DataTable?

EDIT: I forgot to mention that I have already tried Lucene and Nutch, but require my own solution as modifying Lucene to meet my needs would take far longer than writing an inverted index. I'll be handling a lot of meta-data that'll also need handling once the basic inverted index is completed, so all I require for now is a basic full-text search on one area using the inverted index. Finally, working on an inverted index isn't something I get to do every day so it'd be great to have a crack at it.

asked Jan 21 '10 15:01

Mike B

1 Answer

Here's a rough overview of an approach I've used successfully in C# in the past:

    using System.Collections.Generic;

    // One occurrence of a word: which record it came from and where in that record's text.
    struct WordInfo
    {
        public int position;
        public int fieldID;
    }

    Dictionary<string, List<WordInfo>> invertedIndex = new Dictionary<string, List<WordInfo>>();

    public void BuildIndex()
    {
        foreach (int fieldID in GetDatabaseFieldIDS())
        {
            string textField = GetDatabaseTextFieldForID(fieldID);
            string word;
            int position = 0;

            while (GetNextWord(textField, out word, ref position))
            {
                // Fetch the list of occurrences for this word, creating it the first time the word is seen.
                List<WordInfo> occurrences;
                if (!invertedIndex.TryGetValue(word, out occurrences))
                {
                    occurrences = new List<WordInfo>();
                    invertedIndex.Add(word, occurrences);
                }

                WordInfo wi = new WordInfo();
                wi.position = position;
                wi.fieldID = fieldID;
                occurrences.Add(wi);
            }
        }
    }

Notes:

GetNextWord() iterates through the field and returns the next word and position. To implement it, look at using string.IndexOf() and the char classification methods (char.IsLetter(), char.IsLetterOrDigit(), etc.).
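As a rough illustration of the GetNextWord() note, here is one possible sketch. The signature is inferred from how BuildIndex() calls it; the Tokenizer class name, the choice to treat any run of letters or digits as a word, and the lower-casing are all assumptions rather than part of the original answer. Note that this version leaves position just past the word it returns, so BuildIndex() would record the word's end offset.

```csharp
using System;

public static class Tokenizer
{
    // Scans text from position onwards. Returns true and sets word when a word is found,
    // leaving position just past that word; returns false when no words remain.
    public static bool GetNextWord(string text, out string word, ref int position)
    {
        // Skip anything that is not a letter or digit (spaces, punctuation, etc.).
        while (position < text.Length && !char.IsLetterOrDigit(text[position]))
        {
            position++;
        }

        if (position >= text.Length)
        {
            word = string.Empty;
            return false;
        }

        // Consume the run of letters/digits that makes up the word.
        int start = position;
        while (position < text.Length && char.IsLetterOrDigit(text[position]))
        {
            position++;
        }

        // Assumption: lower-case the word so "Hello" and "hello" share one index entry.
        word = text.Substring(start, position - start).ToLowerInvariant();
        return true;
    }
}
```

If the start offset is more useful for your searches, capture it before the second loop and return that instead.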

GetDatabaseTextFieldForID() and GetDatabaseFieldIDS() are self-explanatory; implement as required.
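Since the question asks for the index to end up in a DataSet/DataTable, the dictionary could be flattened into one row per occurrence along these lines. This is only a sketch: the IndexExport class, the table name and the column names are made up for illustration, and WordInfo is repeated here (marked public) so the snippet compiles on its own.

```csharp
using System.Collections.Generic;
using System.Data;

public struct WordInfo
{
    public int position;
    public int fieldID;
}

public static class IndexExport
{
    // Flattens the in-memory index into a table with one row per (word, fieldID, position).
    public static DataTable ToDataTable(Dictionary<string, List<WordInfo>> invertedIndex)
    {
        // Table and column names are hypothetical; rename to suit your schema.
        DataTable table = new DataTable("InvertedIndex");
        table.Columns.Add("Word", typeof(string));
        table.Columns.Add("FieldID", typeof(int));
        table.Columns.Add("Position", typeof(int));

        foreach (KeyValuePair<string, List<WordInfo>> entry in invertedIndex)
        {
            foreach (WordInfo wi in entry.Value)
            {
                table.Rows.Add(entry.Key, wi.fieldID, wi.position);
            }
        }

        return table;
    }
}
```

A flat layout like this should also map fairly directly onto a SQL Server table if you later want to persist the index.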

answered Sep 21 '22 22:09

Ash
