
Writing full-text search algorithm in C# / Entity Framework - where to start?

I need to search a potentially large collection of sentences, and I have no idea where to start.

In summary, a user will submit a search phrase, for example "how do I delete my account", and I then need to go to the db and match records against the words provided.

At the moment I am thinking of doing something like the following:

  • Split the phrase into individual words
  • Remove very common words (and, if, etc.)
  • Somehow order the words by priority (no idea how to do this yet)
  • Using EF, loop through the words, doing a String.Contains on each db record for each word
  • If no results are found, remove some of the lower-priority words and search again
  • Repeat
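
A minimal sketch of the steps above, written in Python for brevity rather than C#/EF. The stop-word list, the in-memory `records` list standing in for the db table, and the `priority` callback are all assumptions made for illustration; the substring test plays the role of String.Contains.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "do", "how", "i", "if", "my", "the", "to"}

def tokenize(phrase):
    """Lower-case the phrase and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", phrase.lower())

def search(phrase, records, priority):
    """Return the records that contain every remaining search word.

    `priority` maps a word to a number (higher means more important).
    When nothing matches, the lowest-priority word is dropped and the
    search is repeated with the words that are left.
    """
    words = [w for w in tokenize(phrase) if w not in STOP_WORDS]
    # Most important words first, so the least important sit at the end.
    words.sort(key=priority, reverse=True)
    while words:
        # Substring match on every word, like String.Contains per record.
        hits = [r for r in records if all(w in r.lower() for w in words)]
        if hits:
            return hits
        words.pop()  # drop the lowest-priority word and try again
    return []
```

For example, searching `["How to delete your account", "Close or delete a profile"]` for "delete my profile picture" finds nothing with all three words, so the lowest-priority word is dropped before the second record matches. Note that this scans every record on each pass, which is the cost an index avoids.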

Could anyone point me in the right direction? Also, if anyone knows of any libraries for doing this sort of work, that would be great.

Cheers

jcvandan asked Dec 15 '11 15:12



3 Answers

As for prioritizing words, a simple but fairly effective solution is to sort them by how common they are (you could build a popularity index from the articles in your database), so that words that are rare in your texts count as more important. This way you boost the words that are less general.
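
One common way to turn "rare words matter more" into numbers is inverse document frequency. The sketch below is in Python for brevity and is only one possible reading of the popularity index described above; the function names and the choice of log(N / df) as the weight are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rarity_weights(documents):
    """Map each word to log(N / df), where df is the number of
    documents containing the word and N is the number of documents."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))  # count a word once per document
    n = len(documents)
    return {word: math.log(n / df) for word, df in doc_freq.items()}

def by_priority(words, weights):
    """Order words rarest first; words never seen get weight 0."""
    return sorted(words, key=lambda w: weights.get(w, 0.0), reverse=True)
```

A word that appears in every document gets weight 0, so it is the first candidate to drop when a search has to be relaxed.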

Another problem is that words can appear in different forms, such as past or future tense, so you might want to stem them. As far as I remember, one tool that has been ported to C# is the Snowball project.
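
To show why stemming helps, here is a deliberately crude suffix-stripping sketch in Python. It is not the Snowball algorithm, whose rules are far more careful; the suffix list and the minimum stem length are assumptions chosen only to make the idea visible.

```python
# Assumption: a tiny, hand-picked suffix list, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s", "e")

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return word
```

With this, "delete", "deleted", "deleting" and "deletes" all reduce to "delet", so a search for one form can match records that contain another. Both the indexed text and the search phrase need to go through the same stemmer.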

As for the second part of your problem, looping through the words might be very inefficient, so I think you should consider using an indexing library or solution. One popular option for .NET is Lucene.Net. It basically creates an inverted index, which maps terms (such as words) to the articles that contain them, and that lets you quickly find all occurrences of the given words in your texts. You could implement a similar approach yourself inside your database.
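
A minimal in-memory sketch of an inverted index, in Python for brevity. It only illustrates the data structure; it is not Lucene.Net's API or implementation, and the function names and the dictionary of articles keyed by id are assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids that contain it.

    `documents` is a dict of {doc_id: text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def lookup(index, words):
    """Return the ids of documents that contain every given word."""
    postings = [index.get(w, set()) for w in words]
    if not postings:
        return set()
    return set.intersection(*postings)
```

The index is built once, and each query then costs a few set intersections instead of a scan over every record. Inside a database, the same idea would be a table of (word, article id) rows with an index on the word column.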

Marcin Deptuła answered Nov 11 '22 23:11



Just in case anyone comes across this and wonders what I used in the end: I ended up using Lucene.NET. It's fantastic, and very easy to set up and use considering how powerful it is and how much functionality it adds. One thing I would say, though, is that the documentation isn't great. However, I did find a series of tutorials here that is a good introduction. I spent a morning going through these articles and I had ridiculously fast full-text indexing and searching in my app!

jcvandan answered Nov 11 '22 22:11



Use SQL Server's full-text search capability and wrap the full-text query in a stored procedure. Execute the stored procedure through either ADO.NET or EF.

Ladislav Mrnka answered Nov 11 '22 23:11
