 

Text indexing algorithm

I am writing a C# WinForms application for an archiving system. The system has a large database in which some tables have more than 1.5 million records. What I need is an algorithm that indexes the content of these records. The files are mainly Microsoft Office, PDF, and TXT documents. Can anyone help? Ideas, links, books, or code are all appreciated. :)

Example: if I search for the word "international" in a certain folder in the database, I get all the files that contain that word, ordered by some criterion such as relevance, modification date, etc.

Majd asked Dec 23 '10



2 Answers

You need to create what is known as an inverted index, which is at the core of how search engines (a la Google) work. Apache Lucene is arguably the best library for inverted indexing. You have two options:

  1. Lucene.net - a .NET port of the Java Lucene library.

  2. Apache Solr - a full-fledged search server built on the Lucene libraries and easily integrated into your .NET application because it has a RESTful API. It comes out of the box with several features such as caching, scaling, and spell-checking. You can make your app-to-Solr interaction easier with the excellent SolrNet library.

For extracting text from your documents, Apache Tika offers a very extensive data/metadata extraction toolkit that works with PDFs, HTML, MS Office docs, etc. A simpler option would be to use the IFilter API. See this article for more details.
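To make the inverted-index idea above concrete, here is a minimal, language-agnostic sketch (shown in Python for brevity; your real app would use Lucene.net rather than hand-rolling this). The document contents and ids are made-up illustration data; real engines also do stemming, stop-word removal, and positional indexing, all of which are omitted here.

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: maps each word to the set of document ids that
    # contain it, so looking up a word is one dictionary access
    # instead of a scan over every record in the database.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical sample records standing in for extracted file contents.
docs = {
    1: "international archiving system",
    2: "local file archive",
    3: "international trade documents",
}
index = build_index(docs)
print(sorted(index["international"]))  # ids of files containing the word
```

This is exactly the structure Lucene builds and persists for you, along with the scoring machinery needed to order the matches.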

Mikos answered Oct 19 '22


It looks like you need two things.

First, you need a system that actually performs the indexing. For this you can go with Lucene or Apache Solr, as Mikos mentioned. You might also want to check out Sphinx, another full-text search engine. Alternatively, you could use the full-text features built into your database: both SQL Server and MySQL have full-text indexing capabilities, as do many other databases.

The second thing you need is a way to get the text out of the files. For TXT and HTML files this is easy, because most full-text search engines will accept them as plain text. For more complicated binary formats like MS Word or PDF, you'll have to find another way to extract the text.
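The asker also wants results ordered by relevance. A crude way to do that, once the index has returned the matching files, is to score each match by how often the query term occurs in it. The sketch below (Python for brevity; the sample documents are made up) uses raw term frequency; real engines like Lucene and the database full-text features mentioned above use more robust scoring such as TF-IDF or BM25.

```python
from collections import Counter

def rank_by_term_frequency(docs, matching_ids, term):
    # Score each matching document by the number of times the query
    # term appears in it, then sort highest-scoring first.
    term = term.lower()
    scores = {
        doc_id: Counter(docs[doc_id].lower().split())[term]
        for doc_id in matching_ids
    }
    return sorted(matching_ids, key=lambda d: scores[d], reverse=True)

# Hypothetical extracted file contents.
docs = {
    1: "international news international trade",
    2: "international archive",
}
print(rank_by_term_frequency(docs, [1, 2], "international"))
```

Sorting by modification date instead would just mean swapping the score dictionary for a file-timestamp lookup.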

Kibbee answered Oct 19 '22