 

Text indexing algorithm

I am writing a C# WinForms application for an archiving system. The system has a large database in which some tables have more than 1.5 million records. What I need is an algorithm that indexes the content of these records. The files are mainly Microsoft Office, PDF, and TXT documents. Can anyone help? Ideas, links, books, or code are all appreciated. :)

Example: if I search for the word "international" in a certain folder in the database, I get all the files that contain that word, ordered by some criterion such as relevance, modification date, etc.

Majd asked Dec 23 '10



2 Answers

You need to create what is known as an inverted index, which is at the core of how search engines (a la Google) work. Apache Lucene is arguably the best library for inverted indexing. You have two options:

  1. Lucene.net - a .NET port of the Java Lucene library.

  2. Apache Solr - a full-fledged search server built on the Lucene libraries and easily integrated into your .NET application because it has a RESTful API. It comes out of the box with several features such as caching, scaling, and spell-checking. You can make your app-to-Solr interaction easier with the excellent SolrNet library.

For extracting text from your documents, Apache Tika offers a very extensive data/metadata extraction toolkit that works with PDFs, HTML, MS Office docs, etc. A simpler option would be to use the IFilter API. See this article for more details.
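To make the inverted-index idea above concrete, here is a minimal, language-agnostic sketch (shown in Python for brevity; your real app would use Lucene.net rather than hand-rolling this). The document contents and ids are made-up illustration data; real engines also do stemming, stop-word removal, and positional indexing, all of which are omitted here.

```python
from collections import defaultdict

def build_index(docs):
    # Inverted index: maps each word to the set of document ids that
    # contain it, so looking up a word is one dictionary access
    # instead of a scan over every record in the database.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

# Hypothetical sample records standing in for extracted file contents.
docs = {
    1: "international archiving system",
    2: "local file archive",
    3: "international trade documents",
}
index = build_index(docs)
print(sorted(index["international"]))  # ids of files containing the word
```

This is exactly the structure Lucene builds and persists for you, along with the scoring machinery needed to order the matches.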

Mikos answered Oct 19 '22


It looks like you need two things.

First, you need a system that actually performs the indexing. For this you can go with Lucene or Apache Solr, as Mikos mentioned. You might also want to check out Sphinx, another full-text search engine. Alternatively, you could use the full-text features built into your database: both SQL Server and MySQL have full-text indexing capabilities, as do many other databases.

The second thing you need is a way to get the text out of the files. For TXT and HTML files this is easy, because most full-text search engines will accept them as plain text. For more complicated binary formats like MS Word or PDF, you'll have to find another way to extract the text.
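The asker also wants results ordered by relevance. A crude way to do that, once the index has returned the matching files, is to score each match by how often the query term occurs in it. The sketch below (Python for brevity; the sample documents are made up) uses raw term frequency; real engines like Lucene and the database full-text features mentioned above use more robust scoring such as TF-IDF or BM25.

```python
from collections import Counter

def rank_by_term_frequency(docs, matching_ids, term):
    # Score each matching document by the number of times the query
    # term appears in it, then sort highest-scoring first.
    term = term.lower()
    scores = {
        doc_id: Counter(docs[doc_id].lower().split())[term]
        for doc_id in matching_ids
    }
    return sorted(matching_ids, key=lambda d: scores[d], reverse=True)

# Hypothetical extracted file contents.
docs = {
    1: "international news international trade",
    2: "international archive",
}
print(rank_by_term_frequency(docs, [1, 2], "international"))
```

Sorting by modification date instead would just mean swapping the score dictionary for a file-timestamp lookup.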

Kibbee answered Oct 19 '22