
Document search on partial words

I am looking for a document search engine (like Xapian, Whoosh, Lucene, Solr, Sphinx or others) which is capable of searching partial terms.

For example, when searching for the term "brit", the search engine should return documents containing either "britney" or "britain", or in general any document containing a word that matches the wildcard pattern *brit*.

Tangentially, I noticed that most engines use TF-IDF (term frequency-inverse document frequency) or its derivatives, which score full terms rather than partial terms. Besides TF-IDF, are there other techniques that have been successfully implemented for document retrieval?

GeneralBecos asked Apr 26 '11 05:04




1 Answer

With Lucene you can implement this in several ways:

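1.) You can use a wildcard query such as *brit*. The query parser rejects leading wildcards by default, so you have to enable them first (in the classic QueryParser, via setAllowLeadingWildcard(true)). Leading wildcards can be slow on large indexes because many terms have to be scanned.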

2.) You can create an additional field containing the n-grams of all the terms. This results in a larger index, but searching is in many cases faster.

3.) You can use fuzzy search to handle typing mistakes in the query, e.g. someone typed "britnei" but wanted to find "britney". In Lucene's query syntax a fuzzy term is written with a trailing tilde (britnei~).
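
To make option 2 more concrete, here is a minimal sketch of the n-gram idea in plain Python. It is not Lucene code; the class NGramIndex and the helper char_ngrams are hypothetical names invented for this illustration, and the tokenization (lowercased runs of word characters) is an assumption. Every term is broken into character trigrams, and a partial query is answered by intersecting the term sets of its own trigrams and then verifying the surviving candidates.

```python
import re
from collections import defaultdict


def char_ngrams(term, n=3):
    """Return the set of character n-grams of a term.

    Terms no longer than n are kept whole as a single gram.
    """
    term = term.lower()
    if len(term) <= n:
        return {term}
    return {term[i:i + n] for i in range(len(term) - n + 1)}


class NGramIndex:
    """Hypothetical toy index: n-gram -> terms, term -> document ids."""

    def __init__(self, n=3):
        self.n = n
        self.gram_to_terms = defaultdict(set)
        self.term_to_docs = defaultdict(set)

    def add(self, doc_id, text):
        # Assumed tokenization: lowercased runs of word characters.
        for term in re.findall(r"\w+", text.lower()):
            self.term_to_docs[term].add(doc_id)
            for gram in char_ngrams(term, self.n):
                self.gram_to_terms[gram].add(term)

    def search(self, fragment):
        """Return ids of documents containing a term that includes fragment."""
        fragment = fragment.lower()
        if len(fragment) < self.n:
            # Too short for an n-gram lookup: fall back to scanning all terms.
            candidates = set(self.term_to_docs)
        else:
            grams = char_ngrams(fragment, self.n)
            candidates = set.intersection(
                *(self.gram_to_terms.get(g, set()) for g in grams)
            )
        docs = set()
        for term in candidates:
            # Sharing all n-grams does not guarantee a contiguous match,
            # so each candidate is verified with a substring test.
            if fragment in term:
                docs |= self.term_to_docs[term]
        return docs


index = NGramIndex()
index.add(1, "Britney Spears released an album")
index.add(2, "Great Britain is an island")
index.add(3, "Paris is in France")
print(index.search("brit"))
```

The trade-off is the one described above: the extra n-gram postings make the index larger, but a query only touches the terms that share its n-grams instead of scanning the whole vocabulary.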

For wildcard queries and fuzzy search, have a look at the Lucene query parser syntax documentation.
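
To illustrate what wildcard and fuzzy matching do conceptually, here is a small sketch in plain Python that runs both against a list of terms. It is not how Lucene implements them; the function names wildcard_match, edit_distance and fuzzy_match are hypothetical, and plain Levenshtein distance with a threshold of one edit is an assumption chosen for the example.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, terms):
    """Return the terms matching a shell-style pattern such as '*brit*'."""
    pattern = pattern.lower()
    return sorted(t for t in terms if fnmatchcase(t.lower(), pattern))


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def fuzzy_match(query, terms, max_edits=1):
    """Return the terms within max_edits edits of the query (assumed threshold)."""
    query = query.lower()
    return sorted(t for t in terms if edit_distance(query, t.lower()) <= max_edits)


vocabulary = ["britney", "britain", "brighton", "paris"]
print(wildcard_match("*brit*", vocabulary))
print(fuzzy_match("britnei", vocabulary))
```

Both functions scan every term, which is fine for a toy vocabulary but shows why a leading wildcard is expensive on a real index: there is no prefix to narrow down the terms that have to be checked.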

csupnig answered Oct 07 '22 04:10
