
How do I index and search text files in Lucene 3.0.2?

I am a newbie to Lucene, and I'm having some problems writing simple code to query a collection of text files.

I tried this example, but it is incompatible with the new version of Lucene.

UPDATE: This is my new code, but it still doesn't work.

celsowm asked Nov 03 '10 20:11



2 Answers

Lucene is a big topic with many classes and methods, and you normally cannot use it without understanding at least a few basic concepts. If you need a ready-made search service quickly, use Solr instead. If you need full control over Lucene, read on. I will cover some core Lucene concepts and the classes that represent them. (For information on how to read text files into memory, see, for example, this article.)
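
Reading the files does not need anything from Lucene. A minimal sketch, assuming the files are UTF-8 encoded and small enough to hold in memory (the class name TextFileReader and the method readFile are hypothetical, not taken from the linked article):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class TextFileReader {
    // Reads the whole file into memory and decodes it as UTF-8.
    // The encoding is an assumption; adjust it to match your files.
    static String readFile(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

The returned string is what you would later pass as the field text when building a document.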

Whatever you do in Lucene, indexing or searching, you need an analyzer. The job of an analyzer is to tokenize your input text (break it into words) and, for stemming analyzers such as SnowballAnalyzer, to stem it (reduce each word to its base form). It usually also throws out the most frequent words, such as "a" and "the". You can find analyzers for more than 20 languages, or you can use SnowballAnalyzer and pass the language as a parameter.
To create an instance of SnowballAnalyzer for English, use this:

Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_30, "English");
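
To make the three steps concrete, here is a toy analyzer in plain Java. It is only an illustration of the idea, not how SnowballAnalyzer is implemented: the class ToyAnalyzer, its stop-word list and its two suffix rules are all made up for this sketch, and it assumes ASCII English text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class ToyAnalyzer {
    // A tiny, hypothetical stop-word list; real analyzers ship much longer ones.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and", "is"));

    // Lowercases the text, splits it on non-alphanumeric characters,
    // drops stop words and stems whatever is left.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(stem(raw));
        }
        return tokens;
    }

    // Naive suffix stripping; a real stemmer applies far more rules.
    static String stem(String word) {
        if (word.length() > 5 && word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```

Because the same reduction is applied to both the indexed text and the query, a search for "books" can match a document that only contains "book".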

If you are going to index texts in different languages and want to select the analyzer automatically, you can use Apache Tika's LanguageIdentifier.

You need to store your index somewhere. There are two main options: an in-memory index, which is easy to try out, and an on-disk index, which is the most widely used.
Use one of the following two lines:

Directory directory = new RAMDirectory();   // RAM index storage
Directory directory = FSDirectory.open(new File("/path/to/index"));  // disk index storage

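When you want to add, update or delete a document, you need an IndexWriter. In the constructor call below, true tells Lucene to create a new index (overwriting any existing one), and MaxFieldLength(25000) caps the number of terms indexed per field: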

IndexWriter writer = new IndexWriter(directory, analyzer, true, new IndexWriter.MaxFieldLength(25000));

Any document (a text file in your case) is a set of fields. To create a document that holds information about your file, use this:

Document doc = new Document();
String title = nameOfYourFile;
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));  // adding title field
String content = contentsOfYourFile;
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED)); // adding content field
writer.addDocument(doc);  // writing new document to the index

The Field constructor takes the field's name, its text and at least two more parameters. The first is a flag that tells Lucene whether to store this field. If it equals Field.Store.YES, you will be able to get all of your text back from the index; otherwise only index information about it will be stored.
The second parameter tells Lucene whether to index this field. Use Field.Index.ANALYZED for any field you are going to search on.
Normally, you use both parameters as shown above.
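
The difference between storing and indexing is easier to see in a toy model. The sketch below is not Lucene code: ToyIndex is a hypothetical single-field index in plain Java that keeps an inverted index (term to document ids) for searching and, only when asked, the original text for retrieval.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyIndex {
    // Inverted index: term -> ids of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Stored values: document id -> original text, kept only on request.
    private final Map<Integer, String> stored = new HashMap<>();
    private int nextId = 0;

    // Indexes the text and optionally stores it; returns the new document id.
    int add(String text, boolean store) {
        int id = nextId++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
        }
        if (store) {
            stored.put(id, text);
        }
        return id;
    }

    // Returns the ids of all documents containing the term (possibly empty).
    Set<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new TreeSet<>() : ids;
    }

    // Returns the original text, or null if the document was not stored.
    String storedText(int id) {
        return stored.get(id);
    }
}
```

A document added with store set to false can still be found by search, but storedText returns null for it, which mirrors what you get from a field that is indexed but not stored.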

Don't forget to close your IndexWriter after the job is done:

writer.close();

Searching is a bit trickier. You will need several classes: Query and QueryParser to build a Lucene query from a string, IndexSearcher to do the actual searching, TopScoreDocCollector to store the results (it is passed to IndexSearcher as a parameter) and ScoreDoc to iterate through the results. The next snippet shows how they fit together:

IndexSearcher searcher = new IndexSearcher(directory);
QueryParser parser = new QueryParser(Version.LUCENE_30, "content", analyzer);
Query query = parser.parse("terms to search");
TopScoreDocCollector collector = TopScoreDocCollector.create(HOW_MANY_RESULTS_TO_COLLECT, true);
searcher.search(query, collector);

ScoreDoc[] hits = collector.topDocs().scoreDocs;
// hits[i].doc is Lucene's internal document number. Note that this number may change after documents are deleted
for (int i = 0; i < hits.length; i++) {
    Document hitDoc = searcher.doc(hits[i].doc);  // getting actual document
    System.out.println("Title: " + hitDoc.get("title"));
    System.out.println("Content: " + hitDoc.get("content"));
    System.out.println();
}

Note the second argument to the QueryParser constructor: it is the default field, i.e. the field that is searched when no qualifier is given. For example, if your query is "title:term", Lucene will search for the word "term" in the field "title" of all docs, but if your query is just "term", it will search in the default field, in this case "content". For more info see Lucene Query Syntax.
QueryParser also takes an analyzer as its last argument. This must be the same analyzer you used to index your text.
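
The default-field rule can be sketched in a few lines of plain Java. This is a simplification for illustration only: the hypothetical FieldedTerm class handles a single "field:term" pair, while the real query syntax supports much more (phrases, boolean operators, ranges and so on).

```java
final class FieldedTerm {
    final String field;
    final String term;

    FieldedTerm(String field, String term) {
        this.field = field;
        this.term = term;
    }

    // Splits "field:term" at the first colon; falls back to defaultField
    // when the query carries no usable qualifier.
    static FieldedTerm parse(String query, String defaultField) {
        int colon = query.indexOf(':');
        if (colon <= 0) {
            // No qualifier: the whole string is searched in the default field.
            return new FieldedTerm(defaultField, query.trim());
        }
        return new FieldedTerm(query.substring(0, colon).trim(),
                               query.substring(colon + 1).trim());
    }
}
```

With "content" as the default field, "title:term" resolves to the field "title", while a bare "term" resolves to the field "content".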

The last thing you need to know about is the first parameter of TopScoreDocCollector.create. It is just a number that says how many results you want to collect. For example, if it equals 100, Lucene will keep only the 100 highest-scoring results and drop the rest. This is purely an optimization: you collect the best results, and if you are not satisfied with them, you repeat the search with a larger number.
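
Keeping only the best N hits is typically done with a bounded min-heap, so memory stays proportional to N instead of to the number of matches. The sketch below shows that general technique in plain Java; TopN and Hit are hypothetical names, and this is not a copy of Lucene's collector.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class TopN {
    // A scored match: document number plus relevance score.
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    // Returns the n highest-scoring hits, best first.
    static List<Hit> collect(List<Hit> all, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap on score: the weakest hit kept so far sits at the head.
        PriorityQueue<Hit> heap =
                new PriorityQueue<>(n, (a, b) -> Float.compare(a.score, b.score));
        for (Hit h : all) {
            if (heap.size() < n) {
                heap.add(h);
            } else if (h.score > heap.peek().score) {
                heap.poll();   // evict the weakest hit
                heap.add(h);
            }
        }
        List<Hit> best = new ArrayList<>(heap);
        best.sort((a, b) -> Float.compare(b.score, a.score));
        return best;
    }
}
```

Every hit beyond the first n costs at most one comparison and one heap update, which is why asking for fewer results makes the search cheaper.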

Finally, don't forget to close the searcher and the directory so that you don't leak system resources:

searcher.close();
directory.close();

EDIT: Also see the IndexFiles demo class in the Lucene 3.0 sources.

ffriend answered Oct 20 '22 03:10


package org.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopScoreDocCollector;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneSimple {

    // Adds one document whose only field, "title", holds the given text.
    private static void addDoc(IndexWriter w, String value) throws IOException {
        Document doc = new Document();
        doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
        w.addDocument(doc);
    }

    public static void main(String[] args) throws IOException, ParseException {

        // Directory with the text files to index
        File dir = new File("F:/tmp/dir");

        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_30);

        // 1. index
        Directory index = new RAMDirectory();
        // On-disk alternative (needs org.apache.lucene.store.FSDirectory):
        // Directory index = FSDirectory.open(new File("lucDirHello"));

        IndexWriter w = new IndexWriter(index, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
        w.setRAMBufferSizeMB(200);
        System.out.println(index.getClass() + " RamBuff:" + w.getRAMBufferSizeMB());

        addDoc(w, "Lucene in Action");
        addDoc(w, "Lucene for Dummies");
        addDoc(w, "Managing Gigabytes");
        addDoc(w, "The Art of Computer Science");
        addDoc(w, "Computer Science ! what is that ?");

        // Index every sufficiently long line of every file as its own document
        File[] files = dir.listFiles();
        if (files == null) {
            throw new IOException("Not a readable directory: " + dir);
        }
        long n = 0L;
        for (File f : files) {
            if (!f.isFile()) {
                continue;
            }
            BufferedReader br = new BufferedReader(new FileReader(f));
            try {
                String line;
                while ((line = br.readLine()) != null) {
                    if (line.length() < 140) {
                        continue;  // skip lines shorter than 140 characters
                    }
                    addDoc(w, line);
                    ++n;
                }
            } finally {
                br.close();
            }
        }
        System.out.println("Indexed " + n + " lines from files.");

        w.close();

        // 2. query
        String querystr = "Computer";
        Query q = new QueryParser(Version.LUCENE_30, "title", analyzer).parse(querystr);

        // 3. search
        int hitsPerPage = 10;
        IndexSearcher searcher = new IndexSearcher(index, true);  // true = read-only
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        searcher.search(q, collector);
        ScoreDoc[] hits = collector.topDocs().scoreDocs;

        // 4. display results
        System.out.println("Found " + hits.length + " hits.");
        for (int i = 0; i < hits.length; ++i) {
            int docId = hits[i].doc;
            Document d = searcher.doc(docId);
            System.out.println((i + 1) + ". " + d.get("title"));
        }

        searcher.close();
    }
}
smartnut007 answered Oct 20 '22 03:10