
Is it possible to iterate through documents stored in Lucene Index?

I have some documents stored in a Lucene index, each with a docId field. I want to get all the docIds stored in the index. There is one complication: the index holds about 300,000 documents, so I would prefer to fetch the docIds in chunks of 500. Is this possible?

Eugeniu Torica asked Feb 22 '10 15:02




2 Answers

    IndexReader reader = // create IndexReader
    for (int i=0; i<reader.maxDoc(); i++) {
        if (reader.isDeleted(i))
            continue;

        Document doc = reader.document(i);
        String docId = doc.get("docId");

        // do something with docId here...
    }
bajafresh4life answered Sep 22 '22 07:09


Lucene 4

    Bits liveDocs = MultiFields.getLiveDocs(reader);
    for (int i=0; i<reader.maxDoc(); i++) {
        if (liveDocs != null && !liveDocs.get(i))
            continue;

        Document doc = reader.document(i);
    }

See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html
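Neither loop shows the "chunks of 500" part of the question, so here is a minimal sketch of one way to batch the same doc-number walk. It is not taken from either answer and deliberately avoids the Lucene API so that it compiles on its own: the class name `DocIdChunker`, the method `forEachChunk`, and the `isLive` / `docIdOf` callbacks are hypothetical stand-ins for the reader calls in the loops above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntFunction;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Lucene.
public class DocIdChunker {

    /**
     * Walks doc numbers 0..maxDoc-1, skips documents that are not live,
     * and passes the docIds of live documents to the consumer in lists
     * of at most chunkSize entries.
     */
    public static void forEachChunk(int maxDoc,
                                    IntPredicate isLive,          // stand-in for the deleted/liveDocs check
                                    IntFunction<String> docIdOf,  // stand-in for reading the docId field
                                    int chunkSize,
                                    Consumer<List<String>> consumer) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunk = new ArrayList<>(chunkSize);
        for (int i = 0; i < maxDoc; i++) {
            if (!isLive.test(i)) {
                continue; // deleted document, skipped as in the loops above
            }
            chunk.add(docIdOf.apply(i));
            if (chunk.size() == chunkSize) {
                consumer.accept(chunk);
                // Start a fresh list so the consumer may keep the old one.
                chunk = new ArrayList<>(chunkSize);
            }
        }
        if (!chunk.isEmpty()) {
            consumer.accept(chunk); // final, possibly shorter, chunk
        }
    }
}
```

With the first answer's reader, `isLive` would correspond to `i -> !reader.isDeleted(i)` and `docIdOf` to reading `reader.document(i).get("docId")`; since `reader.document(i)` throws a checked `IOException`, that lambda would need to wrap the exception. With 300,000 live documents and a chunk size of 500, the consumer would be called 600 times.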

bcoughlan answered Sep 22 '22 07:09