I have some documents stored in a Lucene index with a docId field, and I want to get all the docIds stored in the index. There is one complication: the index holds about 300,000 documents, so I would prefer to get the docIds in chunks of 500. Is that possible?
In a nutshell, when Lucene indexes a document it breaks it down into a number of terms. It then stores the terms in an index file, where each term is associated with the documents that contain it. You could think of it as a bit like a hash table.
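The term-to-documents mapping described above can be sketched with plain Java collections. This is only a toy illustration: the class name ToyInvertedIndex and its add and search methods are made up for this example, the tokenizer is a naive split on non-word characters, and none of it reflects how Lucene actually lays out its index files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Toy inverted index (hypothetical class, not part of Lucene):
// maps each term to the sorted set of document numbers that contain it.
final class ToyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    private int nextDocNumber = 0;

    // Splits the text into lowercase terms and records the new
    // document number under every term. Returns that document number.
    int add(String text) {
        int docNumber = nextDocNumber++;
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docNumber);
            }
        }
        return docNumber;
    }

    // Returns the document numbers containing the term, in ascending order.
    List<Integer> search(String term) {
        SortedSet<Integer> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        index.add("Lucene is a Java library");
        index.add("Lucene builds an inverted index");
        System.out.println(index.search("lucene"));
        System.out.println(index.search("java"));
    }
}
```

A lookup is a single map access on the term, which is the hash-table-like behavior the paragraph above alludes to: the cost depends on the number of matching documents, not on the total number of documents indexed.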
Why is Lucene faster? Lucene is very fast at searching because it uses an inverted index. Most data sources, by contrast, structure their data as objects or records, which in turn have fields and values.
But the more general answer is that they use an inverted index. The specifics of how Lucene stores it can be found in the file formats documentation (as milan said). The general idea is that they store an inverted index data structure, plus other auxiliary data structures that help answer queries quickly.
Lucene is not a database — as I mentioned earlier, it's just a Java library.
    IndexReader reader = // create IndexReader
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i))
            continue;

        Document doc = reader.document(i);
        String docId = doc.get("docId");

        // do something with docId here...
    }
Lucene 4
    Bits liveDocs = MultiFields.getLiveDocs(reader);
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (liveDocs != null && !liveDocs.get(i))
            continue;

        Document doc = reader.document(i);
    }
See LUCENE-2600 on this page for details: https://lucene.apache.org/core/4_0_0/MIGRATE.html
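The loops above walk the whole index in one pass. To get the chunks of 500 that the question asks for, one option is to run the same loop over a window of document numbers at a time. The sketch below uses plain Java only, so that it runs without Lucene on the classpath: DocIdChunker, readChunk, isLive and loadDocId are hypothetical names, and the two callbacks stand in for the deletion check and for reader.document(i).get("docId"). With a real reader, the loader lambda would also have to handle the checked IOException that reader.document can throw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.IntPredicate;

// Hypothetical helper, not part of Lucene: reads docIds one window of
// document numbers at a time instead of in a single pass.
final class DocIdChunker {

    // Collects docId values for document numbers in
    // [start, min(start + chunkSize, maxDoc)), skipping deleted documents.
    static List<String> readChunk(int start, int chunkSize, int maxDoc,
                                  IntPredicate isLive, IntFunction<String> loadDocId) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        // Compute the window end in long arithmetic to avoid int overflow.
        int end = (int) Math.min((long) start + chunkSize, maxDoc);
        List<String> chunk = new ArrayList<>();
        for (int i = start; i < end; i++) {
            if (isLive.test(i)) {
                chunk.add(loadDocId.apply(i));
            }
        }
        return chunk;
    }

    public static void main(String[] args) {
        int maxDoc = 1200;   // stand-in for reader.maxDoc()
        int chunkSize = 500;
        for (int start = 0; start < maxDoc; start += chunkSize) {
            List<String> chunk = readChunk(start, chunkSize, maxDoc,
                    i -> i % 100 != 0,   // pretend every 100th document is deleted
                    i -> "doc-" + i);    // stand-in for reading the docId field
            System.out.println("chunk starting at " + start + ": " + chunk.size() + " ids");
        }
    }
}
```

Because each window covers 500 document numbers rather than 500 live documents, a chunk can contain fewer than 500 docIds when the index has deletions. If exactly 500 per chunk matters, the loop would instead need to keep advancing until 500 live documents have been collected.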