I am using Lucene to store (as well as index) various documents.
Each document needs a persistent unique identifier (to be used as part of a URL).
If I were using a SQL database, I could use an auto_increment integer primary key
(or similar) field to automatically generate a unique id for every record that was added.
Is there any way of doing this with Lucene?
I am aware that documents in Lucene are numbered, but have noted that these numbers are reallocated over time.
(I'm using the Java version of Lucene 3.0.3.)
Simply put, Lucene uses an “inverted index” of the data: instead of mapping pages to keywords, it maps keywords to pages, just like the index at the back of a book. This allows for faster search responses, because a query looks terms up in the index instead of scanning the text directly.
Lucene uses a well-known index structure called an inverted index. Quite simply, and probably unsurprisingly, an inverted index is an inside-out arrangement of documents in which terms take center stage. Each term refers to the documents that contain it.
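To make the "terms refer to documents" idea concrete, here is a toy sketch in plain Java. It is not how Lucene stores its index internally; the class name TinyInvertedIndex and its methods are made up for illustration, and tokenization is reduced to splitting on anything that is not a letter or digit.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index (hypothetical, not Lucene's implementation):
// maps each term to the sorted set of document numbers that contain it.
final class TinyInvertedIndex {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Lowercases the text, splits it on non-alphanumeric characters,
    // and records docId under every resulting term.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the documents containing the term, or an empty set if there are none.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }
}
```

A lookup is a single map access on the term, which is why searching the index is cheaper than scanning every document.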
As larsmans said, you need to store this in a separate field. I suggest that you make the field indexed as well as stored, and index it using a KeywordAnalyzer. You can keep a counter in memory and update it for each new document.
What remains is the problem of persistence - how to store the maximal id when the Lucene process stops. One possibility is to use a text file which saves the maximal id.
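The counter-plus-text-file idea could look roughly like the sketch below. It uses only the JDK and no Lucene classes; FileBackedIdCounter and nextId are hypothetical names, and the file location is whatever path you pass in. It assumes a single process owns the file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Lucene): an in-memory counter whose
// highest issued id is saved to a small text file.
final class FileBackedIdCounter {
    private final Path file;
    private long maxId;

    FileBackedIdCounter(Path file) {
        this.file = file;
        try {
            // Resume from the saved value, or start at 0 if the file does not exist yet.
            this.maxId = Files.exists(file)
                    ? Long.parseLong(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim())
                    : 0L;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the next id. The new maximum is written to the file before the id
    // is handed out, so a restart does not reissue it. A crash in the middle of
    // the write is not handled in this sketch.
    synchronized long nextId() {
        long next = maxId + 1;
        try {
            Files.write(file, Long.toString(next).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        maxId = next;
        return next;
    }
}
```

Writing the file on every id is the simplest choice but costs one disk write per document. If that is too slow, one option is to save the maximum only every N ids and skip ahead by N on startup.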
I believe Flexible Indexing will allow you to add the maximal id to the index as a "global" field. If you are willing to work with Lucene's trunk, you can try flexible indexing to see whether it fits the bill.
For similar situations, I use the following algorithm (it has nothing to do with Lucene, but you can use it anyway).
1. Create an AtomicLong. Start with an initial value obtained from System.currentTimeMillis() or System.nanoTime().
2. Obtain each new id by calling .incrementAndGet or .getAndIncrement on that AtomicLong.
3. After a restart, the AtomicLong is again initialized to the current timestamp during startup.
Pros: simple, effective, thread-safe, non-blocking. If you need clustered id support, just add space for a hi/lo algorithm on top of the existing long or sacrifice some high bytes.
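Cons: does not work if the average rate of adding new entities is more than 1/ms (for System.currentTimeMillis()) or 1/ns (for System.nanoTime()), because the counter then runs ahead of the clock and a restart could reissue ids. Does not tolerate clock abnormalities. Note also that System.nanoTime() counts from an arbitrary origin rather than wall-clock time, so its values are not comparable across JVM restarts; System.currentTimeMillis() is the safer seed.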
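A minimal sketch of this timestamp-seeded counter, assuming System.currentTimeMillis() as the seed. The class name TimestampSeededIds and the extra constructor that accepts an explicit seed are additions for illustration and testing, not part of the answer above.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the timestamp-seeded id generator.
final class TimestampSeededIds {
    private final AtomicLong counter;

    // Normal use: seed once at startup with the wall-clock time in milliseconds.
    TimestampSeededIds() {
        this(System.currentTimeMillis());
    }

    // Explicit seed, mainly so the behaviour can be checked deterministically.
    TimestampSeededIds(long seed) {
        this.counter = new AtomicLong(seed);
    }

    // Thread-safe and non-blocking: each call returns a value one higher than the last.
    long nextId() {
        return counter.incrementAndGet();
    }
}
```

Create one instance when the application starts and share it between all indexing threads; creating a new instance per document would defeat the counter.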
You can also consider using a UUID as yet another alternative. The probability of a duplicate UUID is virtually nonexistent.
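For the UUID route, the JDK already provides what is needed. The sketch below wraps java.util.UUID.randomUUID() in a hypothetical helper named UuidIds; how the resulting string is stored in the index is left to you.

```java
import java.util.UUID;

// Hypothetical helper: produces a random (version 4) UUID as a string.
final class UuidIds {
    // The canonical form is 36 characters of hex digits and hyphens,
    // so it can be placed in a URL path without escaping.
    static String newId() {
        return UUID.randomUUID().toString();
    }
}
```

Unlike a counter, this needs no persisted state and no coordination between processes, at the cost of longer and non-sequential ids.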