
How do I generate a unique id using Lucene?

Tags: java, lucene

I am using Lucene to store (as well as index) various documents.

Each document needs a persistent unique identifier (to be used as part of a URL).

If I were using a SQL database, I could use an auto-incrementing integer primary key (auto_increment or similar) to automatically generate a unique id for every record that was added.

Is there any way of doing this with Lucene?

I am aware that documents in Lucene are numbered, but have noted that these numbers are reallocated over time.

(I'm using the Java version of Lucene 3.0.3.)

dave4420 asked Feb 20 '11 18:02



2 Answers

As larsmans said, you need to store this in a separate field. I suggest that you make the field indexed as well as stored, and index it using a KeywordAnalyzer. You can keep a counter in memory and update it for each new document.

What remains is the problem of persistence - how to store the maximal id when the Lucene process stops. One possibility is to use a text file which saves the maximal id.
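A minimal sketch of the text-file approach, assuming Java 7 or later for java.nio.file. The class name FileBackedIdCounter and the file format (a single decimal number) are made up for illustration and are not part of Lucene; the value returned by nextId() is what you would put into the stored id field.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, not part of Lucene: an in-memory counter whose
// latest value is saved to a text file so that it survives restarts.
class FileBackedIdCounter {
    private final Path file;
    private long lastId;

    FileBackedIdCounter(Path file) throws IOException {
        this.file = file;
        // Resume from the saved maximal id, or start at 0 on the first run.
        this.lastId = Files.exists(file)
                ? Long.parseLong(new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim())
                : 0L;
    }

    // Persists the new maximal id first, then hands it out.
    synchronized long nextId() throws IOException {
        long next = lastId + 1;
        // Write to a temporary file and move it over the old one, so a
        // crash in the middle of a write does not leave a half-written number.
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        Files.write(tmp, Long.toString(next).getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
        lastId = next;
        return next;
    }
}
```

Writing the file on every call is the simplest option but not the fastest. If that turns out to be too slow, one variation is to reserve ids in blocks and persist only the upper bound of each block, at the cost of gaps in the sequence after a restart.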

I believe Flexible Indexing will allow you to add the maximal id to the index as a "global" field. If you are willing to work with Lucene's trunk, you can try flexible indexing to see whether it fits the bill.

Yuval F answered Oct 11 '22 12:10


For similar situations, I use the following algorithm (it has nothing to do with Lucene, but you can use it anyway).

  • Create a new AtomicLong, initialized with the value of System.currentTimeMillis() or System.nanoTime().
  • Generate each subsequent ID by calling incrementAndGet() or getAndIncrement() on that AtomicLong.
  • If the system is restarted, the AtomicLong is initialized again to the current timestamp during startup.
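The steps above can be sketched roughly as follows. The class name TimestampSeededIds is hypothetical, and this version seeds only from System.currentTimeMillis(); the second constructor exists just so the seed can be fixed for testing.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical class name; a sketch of a timestamp-seeded id counter.
class TimestampSeededIds {
    // Seeded once at startup; a restart creates a new instance and
    // therefore re-seeds from the (later) current timestamp.
    private final AtomicLong counter;

    TimestampSeededIds() {
        this(System.currentTimeMillis());
    }

    // Explicit seed, mainly useful for tests.
    TimestampSeededIds(long seed) {
        this.counter = new AtomicLong(seed);
    }

    // Thread-safe and non-blocking: each caller gets a distinct value.
    long nextId() {
        return counter.incrementAndGet();
    }
}
```

A single shared instance would be created at startup, and nextId() called once for every new document.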

Pros: simple, effective, thread-safe, non-blocking. If you need clustered id support, just add space for a hi/lo algorithm on top of the existing long, or sacrifice some of its high bytes.

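Cons: after a restart, new IDs can collide with earlier ones if entities have been added at an average rate of more than 1/ms (for System.currentTimeMillis()) or 1/ns (for System.nanoTime()), because the counter will have run ahead of the clock. It also does not tolerate clock abnormalities, such as the system clock being set back. Note too that System.nanoTime() counts from an arbitrary origin that is unrelated to wall-clock time, so its values are not guaranteed to keep increasing across JVM restarts.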

You can also consider using a UUID as yet another alternative. The probability of a duplicate UUID is virtually nonexistent.
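For the UUID route, a small sketch using java.util.UUID from the standard library. The wrapper class UuidIds is hypothetical; the string form is 36 characters long and can be used directly in a URL.

```java
import java.util.UUID;

// Hypothetical wrapper; UUID.randomUUID() produces a random (version 4) UUID.
class UuidIds {
    static String newId() {
        return UUID.randomUUID().toString();
    }
}
```

Unlike a counter, this needs no state to be saved between restarts, but the ids are longer and carry no ordering.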

mindas answered Oct 11 '22 10:10