When creating a document in my Lucene index (v7.2), I add a uid
field to it which contains a unique id/key (string):
doc.add(new StringField("uid",uid,Field.Store.YES))
To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:
searcher.search(new TermQuery(new Term("uid",uid)),1)
Being a Lucene "novice", I would like to know the following:
How should I improve this approach to get the best lookup performance? Would it, for example, make a difference if I store the unique id as a byte array instead of as a string? Or are there some special codecs or filters that can be used?
What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?
Lucene supports fielded data. When performing a search you can either specify a field, or use the default field. The field names and default field is implementation specific. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead. This would be the equivalent of retrieving pages in a book related to a keyword by searching the index at the back of a book, as opposed to searching the words in each page of the book.
A field is a section of a Document. Each field has three parts: name, type and value. Values may be text (String, Reader or pre-analyzed TokenStream), binary (byte[]), or numeric (a Number). Fields are optionally stored in the index, so that they may be returned with hits on the document.
A Lucene Index Is an Inverted Index An index may store a heterogeneous set of documents, with any number of different fields that may vary by a document in arbitrary ways. Lucene indexes terms, which means that Lucene search searches over terms. A term combines a field name with a token.
There is a blog post about Lucene term index and lookup performance. It clearly reveals all the details of complexity of looking up a document by id. This post is quite old, but nothing was changed since then.
Here is some highlights related to your question:
Let term be a single string and let term dictionary be a large set of terms. If we have a term dictionary, and we need to know whether a single term is inside the dictionary, the trie (and minimal deterministic acyclic finite state automaton (DAFSA) as a subclass) is the data structure that can help us. On your question: “Why use tries if a hash lookup can do the same?”, here are a few reasons:
Here is an example of DAFSA for 3 terms: bath, bat and batch:
In case of key lookup, notice that lowering a single level in the automata (or trie) is done in constant time, and every time that the algorithm lowers a single level in the automata (trie), a single character is cut from the term, so we can conclude that finding a term in a automata (trie) can be done in O(L) time.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With