 

How to define a primary key field in a Lucene document to get the best lookup performance?

Tags:

lucene

When creating a document in my Lucene index (v7.2), I add a uid field to it which contains a unique id/key (string):

doc.add(new StringField("uid", uid, Field.Store.YES));

To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:

searcher.search(new TermQuery(new Term("uid", uid)), 1);

Being a Lucene "novice", I would like to know the following:

  1. How should I improve this approach to get the best lookup performance? Would it, for example, make a difference if I store the unique id as a byte array instead of as a string? Or are there some special codecs or filters that can be used?

  2. What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?
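
For reference, here is a minimal end-to-end sketch of the setup described above (assuming Lucene 7.x on the classpath; the directory choice and the "doc-42" id are illustrative, not from the question):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class UidLookupDemo {

    // Index one document with a unique id, then look it up by exact term match.
    public static String indexAndLookup() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            // StringField is indexed as a single, un-analyzed term: suitable for exact-match ids.
            doc.add(new StringField("uid", "doc-42", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("uid", "doc-42")), 1);
            if (hits.totalHits == 0) {
                return null;
            }
            // Retrieve the stored field value from the matching document.
            return searcher.doc(hits.scoreDocs[0].doc).get("uid");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexAndLookup());
    }
}
```
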

asked Jan 01 '18 by xpages-noob

People also ask

How do you search in Lucene?

Lucene supports fielded data. When performing a search you can either specify a field, or use the default field. The field names and default field is implementation specific. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
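
For example, with the classic query parser and a default field of (say) body, the syntax looks like this:

```
title:lucene          matches the term "lucene" in the title field
title:"quick brown"   phrase query against the title field
lucene                matches "lucene" in the default field (here: body)
```
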

How does Lucene index search work?

Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead. This would be the equivalent of retrieving pages in a book related to a keyword by searching the index at the back of a book, as opposed to searching the words in each page of the book.

What is Lucene field?

A field is a section of a Document. Each field has three parts: name, type and value. Values may be text (String, Reader or pre-analyzed TokenStream), binary (byte[]), or numeric (a Number). Fields are optionally stored in the index, so that they may be returned with hits on the document.
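
As an illustration (assuming Lucene 7.x; the field names and values are made up), a single document can mix field types with different indexing and storage behavior:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldDemo {

    public static Document build() {
        Document doc = new Document();
        doc.add(new StringField("uid", "doc-42", Field.Store.YES));   // one un-analyzed term, stored
        doc.add(new TextField("body", "full text to analyze", Field.Store.NO)); // tokenized, not stored
        doc.add(new IntPoint("price", 100));      // numeric, indexed for point/range queries
        doc.add(new StoredField("price", 100));   // stored copy so the value can be returned with hits
        return doc;
    }
}
```
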

What is a Lucene index?

A Lucene Index Is an Inverted Index An index may store a heterogeneous set of documents, with any number of different fields that may vary by a document in arbitrary ways. Lucene indexes terms, which means that Lucene search searches over terms. A term combines a field name with a token.


1 Answer

Theory

There is a blog post about Lucene's term index and lookup performance that explains in detail the complexity of looking up a document by id. The post is quite old, but little has changed since then.

Here are some highlights related to your question:

  • Lucene is a search engine whose minimum unit of retrieval is a text term; this means that binary, numeric and string fields are all represented as terms in the BlockTree terms dictionary.
  • In general, lookup complexity depends on the term's length: Lucene uses an in-memory prefix-trie index structure to perform term lookups. Because of real-world hardware and software constraints (to avoid superfluous disk reads and memory overflow for extremely large tries), Lucene uses a BlockTree structure: it stores the prefix trie in small chunks on disk and loads only one chunk at a time. This is why it is so important to generate keys in an easy-to-read order. Arranged by degree of influence, the factors are:
    • term length - longer terms mean more chunks to load
    • term pattern - a predictable pattern avoids superfluous reads
    • term count - fewer terms mean fewer chunks
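
A practical consequence of the factors above (an optimization hint, not something Lucene requires): generate ids in increasing order with long shared prefixes, e.g. a zero-padded timestamp plus a counter, rather than random UUIDs. A pure-Java sketch, with hypothetical class and method names:

```java
import java.util.concurrent.atomic.AtomicLong;

public class OrderedIdGenerator {

    private final AtomicLong counter = new AtomicLong();

    // Both parts are zero-padded to a fixed width, so the ids sort
    // lexicographically in creation order and share long common prefixes.
    public String nextId() {
        long millis = System.currentTimeMillis();
        long seq = counter.getAndIncrement();
        return String.format("%013d-%08d", millis, seq);
    }
}
```

Ids produced this way cluster into the same BlockTree chunks, whereas random UUIDs scatter lookups across the whole terms dictionary.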

Algorithms and Complexity

Let a term be a single string and let a term dictionary be a large set of terms. If we have a term dictionary and need to know whether a single term is in it, a trie (or its compacted relative, the minimal deterministic acyclic finite state automaton, DAFSA) is a data structure that can help. To the question "Why use tries if a hash lookup can do the same?", here are a few reasons:

  • A trie can find a string in O(L) time (where L is the length of the term). This is faster than a hash table in the worst case (a hash table may need a linear scan on hash collisions, on top of a non-trivial hash function such as MurmurHash3), and comparable to a hash table in the best case.
  • A hash table can only find terms that exactly match the term we are looking for, whereas a trie also lets us find terms that differ by a single character, share a common prefix, have a character missing, etc.
  • The trie can provide an alphabetical ordering of the entries by key, so we can enumerate all terms in alphabetical order.
  • The trie (and especially DAFSA) provides a very compact representation of terms with deduplication.

Here is an example of a DAFSA for the 3 terms "bath", "bat" and "batch":

[Image: Example of a DAFSA data structure]

For key lookup, notice that descending one level in the automaton (or trie) takes constant time, and each descent consumes one character of the term; therefore finding a term in an automaton (trie) takes O(L) time.
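
To make the O(L) claim concrete, here is a minimal trie sketch in plain Java (an illustration, not Lucene's actual BlockTree implementation): lookup walks one node per character, independent of how many terms the structure holds.

```java
import java.util.HashMap;
import java.util.Map;

public class Trie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isTerm; // true if the path from the root to this node spells a full term
    }

    private final Node root = new Node();

    public void add(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    // O(L) where L = term.length(): one child lookup per character.
    public boolean contains(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isTerm;
    }
}
```

With the terms from the figure ("bat", "bath", "batch"), "bat" is found in 3 steps and "batch" in 5, regardless of dictionary size.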

answered Jan 02 '23 by Ivan Mamontov