 

How to define a primary key field in a Lucene document to get the best lookup performance?

Tags:

lucene

When creating a document in my Lucene index (v7.2), I add a uid field to it which contains a unique id/key (string):

doc.add(new StringField("uid", uid, Field.Store.YES));

To retrieve that document later on, I create a TermQuery for the given unique id and search for it with an IndexSearcher:

searcher.search(new TermQuery(new Term("uid", uid)), 1);

Being a Lucene "novice", I would like to know the following:

  1. How should I improve this approach to get the best lookup performance? Would it, for example, make a difference if I store the unique id as a byte array instead of as a string? Or are there some special codecs or filters that can be used?

  2. What is the time complexity of looking up a document by its unique id? Since the index contains at least one unique term for each document, the lookup times will increase linearly with the number of documents (O(n)), right?
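
For reference, here is a minimal end-to-end sketch of the setup described above (assuming Lucene 7.x on the classpath; the directory choice and the "doc-42" id are illustrative, not from the question):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public class UidLookupDemo {

    // Index one document with a unique id, then look it up by exact term match.
    public static String indexAndLookup() throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            Document doc = new Document();
            // StringField is indexed as a single, un-analyzed term: suitable for exact-match ids.
            doc.add(new StringField("uid", "doc-42", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("uid", "doc-42")), 1);
            if (hits.totalHits == 0) {
                return null;
            }
            // Retrieve the stored field value from the matching document.
            return searcher.doc(hits.scoreDocs[0].doc).get("uid");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexAndLookup());
    }
}
```
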

asked Jan 01 '18 by xpages-noob

People also ask

How do you search in Lucene?

Lucene supports fielded data. When performing a search you can either specify a field, or use the default field. The field names and default field is implementation specific. You can search any field by typing the field name followed by a colon ":" and then the term you are looking for.
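
For example, with the classic query parser and a default field of (say) body, the syntax looks like this:

```
title:lucene          matches the term "lucene" in the title field
title:"quick brown"   phrase query against the title field
lucene                matches "lucene" in the default field (here: body)
```
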

How does Lucene index search work?

Lucene is able to achieve fast search responses because, instead of searching the text directly, it searches an index instead. This would be the equivalent of retrieving pages in a book related to a keyword by searching the index at the back of a book, as opposed to searching the words in each page of the book.

What is Lucene field?

A field is a section of a Document. Each field has three parts: name, type and value. Values may be text (String, Reader or pre-analyzed TokenStream), binary (byte[]), or numeric (a Number). Fields are optionally stored in the index, so that they may be returned with hits on the document.
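
As an illustration (assuming Lucene 7.x; the field names and values are made up), a single document can mix field types with different indexing and storage behavior:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;

public class FieldDemo {

    public static Document build() {
        Document doc = new Document();
        doc.add(new StringField("uid", "doc-42", Field.Store.YES));   // one un-analyzed term, stored
        doc.add(new TextField("body", "full text to analyze", Field.Store.NO)); // tokenized, not stored
        doc.add(new IntPoint("price", 100));      // numeric, indexed for point/range queries
        doc.add(new StoredField("price", 100));   // stored copy so the value can be returned with hits
        return doc;
    }
}
```
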

What is a Lucene index?

A Lucene Index Is an Inverted Index An index may store a heterogeneous set of documents, with any number of different fields that may vary by a document in arbitrary ways. Lucene indexes terms, which means that Lucene search searches over terms. A term combines a field name with a token.


1 Answer

Theory

There is a blog post about Lucene's term index and lookup performance that explains in detail the complexity of looking up a document by id. The post is quite old, but little has changed since then.

Here are some highlights related to your question:

  • Lucene is a search engine whose minimum unit of retrieval is a text term; this means that binary, numeric and string fields are all represented as terms in the BlockTree terms dictionary.
  • In general, lookup complexity depends on the term's length: Lucene uses an in-memory prefix-trie index structure to perform term lookups. Because of real-world hardware and software constraints (to avoid superfluous disk reads and memory overflow for extremely large tries), Lucene uses a BlockTree structure: it stores the prefix trie in small chunks on disk and loads only one chunk at a time. This is why it is so important to generate keys in an easy-to-read order. Arranged by degree of influence, the factors are:
    • term length - longer terms mean more chunks to load
    • term pattern - a predictable pattern avoids superfluous reads
    • term count - fewer terms mean fewer chunks
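
A practical consequence of the factors above (an optimization hint, not something Lucene requires): generate ids in increasing order with long shared prefixes, e.g. a zero-padded timestamp plus a counter, rather than random UUIDs. A pure-Java sketch, with hypothetical class and method names:

```java
import java.util.concurrent.atomic.AtomicLong;

public class OrderedIdGenerator {

    private final AtomicLong counter = new AtomicLong();

    // Both parts are zero-padded to a fixed width, so the ids sort
    // lexicographically in creation order and share long common prefixes.
    public String nextId() {
        long millis = System.currentTimeMillis();
        long seq = counter.getAndIncrement();
        return String.format("%013d-%08d", millis, seq);
    }
}
```

Ids produced this way cluster into the same BlockTree chunks, whereas random UUIDs scatter lookups across the whole terms dictionary.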

Algorithms and Complexity

Let a term be a single string and let a term dictionary be a large set of terms. If we have a term dictionary and need to know whether a single term is in it, a trie (or its compacted relative, the minimal deterministic acyclic finite state automaton, DAFSA) is a data structure that can help. To the question "Why use tries if a hash lookup can do the same?", here are a few reasons:

  • A trie can find a string in O(L) time (where L is the length of the term). This is faster than a hash table in the worst case (a hash table may need a linear scan on hash collisions, on top of a non-trivial hash function such as MurmurHash3), and comparable to a hash table in the best case.
  • A hash table can only find terms that exactly match the term we are looking for, whereas a trie also lets us find terms that differ by a single character, share a common prefix, have a character missing, etc.
  • The trie can provide an alphabetical ordering of the entries by key, so we can enumerate all terms in alphabetical order.
  • The trie (and especially DAFSA) provides a very compact representation of terms with deduplication.

Here is an example of a DAFSA for the 3 terms "bath", "bat" and "batch":

[Image: Example of a DAFSA data structure]

For key lookup, notice that descending one level in the automaton (or trie) takes constant time, and each descent consumes one character of the term; therefore finding a term in an automaton (trie) takes O(L) time.
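
To make the O(L) claim concrete, here is a minimal trie sketch in plain Java (an illustration, not Lucene's actual BlockTree implementation): lookup walks one node per character, independent of how many terms the structure holds.

```java
import java.util.HashMap;
import java.util.Map;

public class Trie {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean isTerm; // true if the path from the root to this node spells a full term
    }

    private final Node root = new Node();

    public void add(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    // O(L) where L = term.length(): one child lookup per character.
    public boolean contains(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.isTerm;
    }
}
```

With the terms from the figure ("bat", "bath", "batch"), "bat" is found in 3 steps and "batch" in 5, regardless of dictionary size.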

answered Jan 02 '23 by Ivan Mamontov