I don't understand what they are, and would really appreciate a simple explanation showing what value they bring to the world without too much implementation detail of how they work.
Why is Lucene faster? Lucene is very fast at searching because it uses an inverted index. Most data sources structure their data as objects or records, each of which has fields and values.
Lucene uses a well-known index structure called an inverted index. Quite simply, and probably unsurprisingly, an inverted index is an inside-out arrangement of documents in which terms take center stage. Each term refers to the documents that contain it.
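To make the "terms take center stage" idea concrete, here is a minimal sketch in Python. It is not how Lucene implements its index; the function name `build_inverted_index`, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it.

    `docs` is a dict of {doc_id: text}. Tokenization is a plain whitespace
    split, which is far simpler than a real analyzer.
    """
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample documents, not taken from the text above.
sample_docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick dog",
}

index = build_inverted_index(sample_docs)

# Looking up a term is a single dictionary access instead of a scan
# over every document.
print(index["quick"])
print(index["dog"])
```

The point of the structure is that a query term leads straight to the documents containing it, so search cost depends on the number of matching documents rather than on the size of the whole collection.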
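Lucene scores relevance with the TF-IDF or BM25 algorithms; BM25 has been the default since Lucene 6, and TF-IDF was the default before that. Relevance factors are computed both when data is written and when it is searched. Applying a boost while data is being written is called index-time boosting.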
Lucene is a full-text search library in Java which makes it easy to add search functionality to an application or website. It does so by adding content to a full-text index.
A norm is part of the calculation of a score. The norm could be calculated however you like, really. The main thing that sets the norm apart is that it is calculated at index time. Generally, other factors influencing the score are calculated at query time, based on how well the document matches the query. The norm saves on query performance by being stored along with the document instead.
The standard implementation can be found, and is well described, in Lucene's TFIDFSimilarity. There, the norm is the product of the field boost (or the product of all field boosts, if several have been set on the field) and "lengthNorm" (a calculated factor designed to weigh matches on shorter documents more heavily). Neither of these depends on the makeup of the query, so both are good choices to be calculated and stored at index time.
They are then stored in a compressed, and highly lossy, single-byte format (with approx. 1 significant decimal digit of accuracy).
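The calculation described above can be sketched roughly as follows. This assumes the classic length norm of 1 / sqrt(number of terms in the field). The function names are hypothetical, and `lossy_round` is only a toy stand-in that keeps one significant decimal digit; it is not Lucene's actual single-byte encoding.

```python
import math


def length_norm(num_terms):
    """Shorter fields get a larger factor: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)


def compute_norm(field_boost, num_terms):
    """Index-time norm: field boost multiplied by the length norm."""
    return field_boost * length_norm(num_terms)


def lossy_round(value):
    """Keep roughly one significant decimal digit.

    Toy illustration of lossy storage only; the real encoding packs the
    value into a single byte using a different scheme.
    """
    if value == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(value)))
    factor = 10 ** exponent
    return round(value / factor) * factor


# A 5-term field with no boost versus a 20-term field with a boost of 2.
short_field = compute_norm(1.0, 5)
long_boosted_field = compute_norm(2.0, 20)

print(short_field, lossy_round(short_field))
print(long_boosted_field, lossy_round(long_boosted_field))
```

Neither input depends on the query, which is why the result can be computed once when the document is indexed and simply read back at search time.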
When you index (that is, process) your source information, you may want to treat some documents and fields as more important than others.
For example, say the task is to spy on your colleagues' emails. A word match in the title field is more important than a word match in the body field. We express this by multiplying the number of matches in the title field by a larger number than the one we use for body field matches.
+----+-------------+--------------+
| ID | Title | Body |
|----+-------------+--------------|
| 7 | Back Monday | Ben was sick |
| 8 | I'm sick | cover for me |
| 9 | Help | I am stuck |
+----+-------------+--------------+
So, searching for 'sick' and multiplying a title match by 4 and body match by 2 and ordering highest score first - the documents are ranked ID 8 first and ID 7 second (see table 1 below).
+----+---------+--------+-----------------------+
| Id | Title | Body | Score |
| | Matches | Matches| |
|----+---------+--------+-----------------------|
| 8 | 1 | 0 | (1 * 4) + (0 * 2) = 4 |
| 7 | 0 | 1 | (0 * 4) + (1 * 2) = 2 |
+----+---------+--------+-----------------------+
These numbers, 4 and 2, by which we multiply the match counts, are the norms.
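The arithmetic in the tables can be reproduced with a short Python sketch. The names `FIELD_WEIGHTS`, `count_matches`, `score`, and `search` are hypothetical, and the matching is a simple whole-word count, not what Lucene does internally.

```python
# Multipliers from the example: a title match counts 4, a body match counts 2.
FIELD_WEIGHTS = {"title": 4, "body": 2}

# The three emails from the first table.
DOCS = {
    7: {"title": "Back Monday", "body": "Ben was sick"},
    8: {"title": "I'm sick", "body": "cover for me"},
    9: {"title": "Help", "body": "I am stuck"},
}


def count_matches(text, term):
    """Count whole-word, case-insensitive occurrences of term in text."""
    return text.lower().split().count(term.lower())


def score(doc, term):
    """Sum of (matches in field * field weight) over all weighted fields."""
    return sum(
        count_matches(doc[field], term) * weight
        for field, weight in FIELD_WEIGHTS.items()
    )


def search(term):
    """Return (doc_id, score) pairs for matching documents, best first."""
    scored = [(doc_id, score(doc, term)) for doc_id, doc in DOCS.items()]
    hits = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


results = search("sick")
print(results)  # [(8, 4), (7, 2)]
```

Document 9 does not contain the word, so it scores 0 and is left out of the results, matching the second table.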