What are norms in Lucene

2 Answers

A norm is one factor in the calculation of a score. It can be calculated however you like, really. What sets the norm apart is that it is calculated at index time. Most other factors influencing the score are calculated at query time, based on how well the document matches the query. Because the norm is stored along with the document instead, it saves work at query time.

The standard implementation can be found, and is well described, in Lucene's TFIDFSimilarity. There, the norm is the product of the field boost (or the product of all the boosts, if several have been set on the field) and "lengthNorm" (a calculated factor designed to weigh matches in shorter fields more heavily). Neither of these depends on the makeup of the query, so both are good candidates to be calculated and stored at index time.
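As a rough illustration of that product, here is a minimal sketch in plain Java. It does not call Lucene; the class and method names are hypothetical, and the 1/sqrt(numTerms) length factor is an assumption modeled on the classic similarity in older Lucene versions.

```java
// Sketch only: hypothetical helper, not Lucene's API.
// Assumes the classic length factor 1 / sqrt(number of terms in the field).
final class NormSketch {

    /** Length factor: gets smaller as the field gets longer. */
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    /** Norm = product of every boost set on the field, times the length factor. */
    static float norm(int numTerms, float... boosts) {
        float boost = 1.0f; // no boost set means a neutral factor of 1
        for (float b : boosts) {
            boost *= b;
        }
        return boost * lengthNorm(numTerms);
    }

    public static void main(String[] args) {
        // Nothing here depends on a query, so it can all be done at index time.
        System.out.println(norm(4));         // 4-term field, no boost
        System.out.println(norm(4, 2.0f));   // same field, boosted by 2
        System.out.println(norm(100, 2.0f)); // much longer field, same boost
    }
}
```

With the same boost, the 100-term field ends up with a smaller norm than the 4-term field, which is the "shorter fields weigh more" effect described above.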

Norms are then stored in a compressed, highly lossy single-byte format (with approximately one significant decimal digit of accuracy).
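To get a feel for how lossy one byte is, the sketch below squeezes a float into a byte by keeping only its top bits and then expands it again. It is modeled on the scheme in Lucene's SmallFloat utility as used by older versions, but the class name is hypothetical and the details should be treated as an assumption rather than a copy of the library.

```java
// Sketch only: a one-byte float encoding that keeps just the top bits.
// Modeled on the idea behind Lucene's SmallFloat; names here are hypothetical.
final class ByteNormSketch {
    private static final int SHIFT = 24 - 3;            // drop the low 21 bits of the float
    private static final int ZERO_POINT = (63 - 15) << 3; // offset that recenters the exponent

    static byte encode(float f) {
        int bits = Float.floatToRawIntBits(f);
        int small = bits >> SHIFT;
        if (small <= ZERO_POINT) {
            // Zero and negatives map to 0; tiny positives map to the smallest code.
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (small >= ZERO_POINT + 0x100) {
            return (byte) -1; // too large: clamp to the biggest code
        }
        return (byte) (small - ZERO_POINT);
    }

    static float decode(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << SHIFT;
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        for (float f : new float[] {1.0f, 0.5f, 0.3f}) {
            System.out.println(f + " -> " + decode(encode(f)));
        }
    }
}
```

Values such as 1.0 and 0.5 survive the round trip, while 0.3 comes back as 0.25, so two fields whose norms differ only slightly can end up with the same stored byte.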

answered Oct 23 '22 15:10

femtoRgon

When you index (that is, process) your source information, you will treat some documents and fields as more important than others.

For example, suppose the task is to spy on your colleagues' emails. A word match in the title field is more important than a word match in the body field. We express this by multiplying the number of matches in the title field by a larger number than the one we use for body field matches.

Example Indexable Email Records

+----+-------------+--------------+
| ID | Title       | Body         |
|----+-------------+--------------|
| 7  | Back Monday | Ben was sick |
| 8  | I'm sick    | cover for me |
| 9  | Help        | I am stuck   |
+----+-------------+--------------+

So, when searching for 'sick', multiplying each title match by 4 and each body match by 2, and ordering by highest score first, the documents are ranked ID 8 first and ID 7 second (see Table 1 below).

Table 1: Matches for the word 'sick' ordered by score (descending)

+----+---------+--------+-----------------------+
| ID | Title   | Body   | Score                 |
|    | Matches | Matches|                       |
|----+---------+--------+-----------------------|
| 8  | 1       | 0      | (1 * 4) + (0 * 2) = 4 |
| 7  | 0       | 1      | (0 * 4) + (1 * 2) = 2 |
+----+---------+--------+-----------------------+

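These multipliers, 4 and 2, are the norms in this simplified picture. In Lucene terms they are index-time field boosts, which the classic similarity folds into each field's norm.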
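The arithmetic in Table 1 can be reproduced with a few lines of plain Java. This is only a toy model of the example, not how Lucene scores documents; the class, method, and constant names are hypothetical, and matching is a simple case-insensitive whole-word count.

```java
// Toy model of the email example: fixed per-field weights times match counts.
// Hypothetical names; real Lucene scoring involves more factors than this.
final class BoostedScoreSketch {
    static final int TITLE_WEIGHT = 4; // multiplier for title matches
    static final int BODY_WEIGHT = 2;  // multiplier for body matches

    /** Counts whole-word, case-insensitive occurrences of term in text. */
    static int countMatches(String text, String term) {
        int count = 0;
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.equals(term.toLowerCase())) {
                count++;
            }
        }
        return count;
    }

    /** Score = title matches * title weight + body matches * body weight. */
    static int score(String title, String body, String term) {
        return countMatches(title, term) * TITLE_WEIGHT
             + countMatches(body, term) * BODY_WEIGHT;
    }

    public static void main(String[] args) {
        System.out.println("ID 7: " + score("Back Monday", "Ben was sick", "sick"));
        System.out.println("ID 8: " + score("I'm sick", "cover for me", "sick"));
        System.out.println("ID 9: " + score("Help", "I am stuck", "sick"));
    }
}
```

The weights are constants chosen before any search is run, which mirrors the point that this part of the score is decided at index time rather than per query.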

answered Oct 23 '22 15:10

notapatch

Tags:

lucene

Nick
