Elasticsearch: How to store term vectors

Tags:

morelikethis

I am working on a project where I heavily use Elasticsearch and leverage the moreLikeThis query to implement some features. The official documentation for the MLT query states the following:

In order to speed up analysis, it could help to store term vectors at index time, but at the expense of disk usage.

This appears in the "How it works" section. The idea, then, is to tune the mapping so that it stores the precalculated term vectors. The problem is that the documentation does not make clear how exactly this should be done. On the one hand, the MLT documentation provides an example mapping that looks like this:

curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed",
              "term_vector": "yes"
            }
          }
        }
      }
    }
  }
}'

On the other hand, the Term Vectors documentation provides a mapping in its Example 1 section that looks like this:

curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
    ....

This should create an index that stores term vectors, payloads etc.

Now the question is: which of the two mappings should be used? Is this a flaw in the documentation, or am I missing something?


asked Aug 28 '15 10:08

Nicola Miotto

1 Answer

You are right: this does not seem to be explicitly mentioned in the current version of the documentation. However, the documentation for the upcoming 2.0 release has a more detailed explanation:

Term vectors contain information about the terms produced by the analysis process, including:

a list of terms.

the position (or order) of each term.

the start and end character offsets mapping the term to its origin in the original string.

These term vectors can be stored so that they can be retrieved for a particular document.

The term_vector setting accepts:

no: No term vectors are stored. (default)

yes: Just the terms in the field are stored

with_positions: Terms and positions are stored

with_offsets: Terms and character offsets are stored

with_positions_offsets: Terms, positions, and character offsets are stored
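
As a rough illustration of how the setting is applied, here is a minimal sketch that creates an index with term vectors enabled on one field and then asks for the stored vectors of one document. The index name (articles), type (article), field (body) and document id (1) are made up for the example, and it assumes a 1.x/2.x node listening on localhost:9200. The value "yes" is used because that is what the MLT documentation example quoted in the question uses; any of the other values listed above could be substituted if positions or offsets are needed as well.

```shell
# Hypothetical names: index "articles", type "article", field "body".
MAPPING='{
  "mappings": {
    "article": {
      "properties": {
        "body": {
          "type": "string",
          "term_vector": "yes"
        }
      }
    }
  }
}'

# Create the index with the mapping above
# (assumes an Elasticsearch 1.x/2.x node on localhost:9200).
curl -s -XPUT 'http://localhost:9200/articles/' -d "$MAPPING" \
  || echo "request failed (is a node running on localhost:9200?)"

# After indexing a document with id 1, inspect what was stored for it.
# _termvector is the 1.x endpoint name used in the Term Vectors documentation.
curl -s -XGET 'http://localhost:9200/articles/article/1/_termvector?pretty=true' \
  || echo "request failed (is a node running on localhost:9200?)"
```

Note that term_vector is an index-time setting, so documents indexed before the mapping was in place would presumably need to be reindexed before their term vectors are stored.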

answered Oct 28 '22 13:10

keety
