
Elasticsearch: How to store term vectors

I am working on a project where I heavily use Elasticsearch and leverage the moreLikeThis query to implement some features. The official documentation for the MLT query states the following:

In order to speed up analysis, it could help to store term vectors at index time, but at the expense of disk usage.

This appears in the *How it works* section. The idea, then, is to tune the mapping so that it stores the precomputed term vectors. The problem is that the documentation does not make clear exactly how this should be done. On one hand, the MLT documentation provides an example mapping that looks like this:

curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
  "mappings": {
    "movies": {
      "properties": {
        "title": {
          "type": "string",
          "term_vector": "yes"
        },
        "description": {
          "type": "string"
        },
        "tags": {
          "type": "string",
          "fields" : {
            "raw": {
              "type" : "string",
              "index" : "not_analyzed",
              "term_vector" : "yes"
            }
          }
        }
      }
    }
  }
}'

On the other hand, the Term Vectors documentation provides a mapping in its Example 1 section that looks like this:

curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
  "mappings": {
    "tweet": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "index_analyzer" : "fulltext_analyzer"
        },
        "fullname": {
          "type": "string",
          "term_vector": "with_positions_offsets_payloads",
          "index_analyzer" : "fulltext_analyzer"
        }
      }
    }
    ....

This should create an index that stores term vectors together with positions, offsets, and payloads.

Now the question is: which of the two mappings should be used? Is this a flaw in the documentation, or am I missing something?

Nicola Miotto asked Aug 28 '15 10:08



1 Answer

You are right: this doesn't seem to be explicitly mentioned in the current version of the documentation. However, the documentation for the upcoming 2.0 release has a more detailed explanation:

Term vectors contain information about the terms produced by the analysis process, including:

  • a list of terms.
  • the position (or order) of each term.
  • the start and end character offsets mapping the term to its origin in the original string.

These term vectors can be stored so that they can be retrieved for a particular document.
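
To make these three pieces of information concrete, here is a minimal Python sketch of what could be recorded for a single field value. It assumes a simple tokenizer that lowercases runs of word characters, which is only a stand-in for Elasticsearch's real analysis chain, and the function name `toy_term_vector` is hypothetical.

```python
import re
from collections import defaultdict


def toy_term_vector(text):
    """Map each term to the positions and character offsets where it occurs.

    Assumption: tokens are runs of word characters, lowercased. A real
    analyzer may also stem, drop stop words, and so on.
    """
    vector = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text)):
        vector[match.group().lower()].append({
            "position": position,            # order of the term in the field
            "start_offset": match.start(),   # first character in the original string
            "end_offset": match.end(),       # one past the last character
        })
    return dict(vector)


vector = toy_term_vector("The quick fox and the lazy dog")
print(sorted(vector))   # the list of distinct terms
print(vector["the"])    # two occurrences, each with position and offsets
```

Here the term `the` occurs twice, so it carries two entries, one per occurrence.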

The term_vector setting accepts:

  • no: No term vectors are stored (default).
  • yes: Just the terms in the field are stored.
  • with_positions: Terms and positions are stored.
  • with_offsets: Terms and character offsets are stored.
  • with_positions_offsets: Terms, positions, and character offsets are stored.
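
As a rough illustration of how these values plug into a mapping, the following Python sketch builds an index-creation body with a chosen `term_vector` level and rejects anything outside the list above. The helper name `build_mapping` is hypothetical; the `movies` type, the `title` field, and the `string` field type are taken from the mappings quoted in the question and may not apply to other Elasticsearch versions.

```python
import json

# Values from the 2.0 documentation excerpt quoted above.
TERM_VECTOR_OPTIONS = (
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
)


def build_mapping(doc_type, field, term_vector="yes"):
    """Return a mapping body that sets term_vector on a single string field."""
    if term_vector not in TERM_VECTOR_OPTIONS:
        raise ValueError("unsupported term_vector value: %r" % term_vector)
    return {
        "mappings": {
            doc_type: {
                "properties": {
                    field: {"type": "string", "term_vector": term_vector}
                }
            }
        }
    }


body = build_mapping("movies", "title", "with_positions_offsets")
print(json.dumps(body, indent=2))
```

The printed JSON has the same shape as the body passed to `curl -XPUT` in the question. Note that the question's second mapping uses `with_positions_offsets_payloads`, which is not in the list quoted here, so this sketch would reject it unless that value is added to `TERM_VECTOR_OPTIONS`.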

keety answered Oct 28 '22 13:10