I am working on a project where I heavily use Elasticsearch and leverage the moreLikeThis query to implement some features. The official documentation for the MLT query states the following:
In order to speed up analysis, it could help to store term vectors at index time, but at the expense of disk usage.
This appears in the **How it works** section. The idea, then, is to tune the mapping so that it stores the precalculated term vectors. The problem is that the documentation does not make clear how exactly this should be done. On the one hand, the MLT documentation provides an example mapping that looks like this:
curl -s -XPUT 'http://localhost:9200/imdb/' -d '{
    "mappings": {
        "movies": {
            "properties": {
                "title": {
                    "type": "string",
                    "term_vector": "yes"
                },
                "description": {
                    "type": "string"
                },
                "tags": {
                    "type": "string",
                    "fields": {
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "term_vector": "yes"
                        }
                    }
                }
            }
        }
    }
}'
On the other hand, the Term Vectors documentation provides a mapping in its Example 1 section that looks like this:
curl -s -XPUT 'http://localhost:9200/twitter/' -d '{
    "mappings": {
        "tweet": {
            "properties": {
                "text": {
                    "type": "string",
                    "term_vector": "with_positions_offsets_payloads",
                    "store": true,
                    "index_analyzer": "fulltext_analyzer"
                },
                "fullname": {
                    "type": "string",
                    "term_vector": "with_positions_offsets_payloads",
                    "index_analyzer": "fulltext_analyzer"
                }
            }
        }
....
This should create an index that stores term vectors, payloads etc.
Now the question is: which of the two mappings should be used? Is this a flaw in the documentation, or am I missing something?
The Term Vectors API returns information and statistics on terms in the fields of a particular document. The document can be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting the realtime parameter to false.
You are right: this doesn't seem to be explicitly mentioned in the current version of the documentation. However, the documentation for the upcoming 2.0 release gives a more detailed explanation.
Term vectors contain information about the terms produced by the analysis process, including:
- a list of terms.
- the position (or order) of each term.
- the start and end character offsets mapping the term to its origin in the original string.
These term vectors can be stored so that they can be retrieved for a particular document.
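To make the three pieces of information concrete, here is a minimal Python sketch that computes terms, positions, and character offsets for a string. It assumes a simple lowercase, word-character tokenizer, which is only a stand-in for whatever analyzer the field actually uses; the function name `term_vector` is hypothetical and this is not how Elasticsearch computes it internally.

```python
import re

def term_vector(text):
    """Map each term to a list of (position, start_offset, end_offset).

    Assumption: terms are lowercased runs of word characters. A real
    Elasticsearch analyzer may tokenize differently.
    """
    vector = {}
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        # start() is inclusive and end() is exclusive offsets into the original string
        vector.setdefault(term, []).append((position, match.start(), match.end()))
    return vector

tv = term_vector("The quick brown fox jumps over the lazy dog")
for term in sorted(tv):
    print(term, tv[term])
```

The term "the" occurs twice, so it gets two entries, each with its own position and offsets. Storing only the terms corresponds to keeping the dictionary keys, while the other options add the position and offset parts of each tuple.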
The term_vector setting accepts:
- no: No term vectors are stored. (default)
- yes: Just the terms in the field are stored.
- with_positions: Terms and positions are stored.
- with_offsets: Terms and character offsets are stored.
- with_positions_offsets: Terms, positions, and character offsets are stored.
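As a rough sketch of how these options fit into a mapping, the Python snippet below builds a request body in the same shape as the examples in the question and rejects any term_vector value that is not in the list above. The helper names (`string_field`, `build_mapping`) are hypothetical, and the choice of "yes" for the title field simply mirrors the MLT documentation example quoted in the question. The Term Vectors example in the question also uses with_positions_offsets_payloads, which is not in the list above; add it to the set if your version accepts it.

```python
import json

# The values listed above. Assumption: this is the complete set for your version.
TERM_VECTOR_OPTIONS = {
    "no",
    "yes",
    "with_positions",
    "with_offsets",
    "with_positions_offsets",
}

def string_field(term_vector="no"):
    """Return a string field definition with the given term_vector option."""
    if term_vector not in TERM_VECTOR_OPTIONS:
        raise ValueError("unknown term_vector option: %r" % term_vector)
    return {"type": "string", "term_vector": term_vector}

def build_mapping(doc_type, fields):
    """Build an index-creation body from a {field_name: term_vector} dict."""
    properties = {name: string_field(tv) for name, tv in fields.items()}
    return {"mappings": {doc_type: {"properties": properties}}}

mapping = build_mapping("movies", {"title": "yes", "description": "no"})
print(json.dumps(mapping, indent=2))
```

The printed JSON can be used as the body of the curl -XPUT request shown in the question. Validating the option before sending the request catches a misspelled value early.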