I have a document in my Elasticsearch index with the following id: AVosj8FEIaetdb3CXpP-
I'm trying to get the tf-idf of every word in one of its fields. I did the following:
GET /cnn/cnn_article/AVosj8FEIaetdb3CXpP-/_termvectors
{
"fields" : ["author_wording"],
"term_statistics" : true,
"field_statistics" : true
}
The response I've got is:
{
"_index": "dailystormer",
"_type": "dailystormer_article",
"_id": "AVosj8FEIaetdb3CXpP-",
"_version": 3,
"found": true,
"took": 1,
"term_vectors": {
"author_wording": {
"field_statistics": {
"sum_doc_freq": 3408583,
"doc_count": 16111,
"sum_ttf": 7851321
},
"terms": {
"318": {
"doc_freq": 4,
"ttf": 4,
"term_freq": 1,
"tokens": [
{
"position": 121,
"start_offset": 688,
"end_offset": 691
}
]
},
"742": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 122,
"start_offset": 692,
"end_offset": 695
}
]
},
"9971": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1,
"tokens": [
{
"position": 123,
"start_offset": 696,
"end_offset": 700
}
]
},
"a": {
"doc_freq": 14921,
"ttf": 163268,
"term_freq": 11,
"tokens": [
{
"position": 1,
"start_offset": 13,
"end_offset": 14
},
...
"you’re": {
"doc_freq": 1112,
"ttf": 1647,
"term_freq": 1,
"tokens": [
{
"position": 80,
"start_offset": 471,
"end_offset": 477
}
]
}
}
}
}
}
It returns some interesting fields, such as the term frequency (tf), but not the tf-idf. Should I compute it myself? Is that a good idea? How can I do so?
Tf-idf is a widely used metric for how significant a term is to a document within a corpus. It is a weighting scheme that assigns a weight to each word in a document based on its term frequency (tf) and its inverse document frequency (idf). Words that appear in only a small percentage of documents (e.g., technical jargon) receive higher weights than words common across all documents (e.g., a, the, and).
idf(t) = log(N / df(t))
where N is the number of documents in the corpus and df(t) is the number of documents in the corpus that contain the term t. In the response above, N = doc_count. The tf-idf of a term is calculated by multiplying its tf and idf scores.
The difference between tf and df is that tf counts the occurrences of a term t in a single document d, whereas df is the number of documents in the corpus in which t is present.
The term vectors response above does not include a tf-idf value, so you need to calculate it yourself. Note that the term and field statistics are only retrieved for the shard the requested document resides in.
Yes, it returns tf, the term frequency (you have both term_freq for this field in this document and ttf, the total term frequency, i.e. the sum of the tf values across all documents for this field), and df, the document frequency (doc_freq in the response). You need to decide whether you want to calculate tf-idf over only your field or over all fields. To compute tf-idf you need to do the following:
tf-idf = tf * idf
where
idf = log (N / df)
and N = doc_count
from your response. The term vectors response above does not include a tf-idf value, so you need to calculate it yourself.
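As a minimal sketch of the formula above, the following Python applies tf * log(N / df) to a parsed _termvectors response. The function name tf_idf_from_term_vectors and the trimmed sample_response dict are illustrative assumptions; the numbers are copied from the response in the question (token positions omitted), and the natural logarithm is an arbitrary choice of base.

```python
import math


def tf_idf_from_term_vectors(response, field):
    """Return {term: tf-idf} for one field of a parsed _termvectors response.

    Uses tf-idf = tf * log(N / df) with N = field_statistics.doc_count.
    (Hypothetical helper, not part of any Elasticsearch client.)
    """
    field_vectors = response["term_vectors"][field]
    n_docs = field_vectors["field_statistics"]["doc_count"]  # N
    scores = {}
    for term, stats in field_vectors["terms"].items():
        tf = stats["term_freq"]  # occurrences in this document's field
        df = stats["doc_freq"]   # documents (on this shard) containing the term
        scores[term] = tf * math.log(n_docs / df)
    return scores


# Trimmed copy of the response from the question.
sample_response = {
    "term_vectors": {
        "author_wording": {
            "field_statistics": {
                "sum_doc_freq": 3408583,
                "doc_count": 16111,
                "sum_ttf": 7851321,
            },
            "terms": {
                "318": {"doc_freq": 4, "ttf": 4, "term_freq": 1},
                "742": {"doc_freq": 1, "ttf": 1, "term_freq": 1},
                "a": {"doc_freq": 14921, "ttf": 163268, "term_freq": 11},
            },
        }
    }
}

scores = tf_idf_from_term_vectors(sample_response, "author_wording")

# Print terms from most to least significant.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.3f}")
```

With these numbers the rare tokens "742" and "318" rank above "a", even though "a" occurs 11 times in the document, because "a" appears in almost every document and its idf is close to zero.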
You can use this API:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
{
"_index": "imdb",
"_type": "_doc",
"_version": 0,
"found": true,
"term_vectors": {
"plot": {
"field_statistics": {
"sum_doc_freq": 3384269,
"doc_count": 176214,
"sum_ttf": 3753460
},
"terms": {
"armored": {
"doc_freq": 27,
"ttf": 27,
"term_freq": 1,
"score": 9.74725
},
"industrialist": {
"doc_freq": 88,
"ttf": 88,
"term_freq": 1,
"score": 8.590818
},
"stark": {
"doc_freq": 44,
"ttf": 47,
"term_freq": 1,
"score": 9.272792
}
}
}
}
}
term_freq - term frequency. The number of times a term appears in a field in one specific document.
doc_freq - document frequency. The number of documents a term appears in.
ttf - total term frequency. The number of times this term appears in all documents, that is, the sum of tf over all documents. Computed per field.
df and ttf are computed per shard and therefore these numbers can vary depending on the shard the current document resides in.
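As a rough numerical check, the score values in the sample response can be compared with a tf-idf style formula built from the fields just defined. The formula in candidate_score below, tf * (1 + ln(N / (df + 1))), is an assumption chosen because it fits these three values; it is not taken from this thread, and since every term here has term_freq = 1 the sample cannot tell whether tf enters linearly or in some other way.

```python
import math

# doc_count from field_statistics in the sample response.
DOC_COUNT = 176214

# term -> (doc_freq, term_freq, reported score), copied from the sample response.
SAMPLE_TERMS = {
    "armored": (27, 1, 9.74725),
    "industrialist": (88, 1, 8.590818),
    "stark": (44, 1, 9.272792),
}


def candidate_score(term_freq, doc_freq, doc_count):
    """Assumed formula: tf * (1 + ln(N / (df + 1))). Illustrative only."""
    return term_freq * (1.0 + math.log(doc_count / (doc_freq + 1)))


# Absolute difference between the assumed formula and the reported score.
differences = {
    term: abs(candidate_score(tf, df, DOC_COUNT) - reported)
    for term, (df, tf, reported) in SAMPLE_TERMS.items()
}

for term, diff in differences.items():
    print(f"{term}: difference from reported score = {diff:.6f}")
```

This is only a check against the three sample values, not a statement of how Elasticsearch defines the score. If you need a specific tf-idf variant, computing it yourself from term_freq, doc_freq and doc_count keeps the definition under your control.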
How are the scores calculated?
The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in foreground and background sets. In brief, a term is considered significant if there is a noticeable difference in the frequency in which a term appears in the subset and in the background. The way the terms are ranked can be configured, see "Parameters" section.
Remember these definitions:
cluster – An Elasticsearch cluster consists of one or more nodes and is identifiable by its cluster name.
node – A single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine.
index – In Elasticsearch, an index is a collection of documents.
shard – Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. Elasticsearch automatically manages the arrangement of these shards. It also rebalances the shards as necessary, so users need not worry about the details.
replica – A copy of a primary shard. In versions before 7.0, Elasticsearch by default creates five primary shards and one replica for each index, so each index consists of five primary shards and each shard has one copy. From 7.0 onward the default is one primary shard with one replica.
Allocating multiple shards and replicas is the essence of the design for distributed search capability, providing for high availability and quick access in searches against the documents within an index. The main difference between a primary and a replica shard is that only the primary shard can accept indexing requests. Both replica and primary shards can serve querying requests.