Elasticsearch: getting the tf-idf of every term in a given document

I have a document in my Elasticsearch index with the following id: AVosj8FEIaetdb3CXpP-. I'm trying to get the tf-idf of every word in one of its fields. I did the following:

GET /cnn/cnn_article/AVosj8FEIaetdb3CXpP-/_termvectors
{
  "fields" : ["author_wording"],
  "term_statistics" : true,
  "field_statistics" : true
}

The response I've got is:

{
  "_index": "dailystormer",
  "_type": "dailystormer_article",
  "_id": "AVosj8FEIaetdb3CXpP-",
  "_version": 3,
  "found": true,
  "took": 1,
  "term_vectors": {
    "author_wording": {
      "field_statistics": {
        "sum_doc_freq": 3408583,
        "doc_count": 16111,
        "sum_ttf": 7851321
      },
      "terms": {
        "318": {
          "doc_freq": 4,
          "ttf": 4,
          "term_freq": 1,
          "tokens": [
            {
              "position": 121,
              "start_offset": 688,
              "end_offset": 691
            }
          ]
        },
        "742": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 122,
              "start_offset": 692,
              "end_offset": 695
            }
          ]
        },
        "9971": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 123,
              "start_offset": 696,
              "end_offset": 700
            }
          ]
        },
        "a": {
          "doc_freq": 14921,
          "ttf": 163268,
          "term_freq": 11,
          "tokens": [
            {
              "position": 1,
              "start_offset": 13,
              "end_offset": 14
            },
            ...
            "you’re": {
          "doc_freq": 1112,
          "ttf": 1647,
          "term_freq": 1,
          "tokens": [
            {
              "position": 80,
              "start_offset": 471,
              "end_offset": 477
            }
          ]
        }
      }
    }
  }
}

It returns some interesting values such as the term frequency (tf), but not the tf-idf. Should I compute it myself? Is that a good idea? How can I do so?

asked Feb 14 '17 08:02 by mel



2 Answers

Yes, it returns the tf - term frequency (you get both term_freq, the term frequency for this field in this document, and ttf, the total term frequency, i.e. the sum of all tf's across all documents for this field) and the df - document frequency (doc_freq in the response). You need to decide whether you want to calculate tf-idf across only your field or across all fields. To compute tf-idf you need to do the following:

tf-idf = tf * idf

where

idf = log (N / df)

and N = doc_count from your response. Elasticsearch does not provide an implementation for calculating tf-idf, so you need to do it yourself.
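The formula above can be sketched in Python as follows. This is only a minimal illustration, not something Elasticsearch ships: the function name tf_idf_from_termvectors is made up for this example, the natural logarithm is an assumption (the formula does not fix a base), and the sample data is a trimmed copy of the response from the question.

```python
import math


def tf_idf_from_termvectors(response, field):
    """Return {term: tf-idf} for one field of a _termvectors response.

    Hypothetical helper: tf is term_freq, df is doc_freq and N is the
    doc_count from field_statistics, as described in the answer.
    """
    field_vectors = response["term_vectors"][field]
    n_docs = field_vectors["field_statistics"]["doc_count"]  # N
    scores = {}
    for term, stats in field_vectors["terms"].items():
        tf = stats["term_freq"]
        df = stats["doc_freq"]
        idf = math.log(n_docs / df)  # natural log is an assumption
        scores[term] = tf * idf
    return scores


# Trimmed copy of the response shown in the question.
response = {
    "term_vectors": {
        "author_wording": {
            "field_statistics": {
                "sum_doc_freq": 3408583,
                "doc_count": 16111,
                "sum_ttf": 7851321,
            },
            "terms": {
                "318": {"doc_freq": 4, "ttf": 4, "term_freq": 1},
                "742": {"doc_freq": 1, "ttf": 1, "term_freq": 1},
                "a": {"doc_freq": 14921, "ttf": 163268, "term_freq": 11},
            },
        }
    }
}

scores = tf_idf_from_termvectors(response, "author_wording")
# Print terms from most to least distinctive.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.3f}")
```

Note that doc_freq and doc_count come from the shard that holds the document, so the values are approximate on a multi-shard index.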

answered Nov 15 '22 05:11 by Mysterion


You can use this API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html
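The answer does not show the request that goes with this API. As a rough sketch, and assuming the score values come from the filter option of the term vectors API (which keeps only the highest-scoring terms), the request body could be built like this in Python. The helper name build_filtered_termvectors_body and the chosen thresholds are illustrative assumptions.

```python
import json


def build_filtered_termvectors_body(fields, max_num_terms=3):
    """Build a _termvectors request body that asks for per-term scores.

    Hypothetical helper; the keys follow the term vectors API, the
    threshold values are only examples.
    """
    return {
        "fields": list(fields),
        "term_statistics": True,
        "field_statistics": True,
        "positions": False,
        "offsets": False,
        "filter": {
            "max_num_terms": max_num_terms,  # keep only the top-scoring terms
            "min_term_freq": 1,
            "min_doc_freq": 1,
        },
    }


body = build_filtered_termvectors_body(["plot"])
# Send this JSON as the body of a _termvectors request.
print(json.dumps(body, indent=2))
```

With the filter in place, each returned term carries a score field next to doc_freq, ttf and term_freq.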

{
   "_index": "imdb",
   "_type": "_doc",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}

term_freq - term frequency. The number of times a term appears in a field in one specific document.

doc_freq - document frequency. The number of documents a term appears in.

ttf - total term frequency. The number of times this term appears in all documents, that is, the sum of tf over all documents. Computed per field.

df and ttf are computed per shard and therefore these numbers can vary depending on the shard the current document resides in.

How are the scores calculated?

The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than being something easily understood by end users. The scores are derived from the doc frequencies in foreground and background sets. In brief, a term is considered significant if there is a noticeable difference between the frequency with which it appears in the subset and in the background. The way the terms are ranked can be configured; see the "Parameters" section of the documentation.

Remember these definitions:

cluster – An Elasticsearch cluster consists of one or more nodes and is identifiable by its cluster name.

node – A single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine.

index – In Elasticsearch, an index is a collection of documents.

shard – Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. Elasticsearch automatically manages the arrangement of these shards. It also rebalances the shards as necessary, so users need not worry about the details.

replica – A copy of a primary shard. By default, Elasticsearch versions before 7.0 create five primary shards and one replica for each index (from 7.0 on, the default is one primary shard). With the older default, each index consists of five primary shards, and each shard has one copy.

Allocating multiple shards and replicas is the essence of the design for distributed search capability, providing for high availability and quick access in searches against the documents within an index. The main difference between a primary and a replica shard is that only the primary shard can accept indexing requests. Both replica and primary shards can serve querying requests.

answered Nov 15 '22 07:11 by artamonovdev