Elasticsearch: getting the tf-idf of every term in a given document

Tags: elasticsearch, nlp, tf-idf

I have a document in my Elasticsearch index with the following id: AVosj8FEIaetdb3CXpP-. For every word in one of its fields, I'm trying to get the tf-idf. I did the following:

GET /cnn/cnn_article/AVosj8FEIaetdb3CXpP-/_termvectors
{
  "fields" : ["author_wording"],
  "term_statistics" : true,
  "field_statistics" : true
}

The response I've got is:

{
  "_index": "dailystormer",
  "_type": "dailystormer_article",
  "_id": "AVosj8FEIaetdb3CXpP-",
  "_version": 3,
  "found": true,
  "took": 1,
  "term_vectors": {
    "author_wording": {
      "field_statistics": {
        "sum_doc_freq": 3408583,
        "doc_count": 16111,
        "sum_ttf": 7851321
      },
      "terms": {
        "318": {
          "doc_freq": 4,
          "ttf": 4,
          "term_freq": 1,
          "tokens": [
            {
              "position": 121,
              "start_offset": 688,
              "end_offset": 691
            }
          ]
        },
        "742": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 122,
              "start_offset": 692,
              "end_offset": 695
            }
          ]
        },
        "9971": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1,
          "tokens": [
            {
              "position": 123,
              "start_offset": 696,
              "end_offset": 700
            }
          ]
        },
        "a": {
          "doc_freq": 14921,
          "ttf": 163268,
          "term_freq": 11,
          "tokens": [
            {
              "position": 1,
              "start_offset": 13,
              "end_offset": 14
            },
            ...
            "you’re": {
          "doc_freq": 1112,
          "ttf": 1647,
          "term_freq": 1,
          "tokens": [
            {
              "position": 80,
              "start_offset": 471,
              "end_offset": 477
            }
          ]
        }
      }
    }
  }
}

It returns some interesting values, such as the term frequency (tf), but not the tf-idf. Should I compute it myself? Is that a good idea, and how can I do it?

asked Feb 14 '17 08:02

mel

2 Answers

Yes, the response gives you the tf (term_freq, the frequency of the term in this field of this document), the ttf (total term frequency, i.e. the sum of the tf values over all documents for this field), and the df (doc_freq, the document frequency). You need to decide whether you want to calculate tf-idf for this field only or across all fields. To compute tf-idf, do the following:

tf-idf = tf * idf

where

idf = log (N / df)

and N = doc_count from your response (the number of documents that contain this field). The term vectors response shown in the question does not include a tf-idf value, so you need to compute it yourself.
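
The formula above can be applied directly to the parsed JSON response. The following is a minimal Python sketch, not a definitive implementation: the function name tf_idf_per_term is made up for illustration, the natural logarithm is an assumption (the formula does not fix a base), and the sample data is a trimmed copy of the response from the question.

```python
import math

def tf_idf_per_term(response, field):
    """Return {term: tf-idf} for one field of a parsed _termvectors response.

    Hypothetical helper: uses tf-idf = tf * log(N / df) with the natural log,
    which is an assumption; any base only rescales the values.
    """
    field_data = response["term_vectors"][field]
    n_docs = field_data["field_statistics"]["doc_count"]  # N
    scores = {}
    for term, stats in field_data["terms"].items():
        idf = math.log(n_docs / stats["doc_freq"])  # idf = log(N / df)
        scores[term] = stats["term_freq"] * idf     # tf-idf = tf * idf
    return scores

# Trimmed copy of the response from the question (token positions omitted).
response = {
    "term_vectors": {
        "author_wording": {
            "field_statistics": {
                "sum_doc_freq": 3408583,
                "doc_count": 16111,
                "sum_ttf": 7851321,
            },
            "terms": {
                "318": {"doc_freq": 4, "ttf": 4, "term_freq": 1},
                "a": {"doc_freq": 14921, "ttf": 163268, "term_freq": 11},
            },
        }
    }
}

scores = tf_idf_per_term(response, "author_wording")
# Print terms from most to least distinctive.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.3f}")
```

With this data the rare term "318" ends up with a much higher value than the very common term "a", even though "a" occurs 11 times in the document, which is the behaviour tf-idf is meant to produce.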

answered Nov 15 '22 05:11

Mysterion

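You can use the term vectors API:

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html

With the filter parameter, the returned terms can be filtered based on their tf-idf scores, and each term carries a score, as in this example response from the documentation: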

{
   "_index": "imdb",
   "_type": "_doc",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}

term_freq - term frequency. The number of times a term appears in a field in one specific document.

doc_freq - document frequency. The number of documents a term appears in.

ttf - total term frequency. The number of times this term appears in all documents, that is, the sum of tf over all documents. Computed per field.

doc_freq and ttf are computed per shard, so these numbers can vary depending on the shard the current document resides in.
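
A response with a score per term, like the one above, is produced when the term vectors request includes a filter object. The following Python sketch only builds such a request; the helper name build_termvectors_request and the document id "1" are hypothetical, and the typeless path form /{index}/_termvectors/{id} is an assumption that fits recent Elasticsearch versions (older versions include the type in the path, as in the question).

```python
import json

def build_termvectors_request(index, doc_id, field, max_num_terms=3):
    """Build the path and JSON body for a term vectors request with filtering.

    Hypothetical helper for illustration; it does not send anything.
    """
    path = f"/{index}/_termvectors/{doc_id}"  # assumed typeless endpoint form
    body = {
        "fields": [field],
        "term_statistics": True,
        "field_statistics": True,
        "positions": False,
        "offsets": False,
        # Term filtering: keep only the top terms by score.
        "filter": {
            "max_num_terms": max_num_terms,
            "min_term_freq": 1,
            "min_doc_freq": 1,
        },
    }
    return path, body

path, body = build_termvectors_request("imdb", "1", "plot")
print("GET", path)
print(json.dumps(body, indent=2))
```

The printed body can be pasted into the Kibana console or sent with any HTTP client. Since doc_freq is computed per shard, the scores are also relative to the shard that holds the document.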

How are the scores calculated?

The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in foreground and background sets. In brief, a term is considered significant if there is a noticeable difference in the frequency in which a term appears in the subset and in the background. The way the terms are ranked can be configured, see "Parameters" section.

Remember these definitions:

cluster – An Elasticsearch cluster consists of one or more nodes and is identifiable by its cluster name.

node – A single Elasticsearch instance. In most environments, each node runs on a separate box or virtual machine.

index – In Elasticsearch, an index is a collection of documents.

shard – Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. Elasticsearch automatically manages the arrangement of these shards. It also rebalances the shards as necessary, so users need not worry about the details.

replica – A copy of a primary shard. Before version 7.0, Elasticsearch by default created five primary shards and one replica for each index, so each index consisted of five primary shards and each shard had one copy. Since 7.0 the default is one primary shard with one replica.

Allocating multiple shards and replicas is the essence of the design for distributed search capability, providing for high availability and quick access in searches against the documents within an index. The main difference between a primary and a replica shard is that only the primary shard can accept indexing requests. Both replica and primary shards can serve querying requests.
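
As a small worked example of the shard and replica arithmetic described above (the function name total_shards is made up for illustration): an index with five primary shards and one replica per primary consists of ten shard copies in total.

```python
def total_shards(primaries, replicas_per_primary):
    """Total shard copies for an index: each primary plus its replicas."""
    return primaries * (1 + replicas_per_primary)

# Five primaries with one replica each, as in the definition above.
print(total_shards(5, 1))  # prints 10
```

Only the five primaries accept indexing requests, while all ten copies can serve searches.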

answered Nov 15 '22 07:11

artamonovdev
