Elasticsearch word frequency and relations

I am wondering if it is possible at all to get the top ten most frequent words in an Elasticsearch field across an entire index or alias.

Here is what I'm trying to do:

I am indexing text extracted from various document types (Word, PowerPoint, PDF, etc.). The text is analyzed and stored in a field called doc_content. I would like to know if there is a way to find the most frequent word(s) stored in the doc_content field across a particular index.

To make it clearer, let's assume I am indexing invoices from Amazon and eBay. Let's say I have 100 invoices from Amazon and 20 invoices from eBay. Let's also assume that the word "amazon" occurs twice in each Amazon invoice and the word "ebay" occurs 3 times in each eBay invoice.

Now, is there a way to get an aggregate of sorts that tells me that the word "amazon" appears in my index 200 times (100 invoices x 2 occurrences/invoice) and the word "ebay" appears 60 times (20 invoices x 3 occurrences/invoice)?
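
To pin down the numbers being asked for, here is a minimal plain-Python sketch of the count. It runs entirely outside Elasticsearch, the invoice texts are made-up stand-ins for doc_content, and the regex tokenizer is only an assumption about how the field might be analyzed:

```python
from collections import Counter
import re

def total_term_counts(docs):
    """Count every occurrence of every lowercased word across all documents."""
    counts = Counter()
    for text in docs:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts

# Hypothetical stand-ins for the doc_content of each invoice.
amazon_invoice = "Amazon invoice: thank you for shopping at Amazon"
ebay_invoice = "eBay invoice: eBay order sold via eBay"
docs = [amazon_invoice] * 100 + [ebay_invoice] * 20

counts = total_term_counts(docs)
print(counts["amazon"], counts["ebay"])  # 200 60
```

Note that this counts total occurrences, not the number of documents containing the word, which is the distinction the question hinges on.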

My other question: if the above is possible, is there a way to determine the most frequent word that comes right after a given word?

For example, let's assume I have 100 documents. 60 of these documents contain the phrase "Old Cat" and 40 contain the phrase "Old Dog", and for the sake of argument let's assume that each phrase appears only once per document.

Now, if we can get the frequency of the word "old", which in our case should be 100, can we then determine its relation to the word that comes right after it, to get something like this:

               __________ Cat (60)
              |
Old (100)-----|
              |__________ Dog (40)
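
The relation in the diagram amounts to counting adjacent word pairs. A rough plain-Python sketch of that idea, again outside Elasticsearch and with invented sample sentences and a simple regex tokenizer as assumptions:

```python
from collections import Counter
import re

def successors(docs, word):
    """Count the words that directly follow `word` across all documents."""
    following = Counter()
    for text in docs:
        tokens = re.findall(r"\w+", text.lower())
        # Walk over adjacent token pairs (current word, next word).
        for current, nxt in zip(tokens, tokens[1:]):
            if current == word:
                following[nxt] += 1
    return following

# Hypothetical documents: 60 mention "old cat", 40 mention "old dog".
docs = ["my old cat sleeps"] * 60 + ["my old dog barks"] * 40

after_old = successors(docs, "old")
print(sum(after_old.values()), after_old.most_common())
# 100 [('cat', 60), ('dog', 40)]
```
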
Zaid Amir asked May 04 '15 05:05


1 Answer

To get term frequencies you could use term vectors. However, you would first have to store them, and second, you can only retrieve them for one document at a time.

As far as I know, it's not possible to aggregate over term vectors.
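
One possible workaround is to fetch the term vectors document by document and add them up on the client. The sketch below covers only the merging step and assumes each response is a dict shaped like the term vectors API output (`term_vectors` → field → `terms` → `term_freq`). The sample responses are hand-written, not real output, and fetching them is left out:

```python
from collections import Counter

def sum_term_freqs(responses, field):
    """Add up per-document term_freq values for one field."""
    totals = Counter()
    for response in responses:
        # Documents without term vectors for the field contribute nothing.
        terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            totals[term] += stats["term_freq"]
    return totals

# Hand-written responses, one per document (hypothetical data).
sample_responses = [
    {"term_vectors": {"doc_content": {"terms": {
        "amazon": {"term_freq": 2}, "invoice": {"term_freq": 1}}}}},
    {"term_vectors": {"doc_content": {"terms": {
        "ebay": {"term_freq": 3}, "invoice": {"term_freq": 1}}}}},
]

# Ten most frequent terms across the sampled documents.
top = sum_term_freqs(sample_responses, "doc_content").most_common(10)
print(top)
```

This needs one term vector lookup per document, so it is likely to be impractical for a large index.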

Maybe you could achieve some of what you want using scripted fields. But then again, Groovy scripting is currently disfavoured because of security issues, and aggregating over scripted fields is potentially quite slow.

By the way, similar questions have been asked before:

  • Aggregate Terms Usage Count
  • elasticsearch - Return term frequency of a single field
Jakub Kotowski answered Oct 11 '22 22:10