
Ways to process only new data (indexed after the last run) in Elasticsearch?

Is there a way to get the date and time at which an Elasticsearch document was written?

I am running ES queries via Spark and would prefer NOT to look through all the documents that I have already processed. Instead, I would like to read only the documents that were ingested between the last time the program ran and now.

What is the most efficient way to do this?

I have looked at:

  • Updating each document to add a field holding an array of booleans that records which analytics have already looked at it. The downside is waiting for the update to occur.
  • The index-per-time-frame method, which would break the current indexes down into smaller ones, e.g. one per hour. The downside I see is the number of open file descriptors.
  • ??

Elasticsearch version 5.6

SparkleGoat asked Dec 11 '17 19:12




1 Answer

I posted the question on the Elasticsearch discussion board, and it appears that using an ingest pipeline is the best option.
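For anyone landing here later, this is a rough sketch of how the ingest pipeline idea could look, not necessarily what the discussion board suggested in detail. It assumes a `set` processor that copies the `{{_ingest.timestamp}}` ingest metadata into a field, and a `range` query on that field for the incremental read. The field name `ingested_at` and the helper function names are my own placeholders, and the field is assumed to be mapped as a date. The code only builds the request bodies; it does not talk to a cluster.

```python
import json
from datetime import datetime, timezone

# Hypothetical field name; pick whatever fits your mapping.
INGEST_FIELD = "ingested_at"


def build_pipeline(field=INGEST_FIELD):
    """Body for PUT _ingest/pipeline/<pipeline-id>.

    The set processor stamps every document that passes through the
    pipeline with the time the ingest node received it.
    """
    return {
        "description": "Stamp each document with the time it was ingested",
        "processors": [
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


def build_incremental_query(last_run, now, field=INGEST_FIELD):
    """Search body matching documents ingested after last_run, up to now.

    Assumes `field` is mapped as a date so the range comparison works.
    """
    return {
        "query": {
            "range": {
                field: {
                    "gt": last_run.isoformat(),
                    "lte": now.isoformat(),
                }
            }
        }
    }


if __name__ == "__main__":
    # Example values only; in practice last_run would be persisted
    # by the job at the end of each successful run.
    last_run = datetime(2017, 12, 11, 19, 0, tzinfo=timezone.utc)
    now = datetime.now(timezone.utc)
    print(json.dumps(build_pipeline(), indent=2))
    print(json.dumps(build_incremental_query(last_run, now), indent=2))
```

The pipeline body would be registered once, and documents would then need to be indexed with the `pipeline` request parameter so the timestamp gets set. The query body could be handed to the Spark job as its Elasticsearch query, with the job saving `now` as the next run's `last_run`. Because of the refresh interval, documents ingested just before `now` may not be searchable yet, so a small overlap or delay is probably worth considering.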

SparkleGoat answered Nov 12 '22 01:11