
Blue/Green "deployment" of elasticsearch data?


I am planning to extract (essentially scrape, with permission) some data from a web page and store it in Elasticsearch (you know, for search).

While I have permission to scrape the data from the site,

  • there is no API or other structured source for this data
  • it's manually authored straight into HTML
  • there are no unique identifiers that differentiate one entry from another (I will essentially be extracting around 1,000-5,000 entries from the DOM).

When I store this in ES, I plan to put it into one index under a single mapping type, say thing.

However, over time, the source (the HTML web page) is likely to change as they add/remove/change content of some of these entries. Since there are no identifiers in the source, I can't easily identify new ones (and even worse, deleted ones or changed ones).

I want to keep my es index up to date and what I am thinking is some sort of a blue-green mechanism:

  • I run the extraction process at some schedule (daily/weekly) depending on the velocity of the source changing
  • Every time it runs the process produces another index (or could be a new cluster altogether). Say the current index is index-prod and the new one built by the process is index-rc (release candidate)
  • It validates index-rc based on some heuristics (a flexible velocity check on the number of entries, sample queries that we know should work etc.)
  • And if it's valid, it either:
    • A. gradually shifts queries to the new cluster/index
    • or B. flips in one shot to the new cluster/index
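A minimal sketch of the validation step, assuming the entry counts of both indexes and the hit counts of the sample queries have already been fetched. The function names, the 20% threshold, and the shape of sample_query_hits are all hypothetical choices for illustration:

```python
def count_is_plausible(prod_count, rc_count, max_relative_change=0.2):
    """Velocity check: reject the candidate if its size moved too far from prod."""
    if prod_count == 0:
        # Nothing to compare against yet; accept any non-empty candidate.
        return rc_count > 0
    change = abs(rc_count - prod_count) / prod_count
    return change <= max_relative_change


def validate_candidate(prod_count, rc_count, sample_query_hits,
                       max_relative_change=0.2):
    """Return True if index-rc passes both heuristics.

    sample_query_hits maps each known-good query (hypothetical input)
    to the number of hits it returned against index-rc.
    """
    if not count_is_plausible(prod_count, rc_count, max_relative_change):
        return False
    # Every query that is known to work must return at least one hit.
    return all(hits > 0 for hits in sample_query_hits.values())
```

The threshold is the "flexible" part: it could be loosened or tightened depending on how quickly the source page tends to change.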

I am planning to host the Elasticsearch cluster on the AWS Elasticsearch Service and could possibly concoct something using Route 53 CNAMEs (and maybe ELB?), but I wanted to know whether Elasticsearch itself has built-in support for doing this.

Essentially, I want to swap one index's data for another.

arnab asked Nov 02 '16 16:11


1 Answer

You don't need to move the data between indexes. If I understand correctly, you can use index aliases: have your queries target an alias, and switch that alias from the current index version to the next one.
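A minimal sketch of the one-shot flip (option B), assuming queries go through an alias and the real indexes carry versioned names. The names things, things-v1, and things-v2 are hypothetical, as is the assumption that the cluster accepts plain unauthenticated HTTP requests from the caller. Both actions are sent in a single request to the _aliases endpoint, which applies them atomically:

```python
import json
import urllib.request


def build_alias_swap(alias, old_index, new_index):
    """Build the _aliases request body that moves alias from old to new."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


def swap_alias(base_url, alias, old_index, new_index):
    """POST the swap to the cluster at base_url (e.g. http://localhost:9200)."""
    body = json.dumps(build_alias_swap(alias, old_index, new_index)).encode("utf-8")
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_aliases",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, swap_alias("http://localhost:9200", "things", "things-v1", "things-v2") would repoint the alias; the old index stays in place and can be deleted later or kept for rollback.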

To shift query traffic gradually, I suppose a load balancer such as nginx is the best option. There are many examples of this on the web.
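To illustrate what a gradual shift (option A) amounts to, here is a sketch of a weighted split done in application code rather than in a load balancer. This is not nginx configuration; it is a swapped-in technique that only shows the idea. The function name, the request_key parameter, and the default index names are assumptions:

```python
import hashlib


def pick_index(request_key, rc_percent,
               prod_index="index-prod", rc_index="index-rc"):
    """Route roughly rc_percent of request keys to the candidate index.

    Hashing the key keeps the choice stable, so the same user or
    session always lands on the same index for a given percentage.
    """
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100  # bucket in 0..99
    return rc_index if bucket < rc_percent else prod_index
```

Raising rc_percent step by step from 0 to 100 moves traffic over; once it reaches 100, the old index no longer receives queries.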

Allan Sene answered Sep 25 '22 16:09