Elasticsearch replication of other system data?

Suppose I want to use Elasticsearch to implement a generic search on a website. The top search bar would be expected to find resources of all different kinds across the site: documents for sure (uploaded/indexed via Tika), but also things like clients, accounts, other people, etc.

For architectural reasons, most of the non-document stuff (clients, accounts) will exist in a relational database.

When implementing this search, option #1 would be to create document versions of everything and then use Elasticsearch to run all aspects of the search, relying not at all on the relational database for finding different types of objects.
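Concretely, option #1 amounts to something like the following hedged sketch, using the Python Elasticsearch client (8.x-style API); all index and field names are invented for illustration:

```python
# Sketch of option #1: every entity gets a document copy in Elasticsearch,
# and one query spans all of them. Index/field names are made up here.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a relational "client" row as a document, tagged with its type
# so results can be labeled in the UI.
es.index(index="clients", id=42, document={
    "name": "Acme Corp",
    "email": "info@acme.example",
    "entity_type": "client",
})

# A single site-wide search can then span every entity index at once.
results = es.search(
    index="clients,accounts,documents",
    query={"multi_match": {"query": "acme",
                           "fields": ["name", "email", "content"]}},
)
```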

Option #2 would be to use elasticsearch only for indexing the documents, which would mean for a general "site search" feature, you'd have to farm out multiple searches to multiple systems, then aggregate the results before returning them.

Option #1 seems far superior, but the downside is that it requires Elasticsearch, in essence, to hold a copy of a great many things in the production relational database, and that those copies be kept fresh as things change.

What's the best option for keeping these stores in sync, and am I correct in thinking that for general search, option #1 is superior? Is there an option #3?

asked Dec 27 '15 by FrobberOfBits



2 Answers

You've pretty much listed the two main options when it comes to searching across multiple data stores: search one central data store (option #1), or search all data stores and aggregate the results (option #2).

Both options would work, although option #2 has two main drawbacks:

  1. It requires a substantial amount of application logic to "branch out" the search to the multiple data stores and aggregate the results you get back (see the sketch after this list).
  2. The response times may differ from one data store to another, so you'll have to wait for the slowest data store to respond before presenting the combined results to the user (unless you circumvent this with asynchronous techniques such as Ajax, WebSockets, etc.).
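Here is a minimal sketch of that fan-out-and-aggregate logic; `search_es` and `search_sql` are hypothetical application functions, and the point is that total latency is bounded by the slowest backend:

```python
# Hedged sketch of option #2's fan-out logic. search_es/search_sql are
# placeholders for real application code against each store.
from concurrent.futures import ThreadPoolExecutor

def search_es(term):
    return []  # placeholder: full-text query against Elasticsearch

def search_sql(term):
    return []  # placeholder: LIKE/full-text query against the RDBMS

def site_search(term):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_es, term),
                   pool.submit(search_sql, term)]
        # .result() blocks: we must wait for the slowest store before
        # we can merge, dedupe, and rank the combined hit list.
        hits = [hit for f in futures for hit in f.result()]
    return sorted(hits, key=lambda h: h.get("score", 0), reverse=True)
```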

If you want to provide a better and more reliable search experience, option #1 would clearly get my vote (it's the way I go most of the time, actually). As you've correctly stated, the main "drawback" of this option is that you need to keep Elasticsearch in sync with the changes in your other master data stores.

Since your other data stores will be relational databases, you have a few different options to keep them in sync with Elasticsearch, namely:

  • using the Logstash JDBC input (example pipeline below)
  • using the JDBC importer tool
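For the first option, a typical Logstash pipeline looks like the following; connection details, schedule, and table/index names are placeholders to adapt:

```
# Hypothetical pipeline: poll a "clients" table and index rows by id.
input {
  jdbc {
    jdbc_driver_library => "/path/to/mysql-connector-java.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/mydb"
    jdbc_user => "logstash"
    jdbc_password => "secret"
    schedule => "* * * * *"   # poll every minute
    statement => "SELECT * FROM clients WHERE updated_at > :sql_last_value"
    use_column_value => true
    tracking_column => "updated_at"
    tracking_column_type => "timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "clients"
    document_id => "%{id}"    # keeps re-imports idempotent
  }
}
```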

These first two options work great but have one main disadvantage: they don't capture DELETEs on your tables, only INSERTs and UPDATEs. This means that if you ever delete a user, account, etc., you won't know that you need to delete the corresponding document from Elasticsearch. Unless, of course, you delete and rebuild the Elasticsearch index before each import session.
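One common workaround (my addition, not something these tools do for you) is to soft-delete rows with a `deleted_at` column instead of physically deleting them, so the polling import still sees the change and can remove the Elasticsearch copy. A hedged sketch, with invented index and column names:

```python
# Apply one polled "clients" row to the index, honoring soft deletes.
from elasticsearch import Elasticsearch, NotFoundError

es = Elasticsearch("http://localhost:9200")

def sync_row(row):
    if row.get("deleted_at") is not None:
        # Soft-deleted in the database -> remove the copy from the index.
        try:
            es.delete(index="clients", id=row["id"])
        except NotFoundError:
            pass  # already gone
    else:
        es.index(index="clients", id=row["id"], document=row)
```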

To alleviate this, you can use another tool that works from the MySQL binlog and can therefore capture every event, deletes included. There's one written in Go, one in Java, and one in Python.
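Assuming the Python one is python-mysql-replication (an assumption on my part), the capture loop looks roughly like this; it requires `binlog_format=ROW` on the MySQL server:

```python
# Hedged sketch: tail the binlog and react to row-level events.
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent,
    UpdateRowsEvent,
    WriteRowsEvent,
)

stream = BinLogStreamReader(
    connection_settings={"host": "localhost", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,  # any id not used by another replica
    only_events=[DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent],
    blocking=True,
)

for event in stream:
    for row in event.rows:
        if isinstance(event, DeleteRowsEvent):
            print("delete from ES:", event.table, row["values"])
        elif isinstance(event, UpdateRowsEvent):
            print("update in ES:", event.table, row["after_values"])
        else:  # WriteRowsEvent
            print("index into ES:", event.table, row["values"])
```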

UPDATE:

Here is another interesting blog article on the subject: How to keep Elasticsearch synchronized with a relational database using Logstash

answered Oct 22 '22 by Val


Please take a look at Debezium. It's a change data capture (CDC) platform that allows you to stream your data changes.

I created a simple GitHub repository that shows how it works with PostgreSQL and Elasticsearch.
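A hedged sketch of the consuming side, assuming Debezium's default event envelope (no "unwrap" transform) and default topic naming (`<server>.<schema>.<table>`); names here are placeholders, so verify them against your connector config:

```python
# Consume Debezium change events from Kafka and apply them to Elasticsearch.
import json

from elasticsearch import Elasticsearch
from kafka import KafkaConsumer  # pip install kafka-python

es = Elasticsearch("http://localhost:9200")
consumer = KafkaConsumer(
    "dbserver1.public.clients",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for msg in consumer:
    if msg.value is None:
        continue  # tombstone record emitted after a delete
    payload = msg.value["payload"]
    if payload["op"] == "d":
        # "before" carries at least the old row's key
        es.delete(index="clients", id=payload["before"]["id"])
    else:  # "c"reate, "u"pdate, or "r"ead (initial snapshot)
        doc = payload["after"]
        es.index(index="clients", id=doc["id"], document=doc)
```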


answered Oct 22 '22 by Yegor Zaremba