I set up Elasticsearch Service and a DynamoDB stream as described in this blog post. Now I need to add pre-existing data from DynamoDB to Elasticsearch. I saw the "Indexing pre-existing content" part of the article, but I don't know what to do with that Python code or where to execute it. What is the best option for adding pre-existing data in this case?


How to add pre-existing data from DynamoDB to Elasticsearch?

1 Answer

Populating Elasticsearch with existing items is not straightforward, because a DynamoDB stream only captures item changes, not records that already exist.

Here are a few approaches, with pros and cons:

1. Scan all the existing items from DynamoDB and send them to Elasticsearch

We can scan all the existing items with a Python script hosted on an EC2 machine and send the data to ES.

Pros:

a. Simple solution; nothing much is required.

Cons:

a. It cannot run in a Lambda function, since the job may time out if there are too many records.

b. This approach is more of a one-time thing and cannot be used for incremental changes (say we want to keep updating ES as the DynamoDB data changes).
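
A minimal sketch of the scan-and-send approach, assuming a boto3 `Table` resource (whose `scan` returns `Items` and, while more pages remain, `LastEvaluatedKey`). The names `backfill`, `send_bulk`, `index_name` and `id_attr` are hypothetical: `send_bulk` stands for whatever function you use to POST a newline-delimited body to the Elasticsearch `_bulk` endpoint (including any request signing your domain requires). Older Elasticsearch versions may also expect a `_type` in the action line.

```python
import json
from decimal import Decimal


def scan_all_items(table):
    """Yield every item in the table, following Scan pagination."""
    kwargs = {}
    while True:
        page = table.scan(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        kwargs["ExclusiveStartKey"] = last_key


def _json_default(value):
    # boto3 returns DynamoDB numbers as Decimal and sets as Python sets;
    # neither is JSON-serializable out of the box.
    if isinstance(value, Decimal):
        return int(value) if value % 1 == 0 else float(value)
    if isinstance(value, set):
        return list(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")


def to_bulk_body(items, index_name, id_attr):
    """Build a newline-delimited _bulk request body (assumed document id: id_attr)."""
    lines = []
    for item in items:
        action = {"index": {"_index": index_name, "_id": str(item[id_attr])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(item, default=_json_default))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


def chunked(iterable, size):
    """Group an iterable into lists of at most `size` elements."""
    batch = []
    for element in iterable:
        batch.append(element)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def backfill(table, send_bulk, index_name, id_attr, batch_size=500):
    """Scan the whole table and hand bulk bodies to `send_bulk` (hypothetical sender)."""
    total = 0
    for batch in chunked(scan_all_items(table), batch_size):
        send_bulk(to_bulk_body(batch, index_name, id_attr))
        total += len(batch)
    return total


# Possible usage on an EC2 machine (table and index names are placeholders):
#   import boto3
#   table = boto3.resource("dynamodb").Table("my-table")
#   backfill(table, my_bulk_sender, "my-index", "id")
```

Keeping the sender as a parameter means the same script can be pointed at a test stub first and at the real cluster afterwards.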

2. Use DynamoDB streams

We can enable DynamoDB streams and build the pipeline as explained here. Then we can update some flag on the existing items so that all the records flow through the pipeline and the data goes to ES.

Pros:

a. The pipeline can be used for incremental DynamoDB changes.

b. No code duplication or one-time effort. Every time we need to update an item in ES, we update the item in DynamoDB and it gets indexed in ES.

c. No redundant, untested, one-time code. (Maintaining such code is a huge issue in the software world.)

Cons:

a. Changing prod data can be dangerous and may not be allowed, depending on the use case.
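
The "update some flag" step could look roughly like the sketch below, assuming a boto3 `Table` resource. The function name `touch_all_items` and the attribute name `reindexed_at` are made up for illustration. Note that DynamoDB Streams does not write a record for an update that leaves the item unchanged, so the flag value has to differ from what is already stored; a timestamp is used here for that reason.

```python
import time


def touch_all_items(table, key_attrs, flag_attr="reindexed_at", flag_value=None):
    """Set `flag_attr` on every item so each write shows up in the stream.

    `key_attrs` lists the table's primary key attribute names.
    `flag_attr` is a hypothetical attribute name; pick one your items do not use.
    """
    if flag_value is None:
        flag_value = int(time.time())

    # Placeholders avoid clashes between key names and DynamoDB reserved words.
    names = {f"#k{i}": name for i, name in enumerate(key_attrs)}
    scan_kwargs = {
        "ProjectionExpression": ", ".join(names),  # read only the key attributes
        "ExpressionAttributeNames": names,
    }

    touched = 0
    while True:
        page = table.scan(**scan_kwargs)
        for item in page.get("Items", []):
            table.update_item(
                Key={name: item[name] for name in key_attrs},
                UpdateExpression="SET #flag = :value",
                ExpressionAttributeNames={"#flag": flag_attr},
                ExpressionAttributeValues={":value": flag_value},
            )
            touched += 1
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            break
        scan_kwargs["ExclusiveStartKey"] = last_key
    return touched


# Possible usage (table and key names are placeholders):
#   import boto3
#   table = boto3.resource("dynamodb").Table("my-table")
#   touch_all_items(table, ["id"])
```

Each update consumes write capacity, so on a large table it may be worth running this during a quiet period.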

3. A slight modification of approach 2: use a temporary table

Instead of changing items in the prod table, we can create a temporary table and enable a stream on it, reusing the pipeline from approach 2. Then we copy the items from the prod table to the temporary table. The data will flow through the existing pipeline and get indexed in ES.

Pros:

a. No prod data change is required, and this pipeline can be used for incremental changes as well.

b. Same as approach 2.

Cons:

a. Copying data from one table to another may take a lot of time, depending on the data size.

b. Copying data from one table to another is a one-time script, hence it has maintainability issues.
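
The copy step might be sketched as follows, assuming both tables are boto3 `Table` resources and that the temporary table was created with the same key schema and has its stream already wired to the pipeline. The function name `copy_items` is hypothetical. The `batch_writer` helper buffers the puts into batched write requests.

```python
def copy_items(source_table, target_table):
    """Copy every item from source_table into target_table; return the count."""
    copied = 0
    scan_kwargs = {}
    # batch_writer groups the puts into batch write requests and flushes on exit.
    with target_table.batch_writer() as batch:
        while True:
            page = source_table.scan(**scan_kwargs)
            for item in page.get("Items", []):
                batch.put_item(Item=item)
                copied += 1
            last_key = page.get("LastEvaluatedKey")
            if last_key is None:
                break
            scan_kwargs["ExclusiveStartKey"] = last_key
    return copied


# Possible usage (table names are placeholders):
#   import boto3
#   dynamodb = boto3.resource("dynamodb")
#   copy_items(dynamodb.Table("prod-table"), dynamodb.Table("temp-table"))
```

The prod table is only read here, which is the point of this approach; the temporary table can be deleted once the index has been populated.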

Feel free to edit, or suggest other approaches in the comments.

195

answered Sep 20 '22 13:09

best wishes


Tags:

elasticsearch

amazon-dynamodb

amazon-dynamodb-streams

A. Bimer
