I set up Elasticsearch Service and a DynamoDB stream as described in this blog post. Now I need to add pre-existing data from DynamoDB to Elasticsearch.
I saw the "Indexing pre-existing content" part of the article, but I don't know what to do with that Python code or where to execute it.
What is the best option in this case for adding pre-existing data?
Populating Elasticsearch with existing items is not straightforward, because a DynamoDB stream only captures item changes, not records that already exist.
Here are a few approaches with their pros and cons:
Scan all the existing items from DynamoDB and send them to Elasticsearch
We can scan all the existing items with a Python script hosted on an EC2 machine and send the data to ES.
Pros:
a. Simple solution; nothing much is required.
Cons:
a. It cannot be run in a Lambda function, since the job may time out if there are too many records.
b. This approach is a one-time job and cannot be used for incremental changes (say, if we want to keep updating ES as the DynamoDB data changes).
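A minimal sketch of what such a one-time script could look like, assuming the table is read with boto3 and the ES domain accepts unsigned requests from the machine running the script (for example through an IP-based access policy). The table name, endpoint, index name and id field are hypothetical placeholders, not values from the blog post. The bulk action format shown is the one used by Elasticsearch 7 and later; older versions also expect a `_type` in the action line.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with your own values.
TABLE_NAME = "my-table"
ES_ENDPOINT = "https://my-es-domain.example.com"
INDEX_NAME = "my-index"
ID_FIELD = "id"
BATCH_SIZE = 500


def scan_all_items(scan_page):
    """Yield every item, following LastEvaluatedKey until the scan is done.

    scan_page is any callable with the keyword interface of boto3's
    Table.scan (for example, table.scan itself).
    """
    kwargs = {}
    while True:
        page = scan_page(**kwargs)
        yield from page.get("Items", [])
        last_key = page.get("LastEvaluatedKey")
        if not last_key:
            break
        kwargs["ExclusiveStartKey"] = last_key


def to_bulk_body(items, index, id_field):
    """Build an Elasticsearch _bulk request body (newline-delimited JSON)."""
    lines = []
    for item in items:
        action = {"index": {"_index": index, "_id": str(item[id_field])}}
        lines.append(json.dumps(action))
        # default=str turns values json cannot encode (such as the Decimal
        # numbers boto3 returns) into strings; adjust if you need real numbers.
        lines.append(json.dumps(item, default=str))
    return "\n".join(lines) + "\n"


def chunked(iterable, size):
    """Group an iterable into lists of at most `size` elements."""
    batch = []
    for element in iterable:
        batch.append(element)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def main():
    import boto3  # third-party (pip install boto3); imported lazily

    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    for batch in chunked(scan_all_items(table.scan), BATCH_SIZE):
        request = urllib.request.Request(
            ES_ENDPOINT + "/_bulk",
            data=to_bulk_body(batch, INDEX_NAME, ID_FIELD).encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
            method="POST",
        )
        with urllib.request.urlopen(request) as response:
            response.read()  # a real script should inspect the "errors" field
```

To run it, save the file on the EC2 machine (or any machine with AWS credentials and network access to the ES endpoint) and call `main()`. If the domain requires signed requests, the HTTP call would need to be replaced with a signing client.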
Use DynamoDB Streams
We can enable DynamoDB Streams and build the pipeline as explained here. We can then update a flag on each existing item so that every record flows through the pipeline and the data goes to ES.
Pros:
a. The pipeline can be used for incremental DynamoDB changes.
b. No code duplication or one-time effort. Every time we need to update an item in ES, we update the item in DynamoDB and it gets indexed in ES.
c. No redundant, untested, one-time code. (Maintaining such code is a huge issue in the software world.)
Cons:
a. Changing prod data can be dangerous and may not be allowed, depending on the use case.
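As a rough sketch of the "update some flag" step, assuming boto3's `Table.update_item` is used: the helper below builds the update for one item and applies it to a list of keys. The attribute name `reindexed_at` is a hypothetical choice. A timestamp is used as the flag value because DynamoDB Streams does not emit a record for a write that leaves the item unchanged, so the value has to differ from what is already stored.

```python
import time


def touch_params(key, flag_name="reindexed_at", now=None):
    """Build keyword arguments for boto3's Table.update_item.

    Setting the flag to the current time in milliseconds makes sure the
    item really changes, so a stream record is produced.
    """
    stamp = int(time.time() * 1000) if now is None else now
    return {
        "Key": key,
        "UpdateExpression": "SET #flag = :stamp",
        "ExpressionAttributeNames": {"#flag": flag_name},
        "ExpressionAttributeValues": {":stamp": stamp},
    }


def touch_all(table, keys, flag_name="reindexed_at"):
    """Update the flag on every item identified by `keys`.

    `table` is a boto3 Table resource (or any object with a compatible
    update_item method); `keys` is an iterable of primary-key dicts.
    Returns the number of items touched.
    """
    count = 0
    for key in keys:
        table.update_item(**touch_params(key, flag_name))
        count += 1
    return count
```

The keys themselves could come from a paginated scan of the table that projects only the key attributes. Note that every touch is a write, so it consumes write capacity on the prod table.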
This is a slight modification of the approach above
Instead of changing items in the prod table, we can create a temporary table and enable a stream on it, reusing the pipeline from the second approach (DynamoDB Streams). We then copy the items from the prod table to the temporary table; the data flows through the existing pipeline and gets indexed in ES.
Pros:
a. No prod data change is required, and this pipeline can be used for incremental changes as well.
b. Same as the second approach.
Cons:
a. Copying data from one table to another may take a long time, depending on the data size.
b. The copy is a one-time script and hence has maintainability issues.
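The copy step could be sketched as below, assuming boto3 Table resources on both sides and a temporary table created with the same key schema as the prod table. boto3's `batch_writer()` groups the puts into batch write requests and resends unprocessed items, so the script only has to page through the source scan.

```python
def copy_items(source_scan, dest_table):
    """Copy every item from a paginated scan into dest_table.

    source_scan is any callable with the keyword interface of boto3's
    Table.scan (for example, prod_table.scan); dest_table is a boto3
    Table resource or any object offering a compatible batch_writer().
    Returns the number of items written.
    """
    copied = 0
    kwargs = {}
    with dest_table.batch_writer() as writer:
        while True:
            page = source_scan(**kwargs)
            for item in page.get("Items", []):
                writer.put_item(Item=item)
                copied += 1
            last_key = page.get("LastEvaluatedKey")
            if not last_key:
                break
            # Continue the scan where the previous page stopped.
            kwargs["ExclusiveStartKey"] = last_key
    return copied
```

With real tables this would be called as `copy_items(prod_table.scan, temp_table)`, where both names are hypothetical variables holding `boto3.resource("dynamodb").Table(...)` objects. Each put on the temporary table produces a stream record, which the existing pipeline then indexes.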
Feel free to edit, or suggest other approaches in the comments.