MongoDB into AWS Redshift

Tags:

mongodb

amazon-web-services

hadoop

amazon-redshift

We've got a pretty big MongoDB instance with sharded collections. It's reached the point where it's becoming too expensive to rely on MongoDB's query capabilities (including the aggregation framework) for insight into the data.

I've looked around for options to make the data available and easier to consume, and have settled on two promising options:

1. AWS Redshift
2. Hadoop + Hive

We want to be able to use a SQL-like syntax to analyze our data, and we want close to real-time access to it (a few minutes of latency is fine, we just don't want to wait for the whole MongoDB to sync overnight).

As far as I can gather, for option 2 one can use https://github.com/mongodb/mongo-hadoop to move data from MongoDB to a Hadoop cluster.

I've looked high and low, but I'm struggling to find a similar solution for getting MongoDB data into AWS Redshift. From reading Amazon's articles, it seems the correct way to go about it is to use AWS Kinesis to get the data into Redshift. That said, I can't find any example of someone who has done something similar, and I can't find any libraries or connectors that move data from MongoDB into a Kinesis stream. At least nothing that looks promising.

Has anyone done something like this?


asked Oct 24 '14 10:10

hendrikswan

2 Answers

I ended up coding our own migrator using NodeJS. I got a bit irritated with answers explaining what Redshift and MongoDB are, so I decided to take the time to share what I had to do in the end.

Timestamped data

Basically, we ensure that every MongoDB collection we want to migrate to a table in Redshift is timestamped, and indexed on that timestamp.

Plugins returning cursors

We then code a plugin for each migration we want to do from a Mongo collection to a Redshift table. Each plugin returns a cursor that takes the last migrated date into account (passed to it by the migrator engine) and only returns the data that has changed since the last successful migration for that plugin.
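A minimal sketch of what such a plugin could look like. The plugin shape, the `updated_at` field name and the `employees` collection are assumptions for illustration; `collection().find().sort()` are the usual calls on the official `mongodb` Node.js driver.

```javascript
// Hypothetical plugin for an "employees" collection. The plugin shape and
// the `updated_at` field name are assumptions, not the original code.
const employeesPlugin = {
  table: "employees",

  // Build the MongoDB filter for the delta. With no previous migration
  // (lastMigrated is null) the filter is empty, so everything is exported.
  // $gte may re-export a record that sits exactly on the boundary, which is
  // harmless when the load replaces rows by id.
  buildFilter(lastMigrated) {
    return lastMigrated ? { updated_at: { $gte: lastMigrated } } : {};
  },

  // `db` is a connected Db handle from the `mongodb` driver. find() returns
  // a cursor; filtering and sorting on the indexed timestamp lets the query
  // use that index.
  cursor(db, lastMigrated) {
    return db
      .collection("employees")
      .find(this.buildFilter(lastMigrated))
      .sort({ updated_at: 1 });
  },
};
```

The engine would call `employeesPlugin.cursor(db, lastMigrated)` and iterate over whatever comes back.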

How the cursors are used

The migrator engine then loops through each record in this cursor. For each record it calls back to the plugin to transform the document into an array, which the migrator turns into a delimited line and streams to a file on disk. We delimit this file with tabs, as our data contained a lot of commas and pipes.
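To illustrate the transform step, here is a rough sketch of turning a document into a tab-delimited line. The function names and document fields are made up for the example, and the escaping assumes the file is later loaded with the ESCAPE option of COPY and that empty fields are acceptable as NULLs for your columns.

```javascript
// Escape one value for a tab-delimited export file.
function escapeField(value) {
  // Assumption: null/undefined become an empty field.
  if (value === null || value === undefined) return "";
  // Assumption: ISO 8601 dates match what the load step is configured to parse.
  const text = value instanceof Date ? value.toISOString() : String(value);
  // Escape backslashes first, so the backslashes added for tabs and
  // line breaks are not escaped a second time.
  return text.replace(/\\/g, "\\\\").replace(/[\t\n\r]/g, (ch) => "\\" + ch);
}

// Join the array returned by a plugin's transform into one line of the file.
function toDelimitedLine(values) {
  return values.map(escapeField).join("\t") + "\n";
}

// Hypothetical transform for an employees document: column order must match
// the column order of the target table.
function transformEmployee(doc) {
  return [String(doc._id), doc.name, doc.updated_at, Boolean(doc.is_deleted)];
}
```

The engine would write `toDelimitedLine(transformEmployee(doc))` to the export file for every document in the cursor.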

Delimited exports from S3 into a table on Redshift

The migrator then uploads the delimited file to S3 and runs the Redshift COPY command to load the file from S3 into a temp table. The temp table's name is derived from the table name in the plugin configuration, using a naming convention that marks it as temporary.

So, for example, if I had a plugin configured with a table name of employees, it would create a temp table named temp_employees.
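As a sketch, the COPY statement for that load could be built like this. The bucket, key and role ARN are placeholders, and using IAM_ROLE for authorization is an assumption (COPY also accepts other credential forms); DELIMITER and ESCAPE are standard COPY options.

```javascript
// Quote a value as a SQL string literal.
function quoteLiteral(text) {
  return "'" + String(text).replace(/'/g, "''") + "'";
}

// Naming convention from the example above: employees -> temp_employees.
function tempTableName(table) {
  return "temp_" + table;
}

// Build the COPY statement that loads the uploaded file into the temp table.
// `bucket`, `key` and `iamRole` are hypothetical configuration values.
function buildCopyStatement({ table, bucket, key, iamRole }) {
  return [
    `COPY ${tempTableName(table)}`,
    `FROM ${quoteLiteral(`s3://${bucket}/${key}`)}`,
    `IAM_ROLE ${quoteLiteral(iamRole)}`,
    "DELIMITER '\\t'", // the export file is tab-delimited
    "ESCAPE", // only correct if the export backslash-escapes tabs and line breaks
  ].join("\n") + ";";
}
```

The resulting string can be sent to Redshift through any PostgreSQL-compatible client.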

Now we've got data in this temp table, and the records in it carry the ids from the originating MongoDB collection. This allows us to run a delete against the target table (in our example, the employees table) for every id that is present in the temp table. If either table doesn't exist, it is created on the fly, based on a schema provided by the plugin. We then insert all the records from the temp table into the target table. This caters for both new records and updated records. We only do soft deletes on our data, so a deleted record simply arrives as an update with its is_deleted flag set in Redshift.
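The delete-then-insert step can be sketched as a short list of SQL statements. This is an illustration under assumptions: the id column is called `id`, both tables have identical columns, and the temp table is dropped afterwards (I haven't described above how ours is cleaned up).

```javascript
// Build the statements that replace changed rows in the target table with
// the rows loaded into the temp table. Table and column names are assumed
// to come from trusted plugin configuration, not from user input.
function buildMergeStatements(table, idColumn = "id") {
  const temp = `temp_${table}`;
  return [
    "BEGIN",
    // Remove every target row that has a newer version in the temp table.
    `DELETE FROM ${table} USING ${temp} WHERE ${table}.${idColumn} = ${temp}.${idColumn}`,
    // Insert new and updated rows alike.
    `INSERT INTO ${table} SELECT * FROM ${temp}`,
    "COMMIT",
    // Assumption: the temp table is dropped so the next run starts clean.
    `DROP TABLE ${temp}`,
  ];
}
```

Running the delete and insert inside one transaction means readers never see the target table with the updated rows missing.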

Once this whole process is done, the migrator engine stores a timestamp for the plugin in a Redshift table, to keep track of when the migration last ran successfully for it. This value is passed to the plugin the next time the engine decides it should migrate data, so that the plugin can use the timestamp in the cursor it provides to the engine.

So in summary, each plugin/migration provides the following to the engine:

A cursor, which optionally uses the last migrated date passed to it by the engine, to ensure that only deltas are moved across.
A transform function, which the engine uses to turn each document in the cursor into a delimited string that is appended to an export file.
A schema file: a SQL file containing the schema for the table in Redshift.
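Put together, one run of the engine for a single plugin could look roughly like the sketch below. All the dependency names (`readLastMigrated`, `appendRow`, `loadIntoRedshift`, `saveLastMigrated`, `now`) are hypothetical and injected so the flow is visible. Recording the start time of the run as the new watermark is an assumption: it avoids missing documents that change while the export is running.

```javascript
// One migration run for one plugin. `deps` bundles hypothetical helpers for
// the file, S3 and Redshift work described above.
async function runMigration(plugin, deps) {
  // Take the watermark before reading, so changes made during the run are
  // picked up again by the next run instead of being skipped.
  const startedAt = deps.now();
  const lastMigrated = await deps.readLastMigrated(plugin.table);

  let exported = 0;
  // The cursor only needs to be async-iterable, which MongoDB driver
  // cursors are.
  for await (const doc of plugin.cursor(lastMigrated)) {
    await deps.appendRow(plugin.transform(doc));
    exported += 1;
  }

  // Skip the upload and load entirely when nothing changed.
  if (exported > 0) {
    await deps.loadIntoRedshift(plugin.table);
  }

  // Only reached when every step above succeeded.
  await deps.saveLastMigrated(plugin.table, startedAt);
  return exported;
}
```

If any step throws, the watermark is not advanced, so the next run retries the same delta.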

answered Oct 08 '22 15:10

hendrikswan

Redshift is a data warehousing product and MongoDB is a NoSQL database. Clearly, they are not replacements for each other; they can coexist and serve different purposes. The question is how to save and update records in both places. You can move all the MongoDB data to Redshift as a one-time activity. Redshift is not a good fit for real-time writes. For near-real-time sync to Redshift, you should modify the program that writes into MongoDB so that it also writes to S3 locations. The movement from S3 to Redshift can then be done at regular intervals.

answered Oct 08 '22 15:10

kartik
