
MongoDB to Redshift

We have a few collections in MongoDB that we wish to transfer to Redshift (on an automated, incremental daily basis). How can we do it? Should we export from Mongo to CSV?

asked Jan 28 '14 by AJ222

2 Answers

I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for queries. So first we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.

To store JSON data in Redshift, you first need to create SQL DDL that defines the schema in Redshift, i.e. a CREATE TABLE script.

You can use a tool like Variety to help, as it can give you some insight into your Mongo schema. However, it does struggle with big datasets - you might need to subsample your dataset.

Alternatively, DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (the dataset I was dealing with was 120 GB).

So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through DDLgenerator to get a DDL.

In practice I found the JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:

   mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
   head data.json > sample.json
   ddlgenerator postgresql sample.json       

Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
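If ddlgenerator struggles with the full export, a rough alternative is to stream the export yourself and infer a column type per field across every document. This is a minimal sketch, not what I actually ran; the type-mapping rules and table name are my own assumptions:

```python
import json

def infer_ddl(docs, table_name="my_collection"):
    """Infer a crude Redshift CREATE TABLE from an iterable of dicts.

    Widens types across documents: if a field is sometimes an int and
    sometimes a string, it falls back to VARCHAR.
    """
    sql_types = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE PRECISION",
                 str: "VARCHAR(65535)"}
    columns = {}
    for doc in docs:
        for key, value in doc.items():
            # Nested documents, arrays and nulls all fall back to text.
            t = sql_types.get(type(value), "VARCHAR(65535)")
            if columns.get(key, t) != t:
                t = "VARCHAR(65535)"  # conflicting types seen: widen to text
            columns[key] = t
    cols = ",\n  ".join(f'"{k}" {v}' for k, v in sorted(columns.items()))
    return f"CREATE TABLE {table_name} (\n  {cols}\n);"

# To scan a whole export without loading it into memory, use mongoexport's
# default line-delimited output (no --jsonArray) and stream it:
# with open("data.json") as f:
#     print(infer_ddl(json.loads(line) for line in f))
```

Because it streams one document at a time, this handles schema variation across the whole collection without needing the 120 GB in memory.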

Next you upload the data into Redshift.

If you have exported JSON, you need to use Redshift's COPY from JSON feature. You need to define a JSONPaths file to do this.
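A JSONPaths file is itself just a JSON document listing one path expression per target column, in column order. A sketch of generating one from a field list (the field, table and bucket names here are hypothetical):

```python
import json

def jsonpaths_for(fields):
    """Build a Redshift JSONPaths document for top-level fields."""
    return json.dumps({"jsonpaths": [f"$['{f}']" for f in fields]}, indent=2)

paths = jsonpaths_for(["_id", "user_id", "event_type", "occurred_at"])

# Upload the result to S3 and reference it from the COPY command, e.g.:
#   COPY events FROM 's3://my-bucket/data.json'
#   CREDENTIALS '...'
#   FORMAT AS JSON 's3://my-bucket/events.jsonpaths';
```

The order of entries in "jsonpaths" must match the column order of the target table.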

For more information check out the Snowplow blog - they use JSONPaths to map the JSON onto a relational schema. See their blog post about why people might want to load JSON into Redshift.

Turning the JSON into columns allows much faster queries than alternatives such as Redshift's JSON_EXTRACT_PATH_TEXT function.

For incremental backups, it depends on whether data is being added or data is changing. For analytics it's normally the former. The approach I used is to export the analytics data once a day, then copy it into Redshift incrementally.
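A sketch of that daily pattern, assuming each document carries a timestamp field (here `created_at`) that bounds each day's slice; the field, database and bucket names are illustrative, not from my actual pipeline:

```python
from datetime import date, timedelta

def daily_export_cmd(db, collection, day=None):
    """Build a mongoexport command and S3 key for one day's new documents."""
    day = day or date.today() - timedelta(days=1)   # default: yesterday
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    # Mongo extended-JSON query selecting documents created within the day.
    query = ('{"created_at": {"$gte": {"$date": "%sT00:00:00Z"}, '
             '"$lt": {"$date": "%sT00:00:00Z"}}}' % (start, end))
    cmd = (f"mongoexport --db {db} --collection {collection} "
           f"--query '{query}' > daily.json")
    s3_key = f"s3://my-bucket/{collection}/dt={start}/data.json"
    return cmd, s3_key
```

Upload each day's file under its dated S3 prefix and run COPY against just that prefix, so every daily load only touches new rows.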

Here are some related resources although in the end I did not use them:

  • Spotify has an open-source project called Luigi - this code claims to upload JSON to Redshift but I haven't used it, so I don't know if it works.
  • Amiato have a web page that says they offer a commercial solution for loading JSON data into Redshift - but there is not much information beyond that.
  • This blog post discusses performing ETL on JSON datasources such as Mixpanel into Redshift.
  • Related Reddit question
  • Blog post about dealing with JSON arrays in Redshift
answered Sep 30 '22 by Mark Butler


AWS Database Migration Service (DMS) has added support for MongoDB and Amazon DynamoDB, so I think the best option to migrate from MongoDB to Redshift is now DMS.

  • MongoDB versions 2.6.x and 3.x as a database source
  • Document Mode and Table Mode supported
  • Supports change data capture (CDC)

Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
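A DMS task needs a source endpoint (MongoDB), a target endpoint (Redshift) and a table-mapping document selecting which collections to replicate. A minimal selection-rule sketch, where the database and collection names are placeholders:

```json
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-events",
      "object-locator": { "schema-name": "mydb", "table-name": "events" },
      "rule-action": "include"
    }
  ]
}
```

In document mode DMS lands each document as a single JSON value; in table mode it flattens top-level fields into columns, which usually fits Redshift better.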

answered Sep 30 '22 by faisal_kk