MongoDB to BigQuery

1 Answer

In my opinion, the best practice is to build your own extractor. You can write it in the language of your choice and extract to CSV or JSON.

But if you are looking for a quick way, and your data is not huge and fits on one server, then I recommend mongoexport. Let's assume you have a simple document structure such as the one below:

{
    "_id" : "tdfMXH0En5of2rZXSQ2wpzVhZ",
    "statuses" : [ 
        {
            "status" : "dc9e5511-466c-4146-888a-574918cc2534",
            "score" : 53.24388894
        }
    ],
    "stored_at" : ISODate("2017-04-12T07:04:23.545Z")
}

Then you need to define your BigQuery schema (mongodb_schema.json), for example:

cat > mongodb_schema.json <<EOF
[
    { "name":"_id", "type": "STRING" },
    { "name":"stored_at", "type": "RECORD", "fields": [
        { "name":"date", "type": "STRING" }
    ]},
    { "name":"statuses", "type": "RECORD", "mode": "REPEATED", "fields": [
        { "name":"status", "type": "STRING" },
        { "name":"score", "type": "FLOAT" }
    ]}
]
EOF

Now the fun part starts :-) Extract the data as JSON from your MongoDB. Let's assume you have a cluster with a replica set named statuses, your db is sample, and your collection is status.

mongoexport \
    --host statuses/db-01:27017,db-02:27017,db-03:27017 \
    -vv \
    --db "sample" \
    --collection "status" \
    --type "json" \
    --limit 100000 \
    --out ~/sample.json

As you can see above, I limit the output to 100k records, because I recommend running a sample and loading it into BigQuery before doing the same for all your data. After running the command above you should have your sample data in sample.json, BUT it contains a $date field, which will cause an error when loading into BigQuery because $ is not allowed in BigQuery field names. To fix that, we can use sed to replace it with a plain field name:

# Fix Date field to make it compatible with BQ
sed -i 's/"\$date"/"date"/g' sample.json
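
To see what the substitution does, here is a minimal sketch on a single line. The line is hand-written in the shape mongoexport emits for the sample document above, so it is an illustration and not real export output; the variable names are mine as well.

```shell
# One line shaped like mongoexport output for the sample document (hand-written for illustration)
line='{"_id":"tdfMXH0En5of2rZXSQ2wpzVhZ","statuses":[{"status":"dc9e5511-466c-4146-888a-574918cc2534","score":53.24388894}],"stored_at":{"$date":"2017-04-12T07:04:23.545Z"}}'

# Same substitution as above, applied to a stream instead of editing a file in place
fixed=$(printf '%s\n' "$line" | sed 's/"\$date"/"date"/g')
printf '%s\n' "$fixed"

# Count lines that still contain a "$date" key (grep -c prints 0 when nothing matches)
remaining=$(printf '%s\n' "$fixed" | grep -c '"\$date"' || true)
echo "lines still containing \$date: $remaining"
```

After the fix, stored_at is still a nested object with a single date key, which is why the schema declares stored_at as a record containing a date field.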

Now you can compress the file, upload it to Google Cloud Storage (GCS), and then load it into BigQuery with the following commands:

# Compress for faster load
gzip sample.json

# Move to GCloud
gsutil mv ./sample.json.gz gs://your-bucket/sample/sample.json.gz

# Load to BQ
bq load \
    --source_format=NEWLINE_DELIMITED_JSON \
    --max_bad_records=999999 \
    --ignore_unknown_values=true \
    --encoding=UTF-8 \
    --replace \
    "YOUR_DATASET.mongodb_sample" \
    "gs://your-bucket/sample/*.json.gz" \
    "mongodb_schema.json"

If everything went well, go back, remove --limit 100000 from the mongoexport command, and re-run the commands above to load everything instead of the 100k sample.

ALTERNATIVE SOLUTION:

If you want more flexibility and performance is not a concern, you can use the mongo CLI tool instead. This way you can write your extract logic in JavaScript, execute it against your data, and send the output to BigQuery. Here is what I did for the same process, using JavaScript to output CSV so I could load it into BigQuery more easily:

# Export Logic in JavaScript
cat > export-csv.js <<EOF
var size = 100000;   // records per page
var maxCount = 1;    // number of pages to export
for (var x = 0; x < maxCount; x = x + 1) {
    var recToSkip = x * size;
    // Replace "entities" with your own collection name
    db.entities.find().skip(recToSkip).limit(size).forEach(function (record) {
        var row = record._id + "," + record.stored_at.toISOString();
        // Emit one CSV row per entry in the statuses array
        record.statuses.forEach(function (l) {
            print(row + "," + l.status + "," + l.score);
        });
    });
}
EOF

# Execute on Mongo CLI
_MONGO_HOSTS="db-01:27017,db-02:27017,db-03:27017/sample?replicaSet=statuses"
mongo --quiet \
    "${_MONGO_HOSTS}" \
    export-csv.js \
    | split -l 500000 --filter='gzip > $FILE.csv.gz' - sample_

# Move all split files to Google Cloud Storage
gsutil -m mv ./sample_* gs://your-bucket/sample/

# Load files to BigQuery
bq load \
    --source_format=CSV \
    --max_bad_records=999999 \
    --ignore_unknown_values=true \
    --encoding=UTF-8 \
    --replace \
    "YOUR_DATASET.mongodb_sample" \
    "gs://your-bucket/sample/sample_*.csv.gz" \
    "ID,StoredDate:DATETIME,Status,Score:FLOAT"

TIP: In the script above I used a small trick: piping the output through split breaks it into multiple files with the sample_ prefix. split also gzips each chunk as it is written, so the files are easier to upload to GCS.
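
If you want to try the split trick without a MongoDB at hand, here is a small sketch that feeds five made-up CSV rows through the same kind of pipeline. The rows, the scratch directory, and the chunk size of 2 are assumptions for the demo only, and --filter needs GNU split (coreutils), so it may not work with the BSD/macOS split.

```shell
# Work in a scratch directory so the demo files do not mix with real exports
workdir=$(mktemp -d)
cd "$workdir"

# Five fake CSV rows stand in for the mongo shell output
printf 'id-%s,2017-04-12T07:04:23.545Z,status-%s,1.5\n' 1 1 2 2 3 3 4 4 5 5 \
    | split -l 2 --filter='gzip > $FILE.csv.gz' - sample_

# split sets $FILE to each chunk name, so this lists sample_aa, sample_ab and sample_ac
ls sample_*.csv.gz

# The row count across all chunks should match the input
total=$(gzip -cd sample_*.csv.gz | wc -l)
echo "rows: $total"
```

The single quotes around the filter matter: they stop your own shell from expanding $FILE, so split can set it for every chunk.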

194

answered Sep 30 '22 01:09

Qorbani
