 

MongoDB dump (filtering documents and fields)


I want to make a partial dump of a MongoDB database (partial as in, I need to filter out some documents and some fields). This dump will then be imported on another server.

I cannot use the mongodump utility, as it doesn't allow filtering fields.
I could use the mongoexport utility, as it allows filtering both documents and fields. However, the documentation states that mongoexport can only output JSON, and:

does not reliably preserve all rich BSON data types, because JSON can only represent a subset of the types supported by BSON.

  1. I find this statement a bit vague, and I don't entirely understand it. So, what happens exactly if I dump my database as JSON? What risks do I run? Do I risk losing some documents?
  2. If you think I should absolutely avoid using mongoexport in production, can I write my own Node.js application to do the filtering and output a dump in BSON? Or would that not be possible?
MikiTesi asked Nov 25 '18

1 Answer

This is possible using Views, without resorting to a low-level implementation that reads and writes the BSON content yourself. There are also options which do in fact preserve type information even when using JSON formats, and you don't even need a "View" for that.

Using Views with mongodump

The basic premise is to create a View which only returns the content you want. A View can be the result of any aggregation pipeline expression.

For example, given a simple document in a collection:

db.test.insert({ "a": 1, "b": 2, "c": 3 })

You can create a View on that collection containing just the wanted fields:

db.createView("testView", "test", [{ "$project": { "a": 1, "b": 1 } }])
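
Querying the View should then return only the projected fields, which makes a quick sanity check before dumping (the _id value shown is illustrative):

db.testView.find()

{ "_id" : ObjectId("5bfb3e0eadd1d8af906ad140"), "a" : 1, "b" : 2 }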

Then, exiting the mongo shell, you can access the View from mongodump using the --viewsAsCollections option:

mongodump --db test --collection testView --viewsAsCollections

This exports just the named "collection" (actually a View). The --viewsAsCollections option means that instead of returning the view definition (essentially the aggregation pipeline), mongodump returns the results, just as if it were a real collection.

The resulting BSON content (written by default under dump/test/) can then be loaded via mongorestore:

mongorestore --db other --collection test dump/test/testView.bson

The content from the BSON dump is then written into the target database on the host you are connecting to, under the specified collection name:

use other
db.test.find()

{ "_id" : ObjectId("5bfb3e0eadd1d8af906ad140"), "a" : 1, "b" : 2 }

Note also that since a View can be the result of any aggregation pipeline, $match stages can filter documents, and you can transform or even genuinely "aggregate" the data however you want.
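
For example, here is a sketch of a View that also filters documents before projecting (the $match predicate is just a placeholder for whatever condition you need):

db.createView("filteredView", "test", [
  { "$match": { "a": { "$gte": 1 } } },
  { "$project": { "a": 1, "b": 1 } }
])

The same mongodump --viewsAsCollections command then works against "filteredView".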

Using Views or --fields with mongoexport

In much the same way, the mongoexport utility can also access the content from a View.

Despite this not being "strict BSON", there is a standard with MongoDB which does in fact preserve the data types. This is covered in the documentation under MongoDB Extended JSON.

So this is NOT a binary format, and as JSON it does take considerably more storage space, but the necessary type information is indeed there.

For example:

db.mixed.insert({
  "a": NumberLong(1),
  "b": NumberDecimal("123.45"),
  "c": new Date(),
  "d":  "unwanted"
})

Which would appear in the mongo shell as:

{
        "_id" : ObjectId("5bfb428790b2b4e4241a015c"),
        "a" : NumberLong(1),
        "b" : NumberDecimal("123.45"),
        "c" : ISODate("2018-11-26T00:47:03.033Z"),
        "d" : "unwanted"
}

You can still set up a View:

db.createView("mixedView", "mixed", [{ "$project": { "a": 1, "b": 1, "c": 1 } }])

And the export will just pick up the data:

mongoexport --db test --collection mixedView > out.json

{
        "_id": {
                "$oid": "5bfb428790b2b4e4241a015c"
        },
        "a": {
                "$numberLong": "1"
        },
        "b": {
                "$numberDecimal": "123.45"
        },
        "c": {
                "$date": "2018-11-26T00:47:03.033Z"
        }
}

Or the same thing on the original collection, just using --fields for selection:

mongoexport --db test --collection mixed --fields a,b,c > out.json

This gives exactly the same output. The only restriction is that --query only supports a regular query expression, as given to find() or similar. That is not as flexible as a View, but it can handle rudimentary filtering for most needs.
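
For instance, a sketch combining --fields with --query (the predicate is a placeholder, and depending on the mongoexport version the query document may need to be written in Extended JSON):

mongoexport --db test --collection mixed --fields a,b,c --query '{ "a": { "$exists": true } }' > out.json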

The Extended JSON format is recognized by mongoimport, and parser implementations are also available for many other languages. As the content is read, it is inserted into the target collection with the "type" information preserved:

mongoimport --db other --collection mixed out.json

And then viewing the data:

use other
db.mixed.findOne()
{
        "_id" : ObjectId("5bfb428790b2b4e4241a015c"),
        "a" : NumberLong(1),
        "b" : NumberDecimal("123.45"),
        "c" : ISODate("2018-11-26T00:47:03.033Z")
}

So it is possible, and the Extended JSON format exists precisely for data interchange in situations where sending binary content may not be viable or even desirable, but maintaining the "type" information still is.
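
On the second part of your question: rather than hand-rolling a BSON reader/writer in Node.js, you could lean on the bson npm package (the serializer used by the official driver), which ships an EJSON helper for exactly this round trip. A minimal sketch, with illustrative values:

// Assumes: npm install bson
const { EJSON, Long, Decimal128 } = require("bson");

// A document using rich BSON types
const doc = {
  a: Long.fromNumber(1),
  b: Decimal128.fromString("123.45"),
  c: new Date()
};

// Canonical Extended JSON keeps the types, e.g.
// {"a":{"$numberLong":"1"},"b":{"$numberDecimal":"123.45"},...}
const json = EJSON.stringify(doc, { relaxed: false });

// Parsing restores the original BSON types
const restored = EJSON.parse(json, { relaxed: false });

Each line of such output can be fed to mongoimport, or inserted directly with the driver.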


Overall, there are many options you can use without needing to resort to reading and writing binary BSON formats, or any other complex binary format, to store the data in between transfers.

As a note on the "vague" passage you quoted: the actual supported BSON types are listed on the Extended JSON page of the documentation. You can even compare this to the BSON Specification to see that, despite the cautious wording, the common types of data you will actually use are all supported. Whilst some external implementations of that spec may not handle ALL of them, the bundled utilities such as mongoexport and mongoimport are indeed compliant.

answered by Neil Lunn