How to add many records to MongoDB from a directory of JSON files?

I have about a million JSON files saved across many sub-directories of the directory "D:/njs/nodetest1/imports/source1/" and I want to import them into the collection "users" in my MongoDB database.

The following code correctly traverses the file system. As you can see, it reads each item in the directory and, if that item is a directory, recurses into it. For each item that is not a directory it performs some operations and then passes a variable holding the file path to another function.

var fs = require('fs');

function traverseFS(path) {
    var files = fs.readdirSync(path);
    for (var i in files) {
        var currentFile = path + '/' + files[i];
        var stats = fs.statSync(currentFile);
        if (stats.isFile())
            runOnFile(currentFile);      // plain file: process it
        else
            traverseFS(currentFile);     // directory: recurse into it
    }
}
traverseFS("D:/njs/nodetest1/imports/source1/");

Next, I run a few operations on each file (see below). This reads the file, parses it into a JSON object, reads two attributes of that object into variables, creates an object in the variable "entry" and passes that variable to another function.

function runOnFile(currentFile){
    var fileText = fs.readFileSync(currentFile,'utf8');
    var generatedJSON = JSON.parse(fileText);
    var recordID = generatedJSON.recordID;
    var recordText = generatedJSON.recordTexts;
    var entry = {recordID:recordID, recordText:recordText};
    insertRecord(entry);
}

The final function should then insert the data into MongoDB. I think this is where things go wrong.

function insertRecord(entry){
    var MongoClient = mongodb.MongoClient;   // mongodb = require('mongodb') at the top of the script
    var MongoURL = 'mongodb://localhost:27017/my_database_name';
    MongoClient.connect(MongoURL, function (err, db) {
        var collection = db.collection('users');
        collection.insert([entry], function (err, result) {
            db.close();
        });
    });
}

I expected this to run through the file structure, reading the JSON files into objects and then inserting those objects into my MongoDB database. Instead it inserts the first file into the database and then stops/hangs.

Notes:

  • I don't want to use mongoimport because I only want to insert some of the data from each of these files into my MongoDB database, not all of it. However, I am not tied to any aspect of this approach; if some other solution exists I am open to it.
  • This connects to the database just fine. For each item in the directory this successfully creates an "entry" object and passes it to the insertRecord function. In other words, the problem must be occurring in the insertRecord section. But it could, of course, be caused by something earlier in the process.
  • If I add error handling, no errors are produced. I have left the error handling out of this post because it clutters the readability of the code snippets.
asked Dec 06 '16 by COMisHARD


3 Answers

As per the mongodb driver 2.2 (current latest) documentation, insert is deprecated:

DEPRECATED

Use insertOne, insertMany or bulkWrite

So the short answer is probably to change collection.insert([entry], ...) to collection.insertOne(entry, ...) and you're done.
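
Applied to the insertRecord function from the question, that change would look roughly like this (connection handling kept as in the original, even though it is addressed further down):

function insertRecord(entry){
    var MongoClient = mongodb.MongoClient;
    var MongoURL = 'mongodb://localhost:27017/my_database_name';
    MongoClient.connect(MongoURL, function (err, db) {
        var collection = db.collection('users');
        // insertOne takes the document itself rather than an array of documents
        collection.insertOne(entry, function (err, result) {
            db.close();
        });
    });
}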


Then for the long answer: you say "about a million JSON files", which typically deserves a fully async approach with the least amount of overhead.

There are two (potential) bottlenecks in the sample code:

  • fs.readFileSync, which is a blocking operation
  • connecting to the database, inserting a record and closing the connection again for every single file

Both are executed about a million times. Granted, an import is not usually done over and over again and (hopefully) not on a machine which needs its performance for other important tasks. Still, the sample code can easily be made more robust.

Consider using the glob module to obtain the list of JSON files:

glob('imports/**/*.json', function(error, files) {...})

This provides you with the full list of files easily in an async fashion.

Then consider connecting to the database just once, inserting everything, and closing once.

Maintaining more or less the same steps you have in the sample, I'd suggest something like:

var glob = require('glob'),
    mongodb = require('mongodb'),
    fs = require('fs'),
    MongoClient = mongodb.MongoClient,
    mongoDSN = 'mongodb://localhost:27017/my_database_name',
    collection; //  moved this to the "global" scope so we can do it only once

function insertRecord(json, done) {
    var recordID = json.recordID || null,
        recordText = json.recordText || null;

    //  the question implies some kind of validation/sanitation/preparation..
    if (recordID && recordText) {
        //  NOTE: insert was changed to insertOne
        return collection.insertOne({recordID: recordID, recordText: recordText}, done);
    }

    done('No recordID and/or recordText');
}

function runOnFile(file, done) {
    //  moved to be async
    fs.readFile(file, function(error, data) {
        if (error) {
            return done(error);
        }

        //  JSON.parse throws on malformed input, so guard it with try/catch
        var json = null;

        try {
            json = JSON.parse(data);
        } catch (parseError) {
            json = null;
        }

        if (!json) {
            return done('Unable to parse JSON: ' + file);
        }

        insertRecord(json, done);
    });
}

function processFiles(files, done) {
    var next = files.length ? files.shift() : null;

    if (next) {
        return runOnFile(next, function(error) {
            if (error) {
                console.error(error);
                // you may or may not want to stop here by throwing an Error
            }

            processFiles(files, done);
        });
    }

    done();
}

MongoClient.connect(mongoDSN, function(error, db) {
    if (error) {
        throw new Error(error);
    }

    collection = db.collection('users');

    glob('imports/**/*.json', function(error, files) {
        if (error) {
            throw new Error(error);
        }

        processFiles(files, function() {
            console.log('all done');
            db.close();
        });
    });
});

NOTE: You can collect multiple "entry" records and leverage the performance gain of multiple inserts using insertMany, though I have the feeling the inserted records are more complicated than described and that it might cause memory issues if not handled correctly.
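
A rough sketch of that batching idea, reusing the collection variable from the code above (the batch size of 1000 and the queueRecord/flushBuffer names are arbitrary choices, not something taken from the question):

//  Sketch only: buffer prepared entries and flush them with insertMany.
var BATCH_SIZE = 1000,
    buffer = [];

function queueRecord(entry, done) {
    buffer.push(entry);

    if (buffer.length >= BATCH_SIZE) {
        return flushBuffer(done);
    }

    done();
}

function flushBuffer(done) {
    if (!buffer.length) {
        return done();
    }

    var batch = buffer;
    buffer = [];

    //  insertMany accepts an array of documents
    collection.insertMany(batch, done);
}

To wire this in, insertRecord would call queueRecord instead of insertOne, and the final "all done" callback would call flushBuffer once more before db.close() so a partially filled buffer is not lost.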

answered Oct 25 '22 by Rogier Spieker


Just structure your data into one big array of objects, then run db.collection.insertMany.
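
A minimal sketch of that approach, assuming the "entry" objects from the question have already been collected into an array (the insertAll name and callback shape are illustrative, not from the original):

var mongodb = require('mongodb'),
    MongoClient = mongodb.MongoClient;

//  "entries" is the big array of objects built from the JSON files
function insertAll(entries, callback) {
    MongoClient.connect('mongodb://localhost:27017/my_database_name', function (err, db) {
        if (err) {
            return callback(err);
        }

        db.collection('users').insertMany(entries, function (err, result) {
            db.close();
            callback(err, result);
        });
    });
}

With about a million documents that single array can get large, so splitting it into smaller insertMany batches (as noted in the previous answer) may be safer.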

answered Oct 25 '22 by Giovanni Lobitos


I suggest doing this using Promises:

const Bluebird = require('bluebird');
const glob = Bluebird.promisify(require('glob'));
const mongodb = require('mongodb');
const fs = Bluebird.promisifyAll(require('fs'));
const Path = require('path');
const MongoClient = mongodb.MongoClient;

const insertMillionsFromPath = Bluebird.coroutine(function *(path, mongoConnString) {
    const db = yield MongoClient.connect(mongoConnString);
    try {
        const collection = db.collection('users');
        const files = yield glob(Path.join(path, "*.json"));
        yield Bluebird.map(
            files,
            Bluebird.coroutine(function *(filename) {
                console.log("reading", filename);
                const fileContent = yield fs.readFileAsync(filename);
                const obj = JSON.parse(fileContent);

                console.log("inserting", filename);
                yield collection.insertOne(obj);
            }),
            {concurrency: 10} // You can increase concurrency here
        );
    } finally {
        yield db.close();
    }
});

insertMillionsFromPath("./myFiles", "mongodb://localhost:27017/database")
    .then(()=>console.log("OK"))
    .catch((err)=>console.log("ERROR", err));

In order for this to work, you will need to install the following packages:

npm install --save mongodb bluebird glob

and you will need to use Node.js version 6 or greater; otherwise you will need to transpile your JavaScript (due to the use of function*() generators).
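
On Node.js 8 or newer, the same flow can also be written with plain async/await instead of Bluebird coroutines; a rough sketch, keeping the driver 2.x connect style used above (util.promisify replaces the Bluebird helpers, and the insert is sequential rather than concurrent):

const util = require('util');
const fs = require('fs');
const Path = require('path');
const glob = util.promisify(require('glob'));
const readFile = util.promisify(fs.readFile);
const MongoClient = require('mongodb').MongoClient;

async function insertMillionsFromPath(path, mongoConnString) {
    const db = await MongoClient.connect(mongoConnString);
    try {
        const collection = db.collection('users');
        const files = await glob(Path.join(path, '*.json'));

        //  sequential for simplicity; the Bluebird version above adds concurrency
        for (const filename of files) {
            const fileContent = await readFile(filename);
            await collection.insertOne(JSON.parse(fileContent));
        }
    } finally {
        await db.close();
    }
}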

answered Oct 25 '22 by David Rissato Cruz