
node-postgres with massive amount of queries

I just started playing around with node.js and postgres, using node-postgres. One of the things I tried to do is to write a short script to populate my database, using a file with about 200,000 entries.

I noticed that after some time (less than 10 seconds), I start to get "Error: Connection terminated". I am not sure whether this is a problem with how I use node-postgres, or if it's because I was spamming postgres.

Anyway, here is some simple code that shows this behaviour:

var pg = require('pg');
var connectionString = "postgres://xxxx:xxxx@localhost/xxxx";

pg.connect(connectionString, function(err,client,done){
  if(err) {
    return console.error('could not connect to postgres', err);
  }

  client.query("DROP TABLE IF EXISTS testDB");
  client.query("CREATE TABLE IF NOT EXISTS testDB (id int, first int, second int)");
  done();

  for (var i = 0; i < 1000000; i++){
    client.query("INSERT INTO testDB VALUES (" + i.toString() + "," + (1000000-i).toString() + "," + (-i).toString() + ")", function(err,result){
      if (err) {
         return console.error('Error inserting query', err);
      }
      done();
    });
  }
});

It fails after about 18,000-20,000 queries. Is this the wrong way to use client.query? I tried changing the default client number, but it didn't seem to help.

client.connect() doesn't seem to help either, but that was because I had too many clients, so I definitely think client pooling is the way to go.

Thanks for any help!

asked Mar 17 '15 by Daniel Sutantyo

2 Answers

UPDATE

This answer has since been superseded by the article Data Imports, which represents the most up-to-date approach.


In order to replicate your scenario I used the pg-promise library, and I can confirm that trying it head-on will never work, no matter which library you use; it is the approach that matters.
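
For context, the code below assumes the usual pg-promise setup, roughly like this (not shown in the original snippet; the connection string is a placeholder, and Bluebird is just one choice of promise library, plugged in via the promiseLib option):

var promise = require('bluebird'); // any compliant promise library will do
var pgp = require('pg-promise')({promiseLib: promise});
var db = pgp('postgres://xxxx:xxxx@localhost/xxxx'); // placeholder connection details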

Below is a modified approach in which we partition the inserts into chunks and then execute each chunk within its own transaction, i.e. load balancing (aka throttling):

// Inserts N records within a single transaction
function insertRecords(N) {
    return db.tx(function (ctx) {
        var queries = [];
        for (var i = 1; i <= N; i++) {
            queries.push(ctx.none('insert into test(name) values($1)', 'name-' + i));
        }
        return promise.all(queries);
    });
}

// Runs 10 such transactions in a synchronous chain: each new
// transaction starts only after the previous one has settled
function insertAll(idx) {
    if (!idx) {
        idx = 0;
    }
    return insertRecords(100000)
        .then(function () {
            if (idx >= 9) {
                return promise.resolve('SUCCESS');
            } else {
                return insertAll(++idx);
            }
        }, function (reason) {
            return promise.reject(reason);
        });
}

insertAll()
    .then(function (data) {
        console.log(data);
    }, function (reason) {
        console.log(reason);
    })
    .done(function () {
        pgp.end(); // shut down the connection pool when finished
    });

This produced 1,000,000 records in about 4 minutes, dramatically slowing down after the first 3 transactions. I was using Node JS 0.10.38 (64-bit), which consumed about 340MB of memory. This way we inserted 100,000 records, 10 times in a row.

If we do the same, but this time insert 10,000 records across 100 transactions, the same 1,000,000 records are added in just 1m25s, with no slowdown, and with Node JS consuming around 100MB of memory, which tells us that partitioning data like this is a very good idea.

It doesn't matter which library you use; the approach should be the same:

  1. Partition/throttle your inserts into multiple transactions;
  2. Keep the list of inserts in a single transaction at around 10,000 records;
  3. Execute all your transactions in a synchronous chain;
  4. Release the connection back to the pool after each transaction's COMMIT.

If you break any of those rules, you're guaranteed trouble. For example, if you break rule 3, your Node JS process is likely to run out of memory very quickly and throw an error. Rule 4 in my example was provided by the library.

And if you follow this pattern, you don't need to trouble yourself with the connection pool settings.
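
As an illustration only, here is a minimal sketch of those four rules written against plain node-postgres, assuming a newer version of the pg module that ships pg.Pool and a Node version with async/await; the table, columns and connection string are placeholders:

// A sketch of rules 1-4 with plain node-postgres (assumes pg.Pool is available)
const { Pool } = require('pg');
const pool = new Pool({ connectionString: 'postgres://xxxx:xxxx@localhost/xxxx' });

const TOTAL = 1000000;
const CHUNK = 10000; // rule 2: keep each transaction at ~10,000 inserts

// Rule 1: one chunk = one transaction
async function insertChunk(offset) {
    const client = await pool.connect();
    try {
        await client.query('BEGIN');
        const end = Math.min(offset + CHUNK, TOTAL);
        for (let i = offset; i < end; i++) {
            await client.query('INSERT INTO testDB VALUES ($1, $2, $3)', [i, TOTAL - i, -i]);
        }
        await client.query('COMMIT');
    } catch (e) {
        await client.query('ROLLBACK');
        throw e;
    } finally {
        client.release(); // rule 4: return the connection to the pool after COMMIT/ROLLBACK
    }
}

// Rule 3: run the transactions one after another, never all at once
async function insertAll() {
    for (let offset = 0; offset < TOTAL; offset += CHUNK) {
        await insertChunk(offset);
    }
}

insertAll()
    .then(() => console.log('SUCCESS'))
    .catch(err => console.error(err))
    .then(() => pool.end());

Each chunk commits (or rolls back) before the next one starts, so memory stays flat and the pool never backs up.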

UPDATE 1

Later versions of pg-promise support such scenarios perfectly, as shown below:

// Returns the next insert request for the given index,
// or nothing (undefined) to signal the end of the sequence
function factory(index) {
    if (index < 1000000) {
        return this.query('insert into test(name) values($1)', 'name-' + index);
    }
}

db.tx(function () {
    return this.batch([
        this.none('drop table if exists test'),
        this.none('create table test(id serial, name text)'),
        this.sequence(factory), // key method
        this.one('select count(*) from test')
    ]);
})
    .then(function (data) {
        console.log("COUNT:", data[3].count);
    })
    .catch(function (error) {
        console.log("ERROR:", error);
    });

and if you do not want to include anything extra, like table creation, then it looks even simpler:

function factory(index) {
    if (index < 1000000) {
        return this.query('insert into test(name) values($1)', 'name-' + index);
    }
}

db.tx(function () {
    return this.sequence(factory);
})
    .then(function (data) {
        // success;
    })
    .catch(function (error) {
        // error;
    });

See Synchronous Transactions for details.

Using Bluebird as the promise library, for example, it takes 1m43s on my production machine to insert 1,000,000 records (without long stack traces enabled).

You would just have your factory method return requests according to the index, till you have none left, simple as that.

And the best part: this isn't just fast, it also creates very little load on your NodeJS process. The test process stays under 60MB of memory during the entire test, consuming only 7-8% of the CPU time.

UPDATE 2

Starting with version 1.7.2, pg-promise supports super-massive transactions with ease. See chapter Synchronous Transactions.

For example, I could insert 10,000,000 records in a single transaction in just 15 minutes on my home PC, with Windows 8.1 64-bit.

For the test I set my PC to production mode, and used Bluebird as the promise library. During the test, memory consumption didn't go over 75MB for the entire NodeJS 0.12.5 process (64-bit), while my i7-4770 CPU showed consistent 15% load.

Inserting 100m records the same way would require just more patience, but not more computer resources.

In the meantime, the previous test for 1m inserts dropped from 1m43s to 1m31s.

UPDATE 3

The following considerations can make a huge difference: Performance Boost.

UPDATE 4

Related question, with a better implementation example: Massive inserts with pg-promise.

UPDATE 5

A better and newer example can be found here: nodeJS inserting Data into PostgreSQL error

answered Oct 21 '22 by vitaly-t

I'm guessing that you are reaching the max pool size. Since client.query is asynchronous, probably all the available connections are used before they are returned.

The default pool size is 10. Check here: https://github.com/brianc/node-postgres/blob/master/lib/defaults.js#L27

You can increase default pool size by setting pg.defaults.poolSize:

pg.defaults.poolSize = 20;

Update: Execute another query after freeing a connection.

var pg = require('pg');
var connectionString = "postgres://xxxx:xxxx@localhost/xxxx";
var MAX_POOL_SIZE = 25;

pg.defaults.poolSize = MAX_POOL_SIZE;
pg.connect(connectionString, function(err, client, done){
  if(err) {
    return console.error('could not connect to postgres', err);
  }

  var i;

  // Called once a query has finished: free the connection,
  // then enqueue the next insert if there is one left.
  var release = function() {
    done();
    i++;
    if (i < 1000000)
      insertQ();
  };

  var insertQ = function() {
    client.query("INSERT INTO testDB VALUES (" + i.toString() + "," + (1000000 - i).toString() + "," + (-i).toString() + ")", function(err, result){
      if (err) {
        return console.error('Error inserting query', err);
      }
      release();
    });
  };

  client.query("DROP TABLE IF EXISTS testDB");
  client.query("CREATE TABLE IF NOT EXISTS testDB (id int, first int, second int)");
  done();

  // Kick off only as many inserts as there are pooled connections;
  // each completed insert triggers the next one via release().
  for (i = 0; i < MAX_POOL_SIZE; i++){
    insertQ();
  }
});

The basic idea is that since you are enqueuing a large number of queries with a relatively small connection pool size, you are reaching the max pool size. Here we make a new query only after an existing connection has been freed.
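
For comparison, here is a sketch of the same throttling idea written against the newer pg.Pool API (assuming a version of node-postgres that ships pg.Pool); the table and connection details are placeholders:

const { Pool } = require('pg');
const pool = new Pool({
    connectionString: 'postgres://xxxx:xxxx@localhost/xxxx',
    max: 25 // plays the same role as pg.defaults.poolSize above
});

const TOTAL = 1000000;
let next = 0;

// Fire one insert; when it completes, immediately fire the next one,
// so there are never more in-flight queries than workers.
function insertNext(done) {
    if (next >= TOTAL) return done();
    const i = next++;
    pool.query('INSERT INTO testDB VALUES ($1, $2, $3)', [i, TOTAL - i, -i], function (err) {
        if (err) return done(err);
        insertNext(done);
    });
}

// Start as many chains as there are pooled connections.
let remaining = 25;
for (let w = 0; w < 25; w++) {
    insertNext(function (err) {
        if (err) console.error('Error inserting query', err);
        if (--remaining === 0) pool.end();
    });
}

The effect is the same as above: at most 25 inserts are in flight at any moment, and a new one is issued only when a previous one has completed.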

answered Oct 21 '22 by Anurag Peshne