
Mongo Change Streams running multiple times (kind of): Node app running multiple instances

My Node app uses Mongo change streams, and the app runs 3+ instances in production (more eventually, so this will become a bigger issue as it grows). When a change comes in, the change stream handler fires in every instance, so the same change is processed as many times as there are processes.

How can I set things up so that each change is handled only once?

Here's what I've got:

const options = { fullDocument: "updateLookup" };

const filter = [
  {
    $match: {
      $and: [
        { "updateDescription.updatedFields.sites": { $exists: true } },
        { operationType: "update" }
      ]
    }
  }
];

const sitesStream = Client.watch(filter, options);

// Start listening to site stream
sitesStream.on("change", async change => {
  console.log("in site change stream", change);
  console.log(
    "in site change stream, update desc",
    change.updateDescription
  );

  // Do work...
  console.log("site change stream done.");
  return;
});
asked Aug 24 '18 by PaulIsLoud

3 Answers

It can be done with MongoDB query operators alone. Add a modulo query on the ID field, where the divisor is the number of app instances (N). The remainder is then an element of {0, 1, 2, ..., N-1}. If your app instances are numbered in ascending order from zero to N-1, you can write the filter like this:

const filter = [
  {
    "$match": {
      "$and": [
        // Other filters
        { "_id": { "$mod": [<number of instances>, <this instance's id>]}}
      ]
    }
  }
];
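A minimal runnable sketch of the idea (note that `$mod` requires a numeric value, so this assumes a numeric ID field rather than a default `ObjectId`; `INSTANCE_COUNT` and `INSTANCE_ID` are assumed to come from your deployment, e.g. environment variables):

```javascript
// Each instance knows the total count and its own index (assumed config).
const INSTANCE_COUNT = 3; // total app instances
const INSTANCE_ID = 0;    // this instance's index, 0..N-1

// The predicate MongoDB evaluates server-side with $mod, shown locally:
function handledByThisInstance(numericId) {
  return numericId % INSTANCE_COUNT === INSTANCE_ID;
}

// The corresponding change-stream pipeline (change events expose the
// document's id under documentKey._id):
const filter = [
  {
    $match: {
      $and: [
        { "updateDescription.updatedFields.sites": { $exists: true } },
        { operationType: "update" },
        { "documentKey._id": { $mod: [INSTANCE_COUNT, INSTANCE_ID] } }
      ]
    }
  }
];

console.log(handledByThisInstance(6)); // 6 % 3 === 0 → true
console.log(handledByThisInstance(7)); // 7 % 3 === 1 → false
```

Because every instance applies a disjoint remainder, each change matches exactly one instance's filter.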
answered Nov 10 '22 by Marc

Doing this with strong guarantees is difficult but not impossible. I wrote about the details of one solution here: https://www.alechenninger.com/2020/05/building-kafka-like-message-queue-with.html

The examples are in Java but the important part is the algorithm.

It comes down to a few techniques:

  • Each process attempts to obtain a lock
  • Each lock (or each change) has an associated fencing token
  • Processing each change must be idempotent
  • While processing the change, the token is used to ensure ordered, effectively-once updates.

More details in the blog post.
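As a rough in-memory illustration of the fencing-token idea (the class and method names below are made up for this sketch, not taken from the linked post; a real implementation would persist the lock in MongoDB and acquire it atomically, e.g. with `findOneAndUpdate`):

```javascript
// A lock whose token increases on every successful acquisition.
class FencedLock {
  constructor() {
    this.owner = null;
    this.token = 0; // monotonically increasing fencing token
  }
  acquire(processId) {
    if (this.owner !== null) return null; // someone else holds the lock
    this.owner = processId;
    this.token += 1;
    return this.token;
  }
  release(processId) {
    if (this.owner === processId) this.owner = null;
  }
}

// The resource rejects writes carrying a stale token, so a paused process
// that wakes up after losing its lock cannot clobber newer work.
class FencedResource {
  constructor() { this.highestToken = 0; this.value = null; }
  write(token, value) {
    if (token < this.highestToken) return false; // stale writer fenced off
    this.highestToken = token;
    this.value = value;
    return true;
  }
}

const lock = new FencedLock();
const resource = new FencedResource();

const t1 = lock.acquire("A");              // A gets token 1
lock.release("A");                         // A loses the lock (e.g. lease expiry)
const t2 = lock.acquire("B");              // B gets token 2
console.log(resource.write(t2, "from B")); // true
console.log(resource.write(t1, "from A")); // false — A's token is stale
```

Combined with idempotent processing, this gives the ordered, effectively-once behavior described above.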

answered Nov 10 '22 by Alec Henninger

It sounds like you need a way to partition updates between instances. Have you looked into Apache Kafka? Basically what you would do is have a single application that writes the change data to a partitioned Kafka Topic and have your node application be a Kafka consumer. This would ensure only one application instance ever receives an update.

Depending on your partitioning strategy, you could even ensure that updates for the same record always go to the same node app (if your application needs to maintain its own state). Otherwise, you can spread out the updates in a round robin fashion.

The biggest benefit to using Kafka is that you can add and remove instances without having to adjust configurations. For example, you could start one instance and it would handle all updates. Then, as soon as you start another instance, they each start handling half of the load. You can continue this pattern for as many instances as there are partitions (and you can configure the topic with thousands of partitions if you want); that is the power of the Kafka consumer group. Scaling down works in reverse.
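To illustrate the same-key-same-partition guarantee mentioned above, here is a toy partitioner (Kafka's real default partitioner hashes the record key with murmur2; the simple string hash below is just a stand-in for illustration):

```javascript
const NUM_PARTITIONS = 6; // illustrative topic configuration

// Deterministically maps a record key to a partition. Because the mapping
// depends only on the key, all updates for the same document id land on
// the same partition — and thus the same consumer instance in the group.
function partitionFor(key) {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple stand-in hash
  }
  return hash % NUM_PARTITIONS;
}

// Same document id always maps to the same partition:
console.log(partitionFor("doc-42") === partitionFor("doc-42")); // true
```

With keys chosen per document id, a stateful node app is guaranteed to see every update for the documents it owns, in order, on a single instance.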

answered Nov 10 '22 by jteezy14