How to wait until all bulk writes are completed in the Elasticsearch API

Tags:

elasticsearch

I'm using the Node.js Elasticsearch client to write a data importer that bulk-imports documents from MongoDB. The problem is that the index refresh doesn't seem to wait until all documents have been written to Elasticsearch before the counts are checked.

I use the Node.js streams API to read the records into a batch, then the Elasticsearch bulk API to write the batch, as shown below:

function rebuildIndex(modelName, queryStream, openStream, done) {
    logger.debug('Rebuilding %s index', modelName);
    async.series([
        function (next) {
          deleteType(modelName, function (err, result) {
            next(err, result);
          });
        },
        function (next) {
          var Model;
          var i = 0;
          var batchSize = settings.indexBatchSize;
          var batch = [];
          var stream;

          if (queryStream && !openStream) {
            stream = queryStream.stream();
          } else if (queryStream && openStream) {
            stream = queryStream;
          }else
          {
            Model = mongoose.model(modelName);
            stream = Model.find({}).stream();
          }

          stream.on("data", function (doc) {
            logger.debug('indexing %s', doc.userType);
            batch.push({
              index: {
                "_index": settings.index,
                "_type": modelName.toLowerCase(),
                "_id": doc._id.toString()
              }
            });
            var obj;
            if (doc.toObject){
              obj = doc.toObject();
            }else{
              obj = doc;
            }
            obj = _.clone(obj);

            delete obj._id;
            batch.push(obj);
            i++;
            if (i % batchSize == 0) {
              console.log(chalk.green('Loaded %s records'), i);
              client().bulk({
                body: batch
              }, function (err, resp) {
                if (err) {
                  next(err);
                } else if (resp.errors) {
                  next(resp);
                }
              });
              batch = [];
            }
          });

          // When the stream ends write the remaining records
          stream.on("end", function () {
            if (batch.length > 0) {
              console.log(chalk.green('Loaded %s records'), batch.length / 2);
              client().bulk({
                body: batch
              }, function (err, resp) {
                if (err) {
                  logger.error(err, 'Failed to rebuild index');
                  next(err);
                } else if (resp.errors) {
                  logger.error(resp.errors, 'Failed to rebuild index');
                  next(resp);
                } else {
                  logger.debug('Completed rebuild of %s index', modelName);
                  next();
                }
              });
            } else {
              next();
            }

            batch = [];
          })
        }

      ],
      function (err) {
        if (err)
          logger.error(err);
        done(err);
      }
    );
  }

I use this helper to check the document counts in the index. Without the timeout, the counts in the index are wrong, but with the timeout they're okay.

/**
   * A helper function to count the number of documents in the search index for a particular type.
   * @param type The type, e.g. User, Customer etc.
   * @param done A callback to report the count.
   */
  function checkCount(type, done) {
    async.series([
      function(next){
        setTimeout(next, 1500);
      },
      function (next) {
        refreshIndex(next);
      },
      function (next) {
        client().count({
          "index": settings.index,
          "type": type.toLowerCase(),
          "ignore": [404]
        }, function (error, count) {
          if (error) {
            next(error);
          } else {
            next(error, count.count);
          }
        });
      }
    ], function (err, count) {
      if (err)
        logger.error({"err": err}, "Could not check index counts.");
      done(err, count[2]);
    });
  }

And this helper is supposed to refresh the index after the update completes:

// required to get results to show up immediately in tests. Otherwise there's a 1 second delay
  // between adding an entry and it showing up in a search.
  function refreshIndex(done) {
    client().indices.refresh({
      "index": settings.index,
      "ignore": [404]
    }, function (error, response) {
      if (error) {
        done(error);
      } else {
        logger.debug("refreshed index");
        done();
      }
    });
  }

The loader works okay, except this test fails because of timing between the bulk load and the count check:

it('should be able to rebuild and reindex customer data', function (done) {
    this.timeout(0); // otherwise the stream reports a timeout error
    logger.debug("Testing the customer reindexing process");

    // pass null to use the generic find all query
    searchUtils.rebuildIndex("Customer", queryStream, false, function () {
      searchUtils.checkCount("Customer", function (err, count) {
        th.checkSystemErrors(err, count);
        count.should.equal(volume.totalCustomers);
        done();
      })
    });
  });

I observe random results in the counts from the tests. With the artificial delay (the setTimeout in the checkCount function), the counts match, so I conclude that the documents are eventually written to Elasticsearch and the test would pass. I thought indices.refresh would essentially force a wait until all the documents are written to the index, but that doesn't seem to work with this approach.

The setTimeout hack is not sustainable once the volume reaches production level, so how can I ensure the bulk calls are completely written to the Elasticsearch index before checking the document count?

asked Apr 16 '16 06:04

Richard G

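1 Answer

Take a look at the "refresh" parameter of the bulk API (Elasticsearch documentation). With refresh set to "wait_for", the request does not return until a refresh has made its changes visible to search.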

For example:

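const bulkUpdatesBody = [ /* bulk actions / docs to index go here */ ];
client.bulk({
  refresh: "wait_for", // respond only once the changes are searchable
  body: bulkUpdatesBody
}).then(function (resp) {
  // The bulk request has completed here; check resp.errors before counting.
});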
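
Note that a refresh only exposes documents that have already been indexed. In the question's loader, the bulk calls issued from the "data" handler are never waited on: next() fires as soon as the final batch returns, even if earlier batches are still in flight. A minimal sketch of one way to close that gap is to send one batch at a time and await each response before reading more documents. The names bulkIndexAll and makeFakeClient are hypothetical and not part of any client library; the fake client only stands in for Elasticsearch so the sketch runs on its own. It assumes the client returns promises when no callback is passed, that the response exposes errors at the top level (this differs between client versions), and that the documents are plain objects from an async-iterable source.

```javascript
const { Readable } = require('stream');

// Hypothetical helper: index every document from an async-iterable source,
// awaiting each bulk request before reading further.
async function bulkIndexAll(client, docs, indexName, batchSize) {
  let batch = [];
  let total = 0;

  async function flush() {
    if (batch.length === 0) return;
    const body = batch;
    batch = [];
    const resp = await client.bulk({ body });
    if (resp.errors) throw new Error('bulk response reported failed items');
    total += body.length / 2; // two entries (action + source) per document
  }

  for await (const doc of docs) {
    const { _id, ...source } = doc; // assumes plain objects, not Mongoose documents
    batch.push({ index: { _index: indexName, _id: String(_id) } }, source);
    if (batch.length / 2 >= batchSize) await flush();
  }
  await flush(); // remaining partial batch

  // Every bulk request has completed by now, so one refresh exposes them all.
  await client.indices.refresh({ index: indexName });
  return total;
}

// Hypothetical in-memory stand-in for the Elasticsearch client (illustration only).
function makeFakeClient() {
  const stored = new Map();
  let visible = 0; // documents exposed by the last refresh
  return {
    bulk({ body }) {
      return new Promise((resolve) => setTimeout(() => {
        for (let i = 0; i < body.length; i += 2) {
          stored.set(body[i].index._id, body[i + 1]);
        }
        resolve({ errors: false });
      }, 10));
    },
    indices: { refresh: async () => { visible = stored.size; } },
    count: async () => ({ count: visible }),
  };
}

// Demo: 25 in-memory documents, batches of 10.
(async () => {
  const client = makeFakeClient();
  const docs = Array.from({ length: 25 }, (_, i) => ({ _id: i, name: 'customer' + i }));
  await bulkIndexAll(client, Readable.from(docs), 'customers', 10);
  console.log(await client.count()); // prints { count: 25 }
})();
```

Awaiting inside the loop also applies back pressure, since no further documents are read while a batch is in flight. If throughput matters more, several batches could be sent concurrently and awaited together before the refresh, at the cost of more bookkeeping.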

answered Oct 21 '22 08:10

Troy
