
NEST's method IndexMany to run synchronously

I ran into a small problem using NEST's IndexMany method (bulk index). When I send a batch of items to Elasticsearch to be indexed, the response is returned immediately, but not all of the documents are searchable at that point.

The problem can be reproduced with the following code:

List<object> objectToIndex = new List<object>(); // assume 3000 items here
ElasticClient client = new ElasticClient(settings);
client.IndexMany(objectToIndex, indexName, type);

var readResult = client.Search<T>(e => e
    .Type(type)
    .Index(indexName)
    .Query(q => q
        .Range(r => r.OnField(t => t.Date).GreaterOrEquals(dates[0]).LowerOrEquals(dates[1]))
    )
);
// readResult contains only 300-500 items at this point

System.Threading.Thread.Sleep(2000);

readResult = client.Search<T>(e => e
    .Type(type)
    .Index(indexName)
    .Query(q => q
        .Range(r => r.OnField(t => t.Date).GreaterOrEquals(dates[0]).LowerOrEquals(dates[1]))
    )
);
// readResult now contains all 3000 items

This is a problem for me, because I need to bulk index all of the documents and then read them all back. Sure, I can call Thread.Sleep(..) after the bulk index, but that is not an acceptable solution for me.

Elasticsearch version is 2.2.0 and NEST client version is 1.7.2.

So, is there a way to force Elasticsearch/NEST to wait until all documents are indexed before continuing?

Martin Brabec asked Mar 23 '26 07:03


1 Answer

NEST 1.x is not compatible with Elasticsearch 2.x; whilst it may work for the most part, it is untested against 2.x, and there are breaking changes between Elasticsearch 1.x and 2.x that are reflected in changes in NEST (server error responses, for example) that would result in a serialization exception at runtime. You should use a client whose major version matches the server: NEST/Elasticsearch.Net 2.x with Elasticsearch 2.x, or the latest NEST/Elasticsearch.Net 1.x (currently 1.8.0) with Elasticsearch 1.x.

Elasticsearch is near real-time: a newly indexed document becomes visible to search only after the index has been refreshed, which by default happens every 1 second. That is why a search issued right after the bulk call sees only some of the documents.

There's a trade-off to be made here between indexing rate and how soon newly indexed items are available for search. By changing the refresh interval from 1 second to something longer, such as 30 seconds, or disabling it completely whilst indexing (-1) and then setting it back to 1 second after finishing, you may see a better indexing rate, at the cost of needing to wait longer after indexing for documents to be available for search. In contrast, if having indexed items available for search as soon as possible is more important, then you may send smaller bulk batches with a call to refresh in the request, such as

client.Bulk(b => b
    .CreateMany(objectToIndex, (c, doc) => c
        .Document(doc)
        .Type(type)
        .Index(indexName)
    )
    .Refresh()
);

with the caveat that calling refresh more often is likely to increase load on the cluster and indexing is going to take longer.
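
The "smaller bulk batch sizes" mentioned above could be produced with a small helper along these lines. This is only a sketch: `Batching` and `InBatches` are hypothetical names, not part of NEST, and a suitable batch size depends on your documents and cluster.

```csharp
using System;
using System.Collections.Generic;

public static class Batching
{
    // Hypothetical helper (not part of NEST): splits a sequence into
    // consecutive lists of at most batchSize items each.
    public static IEnumerable<List<T>> InBatches<T>(IEnumerable<T> source, int batchSize)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (batchSize <= 0) throw new ArgumentOutOfRangeException("batchSize");
        return Iterate(source, batchSize);
    }

    private static IEnumerable<List<T>> Iterate<T>(IEnumerable<T> source, int batchSize)
    {
        var batch = new List<T>(batchSize);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == batchSize)
            {
                yield return batch;
                batch = new List<T>(batchSize);
            }
        }

        // Emit the final, partially filled batch, if any.
        if (batch.Count > 0)
            yield return batch;
    }
}
```

Each list returned by `Batching.InBatches(objectToIndex, 500)` could then be passed to the `client.Bulk(...)` call shown above in place of `objectToIndex`; the value 500 is purely illustrative.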

If you absolutely must wait until all documents have been indexed, I would recommend issuing a count rather than a search, to reduce the size of the response that needs to be deserialized, and comparing the result with the number of documents you sent:

var countResponse = client.Count<MyClass>(c => c
    .Type(type)
    .Index(indexName)
    .Query(q => q
        .Range(r => r
            .OnField(t => t.Date)
            .GreaterOrEquals(dates[0])
            .LowerOrEquals(dates[1])
        )
    )
);

var count = countResponse.Count;
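
To wait on that count without a fixed Thread.Sleep, one option is to poll it with a timeout. The following is a minimal sketch, not a NEST feature: `IndexingWait` and `WaitForCount` are hypothetical names, and the count is supplied as a delegate so the helper does not depend on any particular client version.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class IndexingWait
{
    // Hypothetical helper (not part of NEST): calls getCount repeatedly until it
    // reports at least `expected` documents or `timeout` has elapsed.
    // Returns true if the expected count was reached, false on timeout.
    public static bool WaitForCount(Func<long> getCount, long expected,
                                    TimeSpan timeout, TimeSpan pollInterval)
    {
        if (getCount == null) throw new ArgumentNullException("getCount");

        var stopwatch = Stopwatch.StartNew();
        while (true)
        {
            if (getCount() >= expected)
                return true;

            if (stopwatch.Elapsed >= timeout)
                return false;

            Thread.Sleep(pollInterval);
        }
    }
}
```

The delegate would wrap the count request above and return its `Count`, with `expected` set to the number of documents sent (3000 in the question). The timeout keeps the loop from spinning forever if some documents failed to index.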

Russ Cam answered Mar 24 '26 20:03