Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DocumentDB returning different result count every time on large paged fetch

The Problem

I'm trying to pull roughly 20,000 documents out of docdb. To reduce the amount I'm fetching, I'm only pulling out the document's ID. I'm paging through the results and when I've built up a list after it's done, I've found that the number of results changes (and they're all wrong!). It's not getting all of the results it should. There are no other processes that are currently touching the data.

Here's an idea of how I'm querying:

var sql = new SqlQuerySpec(@"SELECT items.id FROM items WHERE items.foo = @bar")
{
    Parameters = new SqlParameterCollection { new SqlParameter("@bar", bar) }
};

var feedOptions = new FeedOptions { MaxItemCount = 2000 };
var query = this.Context.Client.CreateDocumentQuery<Foo>(dbSelfLink, sql, feedOptions).AsDocumentQuery();

var ids = new List<string>();

while (query.HasMoreResults)
{
    var page = await query.ExecuteNextAsync<Foo>().ConfigureAwait(false);
    ids.AddRange(page.AsEnumerable());
}

I ran this three times while debugging and got three different ID counts:

19,323
19,321
19,327

And none of these are quite as high as I expect them. They're within 100 or so of what I expect.

What I've Ruled Out

  • It isn't a deserialization problem because I've tried changing the query to select an ID that wasn't in the result set and it grabs it fine.

  • It isn't the query itself, as I can run the same query in the Azure portal's query explorer with an additional AND items.id = '...' looking for a specific missing ID and it returns it.

  • It isn't the way I'm doing paging, because I've used this same method elsewhere for smaller queries (~100 results) and it works fine.

  • It isn't because I'm crossing the data center boundary, as this happens in production as well.

What I Suspect

It could be that I'm being rate limited (my two collections are S2), but I'm not sure why that would change the number of results, or how I would prove that. I may try to upgrade the collections to S3 to see if that fixes the problem, but that wouldn't work for me long-term.

Other than that, I'd really appreciate any insight into what might be happening. Thanks!

Edit

I tried upgrading the collections. No change.

Edit 2 with response to comments

Yes, we are using lazy indexing...I was beginning to wonder if that was the culprit but not sure why that would result in not getting all of our records. If this needs to change, that's probably okay, as the indexing was originally done without much thought (and it's using an old format which you can see here). Below is the indexing policy:

{
  "indexingMode": "Lazy",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/",
      "indexes": [
        {
          "kind": "Hash",
          "dataType": "Number",
          "precision": 3
        },
        {
          "kind": "Hash",
          "dataType": "String",
          "precision": 3
        }
      ]
    },
    {
      "path": "/\"_ts\"/?",
      "indexes": [
        {
          "kind": "Range",
          "dataType": "Number",
          "precision": 6
        },
        {
          "kind": "Hash",
          "dataType": "String",
          "precision": 3
        }
      ]
    }
  ],
  "excludedPaths": []
}

Regarding the MaxItemCount, I have also tried with -1 and saw similar behavior.

like image 503
MattM Avatar asked Oct 31 '22 12:10

MattM


1 Answers

This is because you're using the Lazy indexing mode. Lazy indexing as the name suggests is performed as a low priority process relative to writes and serves "eventually consistent results" to queries.

Since DocumentDB maintains multiple replicas of every collection (for HA and performance), you might have replicas that are further ahead or behind in terms of which documents are indexed.

Because each page request within a query is routed randomly by the client/gateway for load-balancing, you can see results that can contain/miss results based on which replicas are selected. Eventually, when indexing is up to date on all replicas, queries will return the same number of results.

If you would like consistent query results, you can switch the collection to use Consistent indexing mode.

like image 188
Aravind Krishna R. Avatar answered Nov 15 '22 05:11

Aravind Krishna R.