
Why does DocumentDb fail sporadically when run in a test scenario

We have a project currently under development that uses Azure DocumentDb as the data repository. It's been going great, and I really like how it works and how it enables rapid development, but recently our integration tests have started to fail.

Our integration tests create and tear down a collection within a DB every time the tests run. I wonder if it's this process that has 'broken' the database in some way.

I have stripped down our project to its bare bones and checked it in here: https://github.com/DamianStanger/DocumentDbDemo

When I run the tests I get the following error:

System.AggregateException : One or more errors occurred.
  ----> Microsoft.Azure.Documents.DocumentClientException : Message:  {"Errors":["Resource with specified id or name already exists"]}
ActivityId: e273b9d6-b571-43d3-9802-c7d7c819a3f0, Request URI: /apps/c9c8f510-0ca7-4702-aa6c-9c596d797367/services/507e2a70-c787-437c-9587-0ff4341bc265/partitions/ae4ca317-e883-4419-84f9-c8d053ffc73d/replicas/131159218637566393p
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at DocumentDbDemo.Data.AggregateRepository.CreateCollectionIfNotExists() in K:\_code\VisualStudio\DocumentDbPerfTests\DocumentDbDemo.Data\AggregateRepository.cs:line 32
   at DocumentDbDemo.Data.AggregateRepository..ctor(ConfigFactory configFactory) in K:\_code\VisualStudio\DocumentDbPerfTests\DocumentDbDemo.Data\AggregateRepository.cs:line 19
   at DocumentDbDemo.Data.Tests.AggregateRepositoryTests.ShouldReturnNullIfNotFound() in K:\_code\VisualStudio\DocumentDbPerfTests\DocumentDbDemo.Data.Tests\AggregateRepositoryTests.cs:line 24
--DocumentClientException

This is caused by a failure in the call to _client.ReadDocumentCollectionAsync within AggregateRepository.cs. I do not understand this. The exception is occurring in the code that first checks whether the collection exists (it does) and creates it only if it does not. Clearly the create will fail, as the collection exists!

The second type of failure is:

System.AggregateException : One or more errors occurred.
  ----> Microsoft.Azure.Documents.DocumentClientException : Message: {"Errors":["Owner resource does not exist"]}
ActivityId: 9e25516a-25fe-4bf3-a88d-6234c76ac47d, Request URI: /apps/c9c8f510-0ca7-4702-aa6c-9c596d797367/services/507e2a70-c787-437c-9587-0ff4341bc265/partitions/ae4ca317-e883-4419-84f9-c8d053ffc73d/replicas/131159551041924002s
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at DocumentDbDemo.Data.Tests.AggregateRepositoryTests.ShouldSaveNewAggregate(AggregateRepository aggregateRepository) in K:\_code\VisualStudio\DocumentDbPerfTests\DocumentDbDemo.Data.Tests\AggregateRepositoryTests.cs:line 48
   at DocumentDbDemo.Data.Tests.AggregateRepositoryTests.ShouldSaveAndReadTheDocument() in K:\_code\VisualStudio\DocumentDbPerfTests\DocumentDbDemo.Data.Tests\AggregateRepositoryTests.cs:line 42
--DocumentClientException

This is equally baffling: the collection again exists, but the document does not, because we are creating it for the first time with a unique GUID. The failing code is the call to _client.UpsertDocumentAsync, again in the class AggregateRepository.cs.

Reproduction

I have reproduced this many times using the code in the aforementioned GitHub repo, but only against one specific DocumentDb database and collection. When I switch to a different, brand new DB, the code and tests work as expected!

This is why I think it's down to how we have been using a particular database. This project is a few weeks old now and all the tests were running just fine until yesterday, when they started to fail sporadically. Sometimes both would be green, sometimes one or the other would fail, and sometimes both would fail.

One question I have: is it a problem for DocumentDb if we create and delete a particular collection over and over, potentially many times a minute? Or are there known failure cases if you do this?

I could of course just bin our test DB, create another and bury our heads in the sand, hoping it's a one-off. But could this happen in prod? I really want to get to the bottom of this. Is it possible to see the internal state of that 'broken' DB in any way?

NOTE:

I now get the failure even if I comment out / remove the clean function in the test class. So I don't think it's an issue with async and await and the collection being deleted before the read/write is finished.

Also note that in my real project we don't use a loop like the one you will find in the test class; it is there just so it's easy for me (and you?) to run the tests multiple times until they fail (which they won't for a new DB that you might have!).

Damo asked Aug 19 '16 13:08



1 Answer

I believe the problem stems from the consistency level of Cosmos DB. Basically, a Cosmos database has a few local instances, which you access through a kind of load balancer. Under the default consistency model, when you perform an update it is only eventually written to all nodes, so a read that follows immediately may be served by a node that has not yet seen the write.

If you want to ensure that reads will not fail, you need to either use a strong consistency model, or use a session and send the session token on the subsequent read.
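
As a rough illustration of the session-token option, here is a minimal sketch using the .NET DocumentDB SDK (Microsoft.Azure.Documents.Client). The database id, collection id, class name, and document shape are placeholders I made up, not taken from the question's repo, and the sketch assumes a single-partition collection (a partitioned one would also need a PartitionKey in the RequestOptions). I have not run this against the failing database, so treat it as a starting point rather than a confirmed fix.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public static class SessionReadSketch
{
    // Hypothetical ids: replace with your own database and collection.
    private const string DatabaseId = "TestDb";
    private const string CollectionId = "TestCollection";

    // Creates a client that asks for Session consistency on its requests.
    // Assumption: the account's default consistency is Session or stronger.
    public static DocumentClient CreateClient(string endpoint, string authKey)
    {
        return new DocumentClient(
            new Uri(endpoint), authKey, new ConnectionPolicy(), ConsistencyLevel.Session);
    }

    public static async Task<Document> UpsertThenReadAsync(DocumentClient client, string id)
    {
        Uri collectionUri = UriFactory.CreateDocumentCollectionUri(DatabaseId, CollectionId);

        // Write the document and keep the session token returned with the response.
        ResourceResponse<Document> write =
            await client.UpsertDocumentAsync(collectionUri, new { id = id, value = 42 });
        string sessionToken = write.SessionToken;

        // Send the token with the read so it is served from a replica
        // that has already seen the write above.
        var options = new RequestOptions { SessionToken = sessionToken };
        Uri documentUri = UriFactory.CreateDocumentUri(DatabaseId, CollectionId, id);
        ResourceResponse<Document> read = await client.ReadDocumentAsync(documentUri, options);
        return read.Resource;
    }
}
```

As far as I know, a single DocumentClient instance running with Session consistency tracks the token for you, so passing it explicitly mainly matters when the write and the read go through different client instances. Also, to my understanding a client can only request a consistency level equal to or weaker than the account default, so switching to strong consistency has to be done on the account itself.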

Danny Shumer answered Nov 14 '22 23:11
