Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select query skips records during concurrent updates

i have table that processed concurrently by N threads.

CREATE TABLE [dbo].[Jobs]
(
    [Id]                    BIGINT          NOT NULL    CONSTRAINT [PK_Jobs] PRIMARY KEY IDENTITY,
    [Data]                  VARBINARY(MAX)  NOT NULL,
    [CreationTimestamp]     DATETIME2(7)    NOT NULL,
    [Type]                  INT             NOT NULL,
    [ModificationTimestamp] DATETIME2(7)    NOT NULL,
    [State]                 INT             NOT NULL,
    [RowVersion]            ROWVERSION      NOT NULL,
    [Activity]              INT                 NULL,
    [Parent_Id]             BIGINT              NULL
)
GO

CREATE NONCLUSTERED INDEX [IX_Jobs_Type_State_RowVersion] ON [dbo].[Jobs]([Type], [State], [RowVersion] ASC) WHERE ([State] <> 100)
GO

CREATE NONCLUSTERED INDEX [IX_Jobs_Parent_Id_State] ON [dbo].[Jobs]([Parent_Id], [State] ASC)
GO

Job is adding to table with State=0 (New) — it can be consumed by any worker in this state. When worker gets this queue item, State changed to 50 (Processing) and job becomes unavailable for other consumers (workers call [dbo].[Jobs_GetFirstByType] with arguments: Type=any, @CurrentState=0, @NewState=50).

CREATE PROCEDURE [dbo].[Jobs_GetFirstByType]
    @Type           INT,
    @CurrentState   INT,
    @NewState       INT
AS
BEGIN

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    DECLARE @JobId BIGINT;

    BEGIN TRAN

        SELECT      TOP(1)
                    @JobId = Id
        FROM        [dbo].[Jobs] WITH (UPDLOCK, READPAST)
        WHERE       [Type] = @Type AND [State] = @CurrentState
        ORDER BY    [RowVersion];


        UPDATE  [dbo].[Jobs]

        SET     [State] = @NewState,
                [ModificationTimestamp] = SYSUTCDATETIME()

        OUTPUT  INSERTED.[Id]
                ,INSERTED.[RowVersion]
                ,INSERTED.[Data]
                ,INSERTED.[Type]
                ,INSERTED.[State]
                ,INSERTED.[Activity]

        WHERE   [Id] = @JobId;

    COMMIT TRAN

END

After processing, job State can be changed to 0 (New) again or it can be once set to 100 (Completed).

CREATE PROCEDURE [dbo].[Jobs_UpdateStatus]
    @Id         BIGINT,
    @State      INT,
    @Activity   INT
AS
BEGIN

    UPDATE  j

    SET     j.[State] = @State,
            j.[Activity] = @Activity,
            j.[ModificationTimestamp] = SYSUTCDATETIME()

    OUTPUT  INSERTED.[Id], INSERTED.[RowVersion]

    FROM    [dbo].[Jobs] j

    WHERE   j.[Id] = @Id;

END

Jobs has hierarchical structure, parent job gets State=100 (Completed) only when all childs are completed. Some workers call stored procedures ([dbo].[Jobs_GetCountWithExcludedState] with @ExcludedState=100) that returns number of incompleted jobs, when it returns 0, parent job State can be set to 100 (Completed).

CREATE PROCEDURE [dbo].[Jobs_GetCountWithExcludedState]
    @ParentId       INT,
    @ExcludedState  INT
AS
BEGIN

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    SELECT  COUNT(1)

    FROM    [dbo].[Jobs]

    WHERE   [Parent_Id] = @ParentId
    AND     [State] <> @ExcludedState

END

The main problem is strange behaviour of this stored procedure. Sometimes it returns 0 for parent job, but it exactly has incompleted jobs. I tryied turn on change data tracking and some debug information (including profiling) — child jobs 100% doesn't have State=100 when SP return 0. It seems that the SP skips records, that are not in 100 (Completed) state, but why it happen and how we can prevent this?

UPD: Calling [dbo].[Jobs_GetCountWithExcludedState] starts when parent job has childs. There сan be no situation when worker starts checking child jobs without their existence, because creating childs and setting to parent job checking activity wrapped in transaction:

using (var ts = new TransactionScope())
{
    _jobManager.AddChilds(parentJob);

    parentJob.State = 0;
    parentJob.Activity = 30; // in this activity worker starts checking child jobs

    ts.Complete();
}
like image 303
Boo Avatar asked Oct 23 '15 06:10

Boo


1 Answers

It would be very disturbing if in fact your procedure Jobs_GetCountWithExcludedState was returning a count of 0 records when there were in fact committed records matching your criteria. It's a pretty simple procedure. So there are two possibilities:

  • The query is failing due to an issue with SQL Server or data corruption.
  • There actually are no committed records matching the criteria at the time the procedure is run.

Corruption is an unlikely, but possible cause. You can check for corruption with DBCC CHECKDB.

Most likely there really are no committed job records that have a Parent_ID equal to the @ParentId parameter and are not in a state of 100 at the time it is run.

I emphasize committed because that's what the transaction will see.

You never really explain in your question how the Parent_ID gets set on the jobs. My first thought is that maybe you are checking for unprocessed child jobs and it finds none, but then another process adds it as the Parent_ID of another incomplete job. Is this a possibility?

I see you added an update to show that when you add a child job record that the update of the parent and child records are wrapped in a transaction. This is good, but not the question I was asking. This is the scenario that I am considering as a possibility:

  • A Job Record is inserted and committed for the parent.
  • Jobs_GetFirstByType grabs the parent job.
  • A worker thread processes it and calls Jobs_UpdateStatus and updates it's status to 100.
  • Something calls Jobs_GetCountWithExcludedState with the job and returns 0.
  • A child job is created and attached to the completed parent job record... which makes it now incomplete again.

I'm not saying that this is what is happening... I'm just asking if it's possible and what steps are you taking to prevent it? For example, in your code above in the update to your question you are selecting a ParentJob to attach the child to outside of the transaction. Could it be that you are selecting a parent job and then it gets completed before you run the transaction that adds the child to the parent? Or maybe the last child job of a parent job completes so the worker thread checks and marks the parent complete, but some other worker thread has already selected the job to be the parent for a new child job?

There are many different scenarios that could cause the symptom you are describing. I believe that the problem is to be found in some code that you have not shared with us yet particularly about how jobs are created and the code surrounding calls to Jobs_GetCountWithExcludedState. If you can give more information I think you will be more likely to find a usable answer, otherwise the best we can do is guess all the things that could happen in the code we can't see.

like image 200
Brian Pressler Avatar answered Oct 16 '22 08:10

Brian Pressler