i have table that processed concurrently by N threads.
CREATE TABLE [dbo].[Jobs]
(
[Id] BIGINT NOT NULL CONSTRAINT [PK_Jobs] PRIMARY KEY IDENTITY,
[Data] VARBINARY(MAX) NOT NULL,
[CreationTimestamp] DATETIME2(7) NOT NULL,
[Type] INT NOT NULL,
[ModificationTimestamp] DATETIME2(7) NOT NULL,
[State] INT NOT NULL,
[RowVersion] ROWVERSION NOT NULL,
[Activity] INT NULL,
[Parent_Id] BIGINT NULL
)
GO
CREATE NONCLUSTERED INDEX [IX_Jobs_Type_State_RowVersion] ON [dbo].[Jobs]([Type], [State], [RowVersion] ASC) WHERE ([State] <> 100)
GO
CREATE NONCLUSTERED INDEX [IX_Jobs_Parent_Id_State] ON [dbo].[Jobs]([Parent_Id], [State] ASC)
GO
Job is adding to table with State=0 (New)
— it can be consumed by any worker in this state. When worker gets this queue item, State
changed to 50 (Processing)
and job becomes unavailable for other consumers (workers call [dbo].[Jobs_GetFirstByType]
with arguments: Type=any, @CurrentState=0, @NewState=50
).
CREATE PROCEDURE [dbo].[Jobs_GetFirstByType]
@Type INT,
@CurrentState INT,
@NewState INT
AS
BEGIN
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
DECLARE @JobId BIGINT;
BEGIN TRAN
SELECT TOP(1)
@JobId = Id
FROM [dbo].[Jobs] WITH (UPDLOCK, READPAST)
WHERE [Type] = @Type AND [State] = @CurrentState
ORDER BY [RowVersion];
UPDATE [dbo].[Jobs]
SET [State] = @NewState,
[ModificationTimestamp] = SYSUTCDATETIME()
OUTPUT INSERTED.[Id]
,INSERTED.[RowVersion]
,INSERTED.[Data]
,INSERTED.[Type]
,INSERTED.[State]
,INSERTED.[Activity]
WHERE [Id] = @JobId;
COMMIT TRAN
END
After processing, job State
can be changed to 0 (New)
again or it can be once set to 100 (Completed)
.
CREATE PROCEDURE [dbo].[Jobs_UpdateStatus]
@Id BIGINT,
@State INT,
@Activity INT
AS
BEGIN
UPDATE j
SET j.[State] = @State,
j.[Activity] = @Activity,
j.[ModificationTimestamp] = SYSUTCDATETIME()
OUTPUT INSERTED.[Id], INSERTED.[RowVersion]
FROM [dbo].[Jobs] j
WHERE j.[Id] = @Id;
END
Jobs has hierarchical structure, parent job gets State=100 (Completed)
only when all childs are completed.
Some workers call stored procedures ([dbo].[Jobs_GetCountWithExcludedState]
with @ExcludedState=100
) that returns number of incompleted jobs, when it returns 0, parent job State
can be set to 100 (Completed)
.
CREATE PROCEDURE [dbo].[Jobs_GetCountWithExcludedState]
@ParentId INT,
@ExcludedState INT
AS
BEGIN
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT COUNT(1)
FROM [dbo].[Jobs]
WHERE [Parent_Id] = @ParentId
AND [State] <> @ExcludedState
END
The main problem is strange behaviour of this stored procedure. Sometimes it returns 0 for parent job, but it exactly has incompleted jobs. I tryied turn on change data tracking and some debug information (including profiling) — child jobs 100% doesn't have State=100
when SP return 0.
It seems that the SP skips records, that are not in 100 (Completed)
state, but why it happen and how we can prevent this?
UPD:
Calling [dbo].[Jobs_GetCountWithExcludedState]
starts when parent job has childs. There сan be no situation when worker starts checking child jobs without their existence, because creating childs and setting to parent job checking activity wrapped in transaction:
using (var ts = new TransactionScope())
{
_jobManager.AddChilds(parentJob);
parentJob.State = 0;
parentJob.Activity = 30; // in this activity worker starts checking child jobs
ts.Complete();
}
It would be very disturbing if in fact your procedure Jobs_GetCountWithExcludedState
was returning a count of 0 records when there were in fact committed records matching your criteria. It's a pretty simple procedure. So there are two possibilities:
Corruption is an unlikely, but possible cause. You can check for corruption with DBCC CHECKDB.
Most likely there really are no committed job records that have a Parent_ID
equal to the @ParentId
parameter and are not in a state of 100 at the time it is run.
I emphasize committed because that's what the transaction will see.
You never really explain in your question how the Parent_ID
gets set on the jobs. My first thought is that maybe you are checking for unprocessed child jobs and it finds none, but then another process adds it as the Parent_ID
of another incomplete job. Is this a possibility?
I see you added an update to show that when you add a child job record that the update of the parent and child records are wrapped in a transaction. This is good, but not the question I was asking. This is the scenario that I am considering as a possibility:
Jobs_GetFirstByType
grabs the parent job.Jobs_UpdateStatus
and updates it's status to 100.Jobs_GetCountWithExcludedState
with the job and returns 0.I'm not saying that this is what is happening... I'm just asking if it's possible and what steps are you taking to prevent it? For example, in your code above in the update to your question you are selecting a ParentJob
to attach the child to outside of the transaction. Could it be that you are selecting a parent job and then it gets completed before you run the transaction that adds the child to the parent? Or maybe the last child job of a parent job completes so the worker thread checks and marks the parent complete, but some other worker thread has already selected the job to be the parent for a new child job?
There are many different scenarios that could cause the symptom you are describing. I believe that the problem is to be found in some code that you have not shared with us yet particularly about how jobs are created and the code surrounding calls to Jobs_GetCountWithExcludedState
. If you can give more information I think you will be more likely to find a usable answer, otherwise the best we can do is guess all the things that could happen in the code we can't see.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With