TFS Build Agent - Waiting for an agent to be requested

Question

I am testing a TFS 2013 to TFS 2018 on-premises upgrade. I have installed 2018.1 on a new system and upgraded a copy of my TFS databases. I have also installed a build agent on a new host, which shows up under Agent Queues as online and enabled.

I'm now trying to create a build. I set things up as I feel they should be and it sits at this screen:

Build 
Waiting for an available agent
Console

Waiting for an agent to be requested

The VSTS Agent service is running on the build agent system, so I feel that part is OK. I'm somewhat at a loss. Any assistance is appreciated.

Andy Li-MSFT · Accepted Answer

Try the following items to narrow down the issue:

1. Check the build definition requirements (Demands section) against the capabilities the agent offers. Make sure the required capabilities are installed on the agent machine.
   - When a build is queued, the system sends the job only to agents that have the capabilities demanded by the build definition.
2. Check whether the service "Visual Studio Team Foundation Background Job Agent" is running on the TFS application tier server.
   - If it is not started, start the service.
   - If the status is Running, try restarting the service.
3. Make sure the account that the agent runs under is in the "Agent Pool Service Account" role.
4. Try running the agent under a domain account that is a member of the Build Agent Service Accounts group and belongs to the "Agent Pool Service Account" role, to see whether the agent then works.
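
To make the demands check concrete, here is a minimal sketch of the matching rule described above: an agent only qualifies if it satisfies every demand in the build definition. This is not TFS code. The function name, the sample capability values, and the exact-match comparison are assumptions for illustration; only the two demand forms ("name" for exists, "name -equals value" for equality) are modeled.

```python
from typing import Dict, List


def unmet_demands(demands: List[str], capabilities: Dict[str, str]) -> List[str]:
    """Return the demands that an agent's capabilities do not satisfy.

    Modeled demand forms (a simplification):
      "name"                -> a capability with that name must exist
      "name -equals value"  -> the capability must exist and have that value
    Names and values are compared exactly here, which is an assumption.
    """
    unmet = []
    for demand in demands:
        name, separator, expected = demand.partition(" -equals ")
        name = name.strip()
        if name not in capabilities:
            unmet.append(demand)
        elif separator and capabilities[name] != expected.strip():
            unmet.append(demand)
    return unmet


if __name__ == "__main__":
    # Hypothetical values: copy the real ones from the agent's Capabilities tab.
    agent_capabilities = {
        "msbuild": "present",
        "visualstudio": "present",
        "Agent.OS": "Windows_NT",
    }
    build_demands = ["msbuild", "visualstudio", "Agent.OS -equals Windows_NT", "java"]
    # Any demand listed here would keep the build from being sent to this agent.
    print(unmet_demands(build_demands, agent_capabilities))
```

If the list comes back non-empty for every agent in the queue, the build has no agent it can be sent to, so either install the missing tool on an agent or remove the demand from the definition.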


Mike · Answer

We have just spent five days trying to diagnose this issue and believe we have finally nailed the cause (and the solution!).

TL;DR version:

We're using TFS 2017 Update 3, YMMV. We believe the problem is a result of a badly configured old version of an Elastic Search component which is used by the Code Search extension. If you do not use the Code Search feature please disable or uninstall this extension and report back - we have seen huge improvements as a result.

Detailed explanation:

So what we discovered was that MS have repurposed an Elastic Search component to provide the code search facility within TFS - the service is installed when TFS is installed if you choose to include the search feature.

For those unfamiliar with Elastic, one particularly important aspect is that it uses a multi-node architecture, shifting load between nodes and balancing the workload across the cluster, and herein lies the MS Code Search problem.

The Elastic Search component installed in TFS is (badly) configured to be single node, with a variety of features intentionally suppressed or disabled. With the high water-mark setting set to 85%, as soon as the search data reaches 85% of the available disk space on the data drive, the node stops creating new indexes and will only accept data to existing indexes.

In a normal Elastic cluster, this would cause another node to create a new index to accept the new data but, since MS have lobotomised the cluster down to one node, the fall-back... is the same node - rinse and repeat.

The behaviour we saw, looking at the communications between the build agent and the build controller, suggests that the build controller tries to communicate with Elastic and eventually fails. Over time, Elastic becomes more unresponsive and chokes this communication, which manifests as the controller taking longer and longer to respond to build requests.

It is only because we actually use Elastic Search that we were able to interpret the behaviour and logs to come to this conclusion. Without that knowledge it would be almost impossible to determine the actual cause.

How to fix this?

There are a number of ways that you can fix this:

Don't install the TFS Search feature

If you don't want to use the Code Search feature, don't install it. The problem will not occur.

Remove the TFS Search feature [what we did]

If you don't use the Code Search feature, uninstall it. The problem will go away - you can either just disable the extension in all collections or you can use the server installer to fully remove it. I have detailed instructions from MS for anyone who wants to eradicate it completely, just ask.

Point the Search feature to a properly configured, real Elastic cluster

If you use Elastic properly, rather than stuffing it in a small box on its own, the problem will not occur.

Ensure the search data disk never hits the 85% water-mark

Elastic will continue to function "properly" and should return search results as expected, within the limited parameters.
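
One way to keep an eye on this is a small scheduled check of the search data drive. The sketch below is only an illustration: the 85% figure comes from the description above, the function names are made up, and the path passed to check_path is a placeholder that you would replace with the drive holding your search index data.

```python
import shutil


def headroom_bytes(used_bytes: int, total_bytes: int, watermark_percent: int = 85) -> int:
    """Bytes that can still be written before usage reaches the watermark.

    A negative result means the disk is already past the watermark.
    Integer arithmetic is used to avoid floating-point rounding.
    """
    limit = total_bytes * watermark_percent // 100
    return limit - used_bytes


def check_path(path: str, watermark_percent: int = 85) -> int:
    """Measure the disk that holds `path` and return its remaining headroom."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return headroom_bytes(usage.used, usage.total, watermark_percent)


if __name__ == "__main__":
    # "." is a placeholder; point this at the search data drive instead.
    remaining = check_path(".")
    if remaining < 0:
        print(f"Past the watermark by {-remaining} bytes")
    else:
        print(f"{remaining} bytes left before the watermark")
```

Run from a scheduled task, a check like this gives warning before the drive crosses the threshold rather than after builds have started to stall.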

Hope this helps someone out there avoid the pain we suffered.
