I am in the process of testing a TFS 2013 to TFS 2018 onprem upgrade. I have installed 2018.1 on a new system (and upgraded a copy of my TFS databases). I have installed a build agent on a new host which shows up under Agent Queues (as online and enabled).
I'm now trying to create a build. I set things up as I feel they should be and it sits at this screen:
Build
Waiting for an available agent
Console
Waiting for an agent to be requested
The VSTS Agent service is running on the build agent system. so I feel that is OK. I'm somewhat at a loss. Any assistance is appreciated.
Just try the below items to narrow down the issue:
Check if the service "Visual Studio Team Foundation Background Job Agent" is running on the TFS application tier server.
If it's not started, just start the service.
If the status is Running, just try to Restart the service.

We have just spent five days trying to diagnose this issue and believe we have finally nailed the cause (and the solution!).
TL;DR version:
We're using TFS 2017 Update 3, YMMV. We believe the problem is a result of a badly configured old version of an Elastic Search component which is used by the Code Search extension. If you do not use the Code Search feature please disable or uninstall this extension and report back - we have seen huge improvements as a result.
Detailed explanation:
So what we discovered was that MS have repurposed an Elastic Search component to provide the code search facility within TFS - the service is installed when TFS is installed if you choose to include the search feature.
For those unfamiliar with Elastic, one particularly important aspect is that it uses a multi-node architecture, shifting load between nodes and balancing the workload across the cluster and herein lies the MS Code Search problem.
The Elastic Search component installed in TFS is (badly) configured to be single node, with a variety of features intentionally suppressed or disabled. With the high water-mark setting set to 85%, as soon as the search data reaches 85% of the available disk space on the data drive, the node stops creating new indexes and will only accept data to existing indexes.
In a normal Elastic cluster, this would cause another node to create a new index to accept the new data but, since MS have lobotomised the cluster down to one node, the fall-back... is the same node - rinse and repeat.
The behaviour we saw, looking at the communications between the build agent and the build controller, suggests that the Build Controller tries to communicate with Elastic and eventually fails. Over time, Elastic becomes more unresponsive and chokes this communication which manifests as the controller taking longer and longer to respond to build requests.
It is only because we actually use Elastic Search that we were able to interpret the behaviour and logs to come to this conclusion. Without that knowledge it would be almost impossible to determine the actual cause.
How to fix this?
There are a number of ways that you can fix this:
If you don't want to use the Code Search feature, don't install it. The problem will not occur.
If you don't use the Code Search feature, uninstall it. The problem will go away - you can either just disable the extension in all collections or you can use the server installer to fully remove it. I have detailed instructions from MS for anyone who wants to eradicate it completely, just ask.
If you use Elastic properly, rather than stuffing it in a small box on its own, the problem will not occur.
Elastic will continue to function "properly" and should return search results as expected, within the limited parameters.
Hope this helps someone out there avoid the pain we suffered.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With