
AppFabric Cache seems unstable

We're trying to use the AppFabric distributed cache. After a lot of back and forth with non-domain servers, we finally put them in a domain, which made installation and setup a bit easier. We got it up and running after fighting through a ton of errors, most of which seem trivial for AppFabric to test for or to report with a more descriptive error message. "Temporary error" does not explain a lot...

But there are still issues.

We set up 3 servers, one of which is the "lead". We finally got the cache working and confirmed it by pointing a Network Load Balancer at one server at a time, verifying that we can set a cache entry on one server and retrieve it from another.

Then I restarted the AppFabric Caching service on all servers, and suddenly the cache stopped working. Get-CacheHost reports that the hosts are up, but we get exceptions like:

ErrorCode<ERRCA0018>:SubStatus<ES0001>:The request timed out
ErrorCode<ERRCA0017>:SubStatus<ES0001>:There is a temporary failure. Please retry later.

Why would this error condition occur by simply restarting the services?
Is AppFabric Cache really ready for production use?
What happens if a server goes offline? Long timeouts?
Are we dependent on the "lead" server being up?

I suspect it will be back up after 5-10 minutes of R&R. It seems to come back by itself sometimes.

Update: It did come up after a few minutes. We have now tested removing one server from the cluster, which resulted in a long timeout and finally an exception.
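
Since the second message says "Please retry later", one client-side way to ride out these windows is a retry loop with exponential backoff. The following is a minimal Python sketch of that idea only, not AppFabric's API: `TransientCacheError` and `call_with_retry` are hypothetical names, and mapping ERRCA0017/ERRCA0018 to a retryable exception is an assumption.

```python
import time


class TransientCacheError(Exception):
    """Hypothetical stand-in for a retryable cache failure (assumed: ERRCA0017 / ERRCA0018)."""


def call_with_retry(operation, attempts=5, initial_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run operation(); on TransientCacheError, wait and retry with exponential backoff.

    The last error is re-raised once all attempts are used up.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientCacheError:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)  # double the wait, capped at max_delay
```

A caller would wrap each cache read or write, for example `call_with_retry(lambda: cache_get("key"))`, where `cache_get` is whatever function performs the actual lookup. This only hides short outages; it does not help with a cluster that stays down for minutes.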

Tedd Hansen asked Jan 20 '11 11:01


1 Answer

We have been debugging this for some time and I'm sharing what we have found so far.

  • UAC on Windows 2008 blocks access to the local computer, so commands directed at the local computer will fail. Start PowerShell as administrator, or turn off UAC completely, to get around this.
  • Simply editing the config file manually will not work. You need to use the export and import commands.
  • Firewalls are a major issue: the installer opens the 222* range of ports, but the PowerShell tools use other Windows services. Turning off the firewall on all servers (not recommended) solved the problem.
  • If a server is removed from the cluster, there is an initial timeout before the cluster can operate again.
  • After a restart, the cluster takes 2-5 minutes to come back up.
  • If one server is unreachable during a restart, startup takes longer.
  • If the server hosting the file share for the shared config is not reachable, the services will not start. We tried to solve this by giving each server a private share.
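
To narrow down the firewall issue without turning the firewall off, a quick TCP reachability check between the cache hosts can help. This is a rough Python sketch, not an AppFabric tool: the host names are hypothetical, and the port numbers 22233-22236 are assumed defaults within the 222* range, so substitute whatever your cluster actually uses. It only tests plain TCP connections and says nothing about the other Windows services the PowerShell tools rely on.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_hosts(hosts, ports, timeout=2.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(h, p): port_is_open(h, p, timeout) for h in hosts for p in ports}


if __name__ == "__main__":
    # Hypothetical host names; the ports are assumed defaults, adjust as needed.
    cache_hosts = ["cache1", "cache2", "cache3"]
    cache_ports = [22233, 22234, 22235, 22236]
    for (host, port), ok in sorted(check_hosts(cache_hosts, cache_ports).items()):
        print(f"{host}:{port} {'open' if ok else 'BLOCKED or down'}")
```

Running it from each server in turn shows which direction of traffic is blocked, which is easier to act on than a generic timeout.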
Tedd Hansen answered Oct 20 '22 04:10