Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

WCF timeouts are a nightmare

Tags:

We have a bunch of WCF services that work almost all of the time, using various bindings, ports, max sizes, etc. The super-frustrating thing about WCF is that when it (rarely) fails, we are powerless to find out why it failed. Sometimes you will get a message that looks like this:

System.ServiceModel.CommunicationException: The socket connection was aborted. This could be caused by an error processing your message or a receive timeout being exceeded by the remote host, or an underlying network resource issue. Local socket timeout was '01:00:00'. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

The problem is that the local socket timeout it's giving you is merely an attempt to be convenient. It may or may not be the cause of the problem. But OK, sometimes networks have issues. No big deal. We can retry or something. But here's the huge problem. On top of failing to tell you which precisely which timeout (if any) resulted in the failure ("your server-side receive timeout was exceeded," or something, would be helpful), WCF seems to have two types of timeouts.

Timeout Type #1) A timeout, that, if increased, would increase the chance of your operation's success. So, the pertinent timeout is an hour, you are uploading a huge file that will take an hour and twenty minutes. It fails. You increase the timeout, it succeeds. I have no no problem with this type of timeout.

Timeout Type #2) A timeout which merely defines how long you have to wait for the service to actually fail and give you an error, but modifying the value of this timeout has no impact on the chance of success. Basically, something happens during the first second of the service request which mucks things up. It will never recover. WCF doesn't magically retry the network connection for you. Fine, sometimes establishing a network connection doesn't go well. But, if your timeout is 2 hours, you have to wait 2 whole hours with no chance of it ever working before it finally acknowledges that it didn't work and gives you the error.

But the error you see in both cases looks the same. With timeout Type #2, it still looks like you are running into a timeout. But, you could increase all of your timeouts to 4 years, and all it would do is make it take 4 years to get an error message. I know that Type #2 exists because I can do an operation that is known to complete in less than a minute when successful, and have it take 2 hours to fail. But, if I kill it and retry, it succeeds quickly. (If you are wondering why there might be a 2 hour timeout on an operation that takes less than a minute, there are times I run the operation with a much larger file and it could take over an hour.)

So, to combat the problem with Type #2, you'd want your timeout to be really quick so you immediately know if there is a problem. Then you can retry. But the insurmountable problem is that because I don't know which timeouts are the cause of failure, I don't know what timeouts are Type #1 and which ones are Type #2. There may be one timeout (let's say the client-side send timeout) that acts like Type #1 in some cases and Type #2 in others. I have no idea, and I have no way of finding out.

Does anyone know how to track down Type #2 timeouts so I can set them to low values without having to shorten actual (read: Type #1) timeouts and lower the chance of success?

Thank you.

Clarification of Type #2 timeouts in response to Andrew Anderson's comment:

My belief is that something goes wrong between the client request and the code starting to execute on the server. In all cases where we have the server code indicate partial progress, it's never finished some of the operation without finishing the whole thing. So, the server code never gets to execute, and how long it would take to execute is irrelevant (other than that it affects what we set our timeout values to in the first place in order to accommodate it).

like image 551
Greg Smalter Avatar asked Jun 09 '10 15:06

Greg Smalter


1 Answers

I always put a "heartbeat" message in my long-running WCF services. Then you can set Type #1 timeouts to a low value (2-3 times the heartbeat call frequency), and Type #2 timeouts become obvious.

like image 145
Stephen Cleary Avatar answered Oct 21 '22 11:10

Stephen Cleary