Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Random and occasional network error (NSURLErrorDomain Code=-1001 and NSURLErrorDomain Code=-1005)

The last couple of days I've tried to debug a network error from d00m. I'm starting to run out of ideas/leads and my hope is that other SO users have valuable experience that might be useful. I hope to be able to provide all relevant information, but I'm not personally in control of the server environments.

The whole thing started by users noticing a couple of "network errors" in our app. The error seemed to occur randomly, without any noticeable pattern related to internet connectivity, iOS version or backend updates. The two errors that occurs behind the scenes are:

Error Domain=NSURLErrorDomain Code=-1001 "The request timed out."

and more frequently:

Error Domain=kCFErrorDomainCFNetwork Code=-1005 "The network connection was lost.

After debugging it for a couple of days, I've managed to reproduce these errors (occurring at random) by firing approx. 10 random (GET and POST) requests towards our backend with a random sleep timer between each request (set at 1-20 seconds). However, it only occurs in periods. What I've experienced the last couple of days is that when a "period of error" starts, I get one of the two errors every once or twice I run the code (meaning an error rate of 1/10 or 1/20 requests). This error rate continues for a couple of hours and then the error disappears for a couple of hours and then it starts all over.

Some quick facts about the setup:

  • Happens on device and simulator
  • Happens on iOS 8.4 and iOS 7.1 - although v. 8.4 is the main one I use for testing.
  • We use NSURLSession for our network requests. We also have AFNetworking included (updated to latest version), but we only use the Security part for SSL Pinning. Even with SSL pinning totally turned off, the error still occurs.

Some findings I've written down during the last couple of days:

  • It seems to only happen on our production environments which has some different configuration as our staging environments. This lead me to think that it might be related to the keep-alive bug as discussed here and here. However, our ops department have set up a new staging environment sending the same keep-alive header as the production environments, but this did not make the error occur on the staging environment.
  • Our Android version of the app were unable to reproduce the error using the same setup of requests. Further, we've not received any customer issues on "network errors" in the Android app.

My gut feeling says that it's related to the server environment and the HTTP implementation in iOS. I'm however unable to track down a convincing pattern that proves anything. I've made the same setup using a simple Rails script, and when the next "error period" occur, I'll be ready to try and reproduce it outside of iOS land. I'll update the question when this happens.

I'm not looking for solutions involving resetting wifi settings, shutting down the simulator or similar as I do not see this as feasible solutions in a production environment. I've also considered making the retry-loop-fix as mentioned in the GitHub issue, but I see this as a last resort.

Please let me know if you need any more information.

like image 514
Steffen D. Sommer Avatar asked Sep 10 '15 12:09

Steffen D. Sommer


1 Answers

In my experience, those sorts of problems usually point to massive packet loss, particularly over a cell network, where minor variations in multipath interference and other issues can make the difference between reliably passing traffic and not.

Another possibility that comes to mind is poor-quality NAT implementations, in the unlikely event that your server's timeout interval is long enough to cause the NAT to give up on the TCP connection.

Either way, the only way to know for sure what's happening is to take a packet trace. To do that, connect a Mac to the Internet via a wired connection, enable network sharing over Wi-Fi, and connect the iOS device to that Wi-Fi network. Then run Wireshark and tell it to monitor the bridge interface. Instructions here:

http://www.howtogeek.com/104278/how-to-use-wireshark-to-capture-filter-and-inspect-packets/

From there, you should be able to see exactly what is being sent and when. That will probably go a long way towards understanding why it is failing.

like image 57
dgatwood Avatar answered Nov 10 '22 11:11

dgatwood