
Node request for certain site results in ETIMEDOUT error most of the time

Specs

Here's some background info on the system I'm running:

  • Ubuntu v 14.04

  • Node v4.4.0

  • Node request module v2.69.0

All of this runs on a DigitalOcean droplet/server in a New York-based data center.

 

Problem Description

So I run the following js file:

var request = require('request');

var url = 'http://www.supremenewyork.com/';

request(url, function(err, res, body) { 
  if (err) {
    console.log(err);
    return;
  }

  console.log('body:', body);
});

I run this on my droplet. Initially roughly 70-80% of my attempts failed; now every single time I try this, I get an ETIMEDOUT error like so:

{ [Error: connect ETIMEDOUT 52.6.25.180:80]
  code: 'ETIMEDOUT',
  errno: 'ETIMEDOUT',
  syscall: 'connect',
  address: '52.6.25.180',
  port: 80 }

Of note, the errors seem to come in 'waves'. That is, I'll manage to get a handful of requests through for a certain period of time, followed by a string of ETIMEDOUT errors. Errors happen more often than I am able to get my requests through by a ratio of approximately 3:1 errors to successes.

On my own computer (Mac running OS X El Capitan), running the js file for the given site works with 100% success (i.e. I've never run into this problem before)... so I'm not sure why the problem is contained to my droplet.

Any pointers would be appreciated.

 

Research/Similar Posts:

  • Node.js 0.4.10. http get( ) request " ETIMEDOUT Connection timed out " frequently

  • Why can't I ping herokuapp <-- starting to get a better picture of what's going on here...

  • Problem with http GET request on node js <-- seemed helpful at first (later realized setting User-Agent probably does nothing significant)

 

Additional Info

I also feel that it's worth mentioning the site I'm trying to make requests at actively has a problem with scripts and web scrapers, so I wouldn't be surprised if they tried everything in the book to prevent this from taking place.

 

Possible Causes

  • IP address blocking --> at first I ruled this out, as I would still occasionally get responses from the server. Now I am no longer able to get any sort of response from the server. This might be the cause, but I am really confused about how they might be doing this. No issues on my local machine, no issues requesting their page from a browser on my droplet, but then this.

  • 'Rate-limiting' of my requests --> if this is somehow the case, I would like to know why this is happening specifically on my server and not, say, on my local machine

  • The manner in which I'm making my requests (i.e. not through a browser). --> I don't think this is the case because I can run the first script with a 100% response rate on my local computer (unless there is something my local computer does before sending my request to their server).

  • The system itself. I've only tested the first script on my Mac. Perhaps the code runs differently on different OS's/systems..?

 

Diagnosing with traceroute

So as per @RabeeAbdelWahab's suggestion, I attempted to diagnose the problem with traceroute. However, I have practically no knowledge of networks so I'm not sure how to proceed. Here's an example output:

traceroute to <> (XXX.XXX.XXX.XXX), 30 hops max, 60 byte packets
 1  45.55.192.254 (45.55.192.254)  8.903 ms  8.879 ms  8.865 ms
 2  162.243.188.229 (162.243.188.229)  1.028 ms 162.243.188.233 (162.243.188.233)  0.986 ms  1.004 ms
 3  xe-0-9-0-17.r08.nycmny01.us.bb.gin.ntt.net (129.250.204.113)  1.923 ms  1.918 ms nyk-b3-link.telia.net (62.115.45.5)  1.587 ms
 4  ae-11.amazon.nycmny01.us.bb.gin.ntt.net (129.250.201.138)  1.935 ms ae-10.amazon.nycmny01.us.bb.gin.ntt.net (129.250.201.134)  1.586 ms *
 5  nyk-b5-link.telia.net (213.155.131.137)  1.822 ms * *
 6  * * 62.115.32.130 (62.115.32.130)  1.361 ms
 7  * * *
 8  * * *
 9  * * *
10  54.239.110.157 (54.239.110.157)  33.817 ms * 54.239.110.133 (54.239.110.133)  27.683 ms
11  54.239.111.17 (54.239.111.17)  8.193 ms 205.251.244.128 (205.251.244.128)  7.883 ms 54.239.111.23 (54.239.111.23)  9.319 ms
12  205.251.245.55 (205.251.245.55)  8.253 ms 54.239.110.175 (54.239.110.175)  24.601 ms 205.251.244.195 (205.251.244.195)  8.250 ms
13  * 54.239.111.27 (54.239.111.27)  9.319 ms 54.239.111.29 (54.239.111.29)  9.290 ms
14  * * *
15  54.239.111.23 (54.239.111.23)  9.136 ms * *
16  * * *
17  * * *
18  * * *
19  * * *
20  * * *
21  * * *
22  * * *
23  * * *
24  * * *
25  * * *
26  * * *
27  * * *
28  * * *
29  * * *
30  * * *

 

So after running traceroute several more times, I notice the following patterns:

  • The "***" outputs begin at some point on or slightly after the 15th hop.

  • The last IP address before the "* * *" hops mostly seems to alternate between the same two addresses: 205.251.XXX.XXX (slightly more often the case) or 54.239.XXX.XXX. In a few select instances I'll get an address like 72.21.222.155.

In addition, I have seen no differences when:

  • Running traceroute with the -m 255 option (i.e. max number of hops).

  • Running traceroute with the -I option.

  • Running traceroute with the -e option.

  • Running traceroute with the -p 80 or -p 25 options.

  • Running traceroute on a different droplet located in the same data center as the droplet in question.

 

Diagnosing with ping

Using ping, here's a running list of sites I can and cannot connect to:

Can connect

  • google.com

  • facebook.com

  • reddit.com

  • github.com

  • stackoverflow.com

  • youtube.com

  • twitter.com

Can't connect:

  • amazon.com

  • microsoft.com

  • apple.com

  • walmart.com

  • paypal.com

  • cnn.com

  • nyt.org

  • wolframalpha.com

Observations: Is there a reason why I seem to be able to connect to sites that have 'social' features (and otherwise not)?

 

Apparently, it's common for sites not to reply to ICMP (which is what ping and traceroute rely on). Please disregard the above...
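
Since ICMP replies are often dropped, a plain TCP connection attempt to port 80 is closer to what the failing script does (the error above reports syscall: 'connect'). Here is a minimal sketch using Node's built-in net module. The checkTcp helper name and the 5000 ms timeout in the usage comment are illustrative assumptions, not part of the original tests:

```javascript
var net = require('net');

// Try to open a TCP connection to host:port and report the outcome.
// Resolves with { ok: true } on success, or { ok: false, reason: ... }
// when the attempt times out or fails with a socket error.
function checkTcp(host, port, timeoutMs) {
  return new Promise(function (resolve) {
    var socket = net.connect({ host: host, port: port });
    var done = false;

    function finish(result) {
      if (done) return;   // report only the first outcome
      done = true;
      socket.destroy();   // release the socket either way
      resolve(result);
    }

    socket.setTimeout(timeoutMs);
    socket.on('connect', function () { finish({ ok: true }); });
    socket.on('timeout', function () { finish({ ok: false, reason: 'timeout' }); });
    socket.on('error', function (err) { finish({ ok: false, reason: err.code }); });
  });
}

// Example usage (run on the droplet and on the local machine to compare):
// checkTcp('www.supremenewyork.com', 80, 5000).then(console.log);
```

If this reports a timeout on the droplet while succeeding elsewhere, the connection is probably being dropped before any HTTP headers are sent, which would also explain why the User-Agent makes no difference.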

 

Additional findings

So I've noticed that if I modify my request to take an additional 'User-Agent' header (code example provided below), I'm able to initially get back the html response.

var request = require('request');

var requestOptions = 
{
    url: 'http://www.supremenewyork.com/some/route',
    headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
};

request(requestOptions, function(err, res, body) { 
  if (err) {
    console.log(err);
    return;
  }

  console.log('body:', body);
});

I'm actually able to get back a response using the above method a few times. Afterwards, it seems all my connections lead to the aforementioned ETIMEDOUT error. Then I'll have to wait some lengthy period of time and it's rinse, wash, and repeat.

I actually performed a simple two-tailed proportional test for the above (i.e. receiving a response with and without a 'User-Agent' header) and got a p-value of 0.8493... so no statistical significance between the two. Again, please disregard the aforementioned...
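
For reference, a two-tailed test of two proportions can be sketched as below. This is a pooled z-test with a normal approximation, which may not be exactly the test used above. The function names are my own, and the counts in the usage comment are hypothetical placeholders, not the actual measurements:

```javascript
// Abramowitz-Stegun approximation of erf (formula 7.1.26), max error ~1.5e-7.
function erf(x) {
  var sign = x < 0 ? -1 : 1;
  x = Math.abs(x);
  var t = 1 / (1 + 0.3275911 * x);
  var poly = ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t
    - 0.284496736) * t + 0.254829592) * t;
  return sign * (1 - poly * Math.exp(-x * x));
}

// Cumulative distribution function of the standard normal.
function normalCdf(z) {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Pooled two-proportion z-test, two-tailed.
// success1/n1: responses received / attempts with the User-Agent header
// success2/n2: responses received / attempts without it
// Note: the result is NaN if every attempt succeeded or every attempt failed.
function twoProportionTest(success1, n1, success2, n2) {
  var p1 = success1 / n1;
  var p2 = success2 / n2;
  var pooled = (success1 + success2) / (n1 + n2);
  var se = Math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2));
  var z = (p1 - p2) / se;
  return { z: z, pValue: 2 * (1 - normalCdf(Math.abs(z))) };
}

// Example usage with hypothetical counts:
// console.log(twoProportionTest(12, 50, 11, 50));
```

A large p-value, like the 0.8493 reported above, means the observed difference in success rates is easily explained by chance.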

asked Apr 01 '16 13:04 by youngrrrr


1 Answer

Since you said they had issues and are trying to prevent scraping or something, you may be subject to those efforts. Why would you need to keep hitting their page so often?

I think if you really want it to work you are going to need to fool their anti-scraping systems (firewall or whatever). So you can try using a droplet in a different data center/city and also try adding headers to imitate a web browser. User-Agent would be the first I would try.

var options = {
  url: "http://www.supremenewyork.com/", // request needs the protocol in the URL
  headers: { "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" }
};

Also make sure you don't hit their site too often and get rate limited.
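
One way to avoid hammering the site is to space out retries with an increasing delay. Here is a rough sketch, not a guaranteed fix: the retryWithBackoff helper and the numbers in the usage comment are made up for illustration, and it will not help if the IP is blocked outright.

```javascript
// Call attempt() (a function returning a Promise) up to maxTries times.
// After each failure, wait baseDelayMs, then 2x, then 4x, ... before retrying.
// Resolves with the first successful result, or rejects with the last error.
function retryWithBackoff(attempt, maxTries, baseDelayMs) {
  return new Promise(function (resolve, reject) {
    function tryOnce(n) {
      attempt().then(resolve, function (err) {
        if (n + 1 >= maxTries) {
          reject(err); // out of tries, give up
          return;
        }
        var delay = baseDelayMs * Math.pow(2, n);
        setTimeout(function () { tryOnce(n + 1); }, delay);
      });
    }
    tryOnce(0);
  });
}

// Example usage with the request module (assumes it is installed):
// var request = require('request');
// retryWithBackoff(function () {
//   return new Promise(function (resolve, reject) {
//     request({ url: 'http://www.supremenewyork.com/', timeout: 10000 },
//       function (err, res, body) { if (err) reject(err); else resolve(body); });
//   });
// }, 5, 2000).then(function (body) { console.log('body:', body); }, console.log);
```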

answered Oct 15 '22 19:10 by Jason Livesay