Here's some background info on the system I'm running:
Ubuntu v14.04
Node v4.4.0
Node request module v2.69.0
All of this on a DigitalOcean droplet/server on a New York-based center.
So I run the following js file:
var request = require('request');
var url = 'http://www.supremenewyork.com/';
request(url, function(err, res, body) {
if (err) {
console.log(err);
return;
}
console.log('body:', body);
});
I run this on my droplet. At first roughly 70-80% of my attempts failed; now every single attempt fails with an ETIMEDOUT error like so:
{ [Error: connect ETIMEDOUT 52.6.25.180:80]
code: 'ETIMEDOUT',
errno: 'ETIMEDOUT',
syscall: 'connect',
address: '52.6.25.180',
port: 80 }
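For reference, the fields in that error object already say which phase failed: syscall 'connect' means the TCP handshake itself never completed, as opposed to a response that stalled after connecting. A minimal sketch of reading those fields follows; classifyRequestError is a hypothetical helper name, and the ESOCKETTIMEDOUT branch assumes the request module's code for read timeouts.

```javascript
// Hypothetical helper: turn an error object like the one above into a short label.
function classifyRequestError(err) {
  if (!err) return 'no error';
  if (err.code === 'ETIMEDOUT' && err.syscall === 'connect') {
    // The TCP handshake never completed, so no HTTP data was exchanged at all.
    return 'connect timeout to ' + err.address + ':' + err.port;
  }
  if (err.code === 'ETIMEDOUT' || err.code === 'ESOCKETTIMEDOUT') {
    // Assumption: the connection was established but the response stalled.
    return 'read timeout';
  }
  return 'other error: ' + (err.code || err.message);
}

// The fields below are copied from the error shown in the question.
var sample = {
  code: 'ETIMEDOUT',
  errno: 'ETIMEDOUT',
  syscall: 'connect',
  address: '52.6.25.180',
  port: 80
};

console.log(classifyRequestError(sample));
```

A connect-phase timeout points at the network path or a firewall silently dropping packets rather than at anything in the HTTP request itself.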
Of note, the errors seem to come in 'waves'. That is, I'll manage to get a handful of requests through for a certain period of time, followed by a string of ETIMEDOUT errors. Errors outnumber successful requests by a ratio of approximately 3:1.
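To make the 'waves' easier to see, one option is to record the outcome of each attempt and collapse consecutive identical outcomes into runs. This is only a sketch: summarizeRuns is a hypothetical helper and the sample log is made-up illustration data, not my real measurements.

```javascript
// Collapse a sequence of outcomes into runs of consecutive identical values.
function summarizeRuns(outcomes) {
  var runs = [];
  outcomes.forEach(function (outcome) {
    var last = runs[runs.length - 1];
    if (last && last.outcome === outcome) {
      last.count += 1; // still inside the same wave
    } else {
      runs.push({ outcome: outcome, count: 1 }); // a new wave starts here
    }
  });
  return runs;
}

// Made-up example log: two successes, three timeouts, one success.
var exampleLog = ['ok', 'ok', 'ETIMEDOUT', 'ETIMEDOUT', 'ETIMEDOUT', 'ok'];
console.log(summarizeRuns(exampleLog));
```

Logging a timestamp alongside each outcome would also show how long each wave lasts.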
On my own computer (Mac running OS X El Capitan), running the js file for the given site works with 100% success (i.e. I've never run into this problem before)... so I'm not sure why the problem is contained to my droplet.
Any pointers would be appreciated.
Node.js 0.4.10. http get( ) request " ETIMEDOUT Connection timed out " frequently
Why can't I ping herokuapp <-- starting to get a better picture of what's going on here...
Problem with http GET request on node js <-- seemed helpful at first (later realized setting User-Agent probably does nothing significant)
I also feel that it's worth mentioning the site I'm trying to make requests at actively has a problem with scripts and web scrapers, so I wouldn't be surprised if they tried everything in the book to prevent this from taking place.
IP address blocking --> I first ruled this out because I would still occasionally get responses from the server, but I am now no longer able to get any sort of response from the server. This might be the cause, but I am really confused about how they might be doing this. No issues on my local machine, no issues requesting their page from a browser on my droplet, but then this.
'Rate-limiting' of my requests --> if this is somehow the case, I would like to know why this is happening specifically on my server and not, say, on my local machine
The manner in which I'm making my requests (i.e. not through a browser). --> I don't think this is the case because I can run the first script with a 100% response rate on my local computer (unless there is something my local computer does before sending my request to their server).
The system itself. I've only tested the first script on my Mac. Perhaps the code runs differently on different OSes/systems..?
So as per @RabeeAbdelWahab's suggestion, I attempted to diagnose the problem with traceroute. However, I have practically no knowledge of networks so I'm not sure how to proceed. Here's an example output:
traceroute to <> (XXX.XXX.XXX.XXX), 30 hops max, 60 byte packets
1 45.55.192.254 (45.55.192.254) 8.903 ms 8.879 ms 8.865 ms
2 162.243.188.229 (162.243.188.229) 1.028 ms 162.243.188.233 (162.243.188.233) 0.986 ms 1.004 ms
3 xe-0-9-0-17.r08.nycmny01.us.bb.gin.ntt.net (129.250.204.113) 1.923 ms 1.918 ms nyk-b3-link.telia.net (62.115.45.5) 1.587 ms
4 ae-11.amazon.nycmny01.us.bb.gin.ntt.net (129.250.201.138) 1.935 ms ae-10.amazon.nycmny01.us.bb.gin.ntt.net (129.250.201.134) 1.586 ms *
5 nyk-b5-link.telia.net (213.155.131.137) 1.822 ms * *
6 * * 62.115.32.130 (62.115.32.130) 1.361 ms
7 * * *
8 * * *
9 * * *
10 54.239.110.157 (54.239.110.157) 33.817 ms * 54.239.110.133 (54.239.110.133) 27.683 ms
11 54.239.111.17 (54.239.111.17) 8.193 ms 205.251.244.128 (205.251.244.128) 7.883 ms 54.239.111.23 (54.239.111.23) 9.319 ms
12 205.251.245.55 (205.251.245.55) 8.253 ms 54.239.110.175 (54.239.110.175) 24.601 ms 205.251.244.195 (205.251.244.195) 8.250 ms
13 * 54.239.111.27 (54.239.111.27) 9.319 ms 54.239.111.29 (54.239.111.29) 9.290 ms
14 * * *
15 54.239.111.23 (54.239.111.23) 9.136 ms * *
16 * * *
17 * * *
18 * * *
19 * * *
20 * * *
21 * * *
22 * * *
23 * * *
24 * * *
25 * * *
26 * * *
27 * * *
28 * * *
29 * * *
30 * * *
So after running traceroute several more times, I notice the following patterns:
The "* * *" outputs begin at some point on or slightly after the 15th hop.
The last IP address before the "* * *" hops mostly seems to alternate between the same two addresses: 205.251.XXX.XXX (slightly more often the case) or 54.239.XXX.XXX. In a few select instances I'll get an address like 72.21.222.155.
In addition, I have seen no differences when:
Running traceroute with the -m 255 option (i.e. max number of hops).
Running traceroute with the -I option.
Running traceroute with the -e option.
Running traceroute with the -p 80 or -p 25 options.
Running traceroute on a different droplet located in the same data center as the droplet in question.
Using ping, here's a running list of sites I can and cannot connect to:
Can connect:
google.com
facebook.com
reddit.com
github.com
stackoverflow.com
youtube.com
twitter.com
Can't connect:
amazon.com
microsoft.com
apple.com
walmart.com
paypal.com
cnn.com
nyt.org
wolframalpha.com
Observations: Is there a reason why I seem to be able to connect to sites that have 'social' features (and otherwise not)?
Apparently, it's common for sites not to reply to ICMP (which is what ping and traceroute rely on), so a failed ping says little about whether a site is reachable over HTTP. Please disregard the above...
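Since ICMP replies turned out to be unreliable, a check that opens a TCP connection on port 80 is closer to what my script actually does. Here is a rough sketch using Node's built-in net module; probeTcp and the 5000 ms timeout in the usage comment are my own illustrative choices.

```javascript
var net = require('net');

// Try to open a TCP connection and report whether the handshake completed.
// The returned promise always resolves (never rejects) with { ok, detail, ms }.
function probeTcp(host, port, timeoutMs) {
  return new Promise(function (resolve) {
    var started = Date.now();
    var socket = net.connect({ host: host, port: port });

    function finish(ok, detail) {
      socket.destroy(); // release the socket whatever the outcome
      resolve({ ok: ok, detail: detail, ms: Date.now() - started });
    }

    socket.setTimeout(timeoutMs);
    socket.once('connect', function () { finish(true, 'connected'); });
    socket.once('timeout', function () { finish(false, 'timeout'); });
    socket.once('error', function (err) { finish(false, err.code || err.message); });
  });
}

// Example usage (not run automatically):
// probeTcp('www.supremenewyork.com', 80, 5000).then(function (result) {
//   console.log(result);
// });
```

If this probe times out on the droplet while the same call succeeds from another network, the problem is below HTTP, so headers like User-Agent cannot be what makes the difference.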
So I've noticed that if I modify my request to take an additional 'User-Agent' header (code example provided below), I'm able to initially get back the html response.
var request = require('request');
var requestOptions =
{
url: 'http://www.supremenewyork.com/some/route',
headers: {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}
};
request(requestOptions, function(err, res, body) {
if (err) {
console.log(err);
return;
}
console.log('body:', body);
});
I'm actually able to get back a response using the above method a few times. Afterwards, it seems all my connections lead to the aforementioned ETIMEDOUT error. Then I'll have to wait some lengthy period of time and it's rinse, wash, and repeat.
I actually performed a simple two-tailed proportional test for the above (i.e. receiving a response with and without a 'User-Agent' header) and got a p-value of 0.8493... so no statistical significance between the two. Again, please disregard the aforementioned...
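For anyone who wants to repeat that kind of check, a two-tailed two-proportion z-test can be sketched as below. The function names are hypothetical, the normal CDF uses a standard polynomial approximation of erf, and the counts in the usage line are placeholders rather than my actual data.

```javascript
// Standard normal CDF via the Abramowitz-Stegun 7.1.26 approximation of erf.
function normalCdf(x) {
  var t = 1 / (1 + 0.3275911 * Math.abs(x) / Math.SQRT2);
  var poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
             t * (-1.453152027 + t * 1.061405429))));
  var erf = 1 - poly * Math.exp(-(x * x) / 2);
  return x >= 0 ? 0.5 * (1 + erf) : 0.5 * (1 - erf);
}

// Two-tailed z-test for the difference between two proportions (pooled variance).
// Note: returns NaN values if every attempt succeeded or every attempt failed.
function twoProportionTest(successA, totalA, successB, totalB) {
  var pA = successA / totalA;
  var pB = successB / totalB;
  var pooled = (successA + successB) / (totalA + totalB);
  var standardError = Math.sqrt(pooled * (1 - pooled) * (1 / totalA + 1 / totalB));
  var z = (pA - pB) / standardError;
  return { z: z, pValue: 2 * (1 - normalCdf(Math.abs(z))) };
}

// Placeholder counts: responses received with and without the User-Agent header.
console.log(twoProportionTest(6, 25, 5, 25));
```

A large p-value here only means the data do not show a difference between the two conditions; with small sample sizes it cannot prove the header has no effect.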
Since you said they had issues and are trying to prevent scraping or something, you may be subject to those efforts. Why would you need to keep hitting their page so often?
I think if you really want it to work you are going to need to fool their anti-scraping systems (firewall or whatever). So you can try using a droplet in a different data center/city and also try adding headers to imitate a web browser. User-Agent would be the first I would try.
var options = {
  url: "http://www.supremenewyork.com/", // request needs a full URL, including the protocol
  headers: { "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36" }
};
Also make sure you don't hit their site too often and get rate limited.
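One way to avoid hammering the site is to wait longer after each failure instead of retrying immediately. The following is a minimal sketch of exponential backoff, not a drop-in solution; backoffDelay, retryWithBackoff, and the option names are made up for illustration, and task is any function that takes a Node-style callback.

```javascript
// Delay before the next retry: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelay(attempt, baseMs, maxMs) {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Run task(callback) until it succeeds or maxAttempts is reached.
function retryWithBackoff(task, opts, done) {
  var attempt = 0;
  (function run() {
    task(function (err, result) {
      if (!err) return done(null, result);
      attempt += 1;
      if (attempt >= opts.maxAttempts) return done(err); // give up, report last error
      setTimeout(run, backoffDelay(attempt - 1, opts.baseMs, opts.maxMs));
    });
  })();
}

// Example usage with the request module (assumes `request` and `options` exist):
// retryWithBackoff(
//   function (cb) { request(options, function (err, res, body) { cb(err, body); }); },
//   { maxAttempts: 5, baseMs: 1000, maxMs: 60000 },
//   function (err, body) { console.log(err || body); }
// );
```

With a 1000 ms base, the waits grow as 1 s, 2 s, 4 s, 8 s, which keeps the request rate low during a string of failures.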