Node.js request module getting ETIMEDOUT and ESOCKETTIMEDOUT

Tags: asynchronous, node.js, httprequest, request, sockets

I'm crawling a lot of links in parallel with the request module, in combination with the async module.
I'm noticing a lot of ETIMEDOUT and ESOCKETTIMEDOUT errors, although the links are reachable and respond fairly quickly in Chrome.

I've limited maxSockets to 2 and set the timeout to 10000 in the request options. I'm also using async.filterLimit() with a limit of 2 to cut the parallelism down to 2 requests at a time. So I have 2 sockets, 2 requests, and a timeout of 10 seconds to wait for the response headers from the server, yet I still get these errors.

Here's the request configuration I use:

    {
        ...
        pool: {
            maxSockets: 2
        },
        timeout: 10000,
        time: true
        ...
    }

Here's the snippet of code I use to fetch links:

    var self = this;
    async.filterLimit(resources, 2, function(resource, callback) {
        request({
            uri: resource.uri
        }, function (error, response, body) {
            if (!error && response.statusCode === 200) {
                ...
            } else {
                self.emit('error', resource, error);
            }
            callback(...);
        });
    }, function(result) {
        callback(null, result);
    });

I listened to the error event, and whenever the error code is ETIMEDOUT the error's connect property is sometimes true and sometimes false. So, according to the request docs, sometimes it's a connection timeout and sometimes it's not.
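
To make the distinction concrete, here is a minimal sketch of how the cases could be told apart. It assumes the behaviour described in the request README (an ETIMEDOUT error with connect === true is a connection timeout) and treats ESOCKETTIMEDOUT as an idle-socket (read) timeout. The classifyTimeout helper and its return labels are hypothetical names, not part of request:

```javascript
// Hypothetical helper (not part of the request module): labels the error
// object that request hands to its callback.
function classifyTimeout(error) {
    if (!error) {
        return 'none';
    }
    if (error.code === 'ETIMEDOUT') {
        // Per the request README, connect === true marks a connection timeout;
        // false or undefined means the timeout happened after connecting.
        return error.connect === true ? 'connect-timeout' : 'response-timeout';
    }
    if (error.code === 'ESOCKETTIMEDOUT') {
        // Assumed meaning: the socket stayed idle longer than `timeout`.
        return 'socket-idle-timeout';
    }
    return 'other';
}
```

Logging this label next to resource.uri in the error handler should show whether the failures cluster before or after the connection is established.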

UPDATE: I decided to raise maxSockets to Infinity so that no connection is held up for lack of available sockets:

    pool: {
        maxSockets: Infinity
    }

In order to control the bandwidth, I implemented a requestLoop method that handles each request and takes maxAttempts and retryDelay parameters to control retries:

    async.filterLimit(resources, 10, function(resource, callback) {
        self.requestLoop({
            uri: resource.uri
        }, 100, 5000, function (error, response, body) {
            var fetched = false;
            if (!error) {
                ...
            } else {
                ....
            }
            callback(...);
        });
    }, function(result) {
        callback(null, result);
    });

Implementation of requestLoop:

    requestLoop = function(options, attemptsLeft, retryDelay, callback, lastError) {
        var self = this;
        if (attemptsLeft <= 0) {
            // Out of attempts: report the last recorded error.
            callback((lastError != null ? lastError : new Error('...')));
        } else {
            request(options, function (error, response, body) {
                var recoverableErrors = ['ESOCKETTIMEDOUT', 'ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED'];
                var e;
                if ((error && _.contains(recoverableErrors, error.code)) || (response && (500 <= response.statusCode && response.statusCode < 600))) {
                    // Transient failure: remember it and retry after retryDelay ms.
                    e = error ? new Error('...') : new Error('...');
                    e.code = error ? error.code : response.statusCode;
                    setTimeout((function () {
                        self.requestLoop(options, --attemptsLeft, retryDelay, callback, e);
                    }), retryDelay);
                } else if (!error && (200 <= response.statusCode && response.statusCode < 300)) {
                    callback(null, response, body);
                } else if (error) {
                    e = new Error('...');
                    e.code = error.code;
                    callback(e);
                } else {
                    e = new Error('...');
                    e.code = response.statusCode;
                    callback(e);
                }
            });
        }
    };
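
The retry logic above can also be exercised without the network. The following is a rough variant under stated assumptions, not the original implementation: doRequest stands in for request(), schedule stands in for setTimeout, and the names shouldRetry and retryingRequest are made up for illustration. Unlike requestLoop, it reports the final failure directly instead of passing lastError along:

```javascript
// Error codes treated as transient, copied from requestLoop above.
var RECOVERABLE = ['ESOCKETTIMEDOUT', 'ETIMEDOUT', 'ECONNRESET', 'ECONNREFUSED'];

// True when a failed attempt is worth retrying (transient error or 5xx status).
function shouldRetry(error, response) {
    if (error) {
        return RECOVERABLE.indexOf(error.code) !== -1;
    }
    return Boolean(response) && response.statusCode >= 500 && response.statusCode < 600;
}

// doRequest(options, cb) is a stand-in for request(); schedule(fn, ms) is a
// stand-in for setTimeout. Both are injected so the loop can run offline.
function retryingRequest(doRequest, options, attemptsLeft, retryDelay, callback, schedule) {
    schedule = schedule || setTimeout;
    doRequest(options, function (error, response, body) {
        if (shouldRetry(error, response) && attemptsLeft > 1) {
            schedule(function () {
                retryingRequest(doRequest, options, attemptsLeft - 1, retryDelay, callback, schedule);
            }, retryDelay);
            return;
        }
        if (error) {
            return callback(error);
        }
        if (response.statusCode >= 200 && response.statusCode < 300) {
            return callback(null, response, body);
        }
        // Non-2xx status, or a 5xx after the last attempt.
        var e = new Error('Unexpected status ' + response.statusCode);
        e.code = response.statusCode;
        callback(e);
    });
}
```

Passing a stub for doRequest that fails a fixed number of times makes it easy to check how many attempts are really made before the callback fires.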

So, to sum it up:
- Raised maxSockets to Infinity to try to overcome socket connection timeouts
- Implemented a requestLoop method to retry failed requests, controlled by maxAttempts and retryDelay
- The maximum number of concurrent requests is still set by the number passed to async.filterLimit

I want to note that I've also played with all of the settings here in order to get error-free crawling, but so far those attempts have failed as well.

I'm still looking for help with solving this problem.

UPDATE2: I've decided to drop async.filterLimit and build my own limiting mechanism. I have just 3 variables to help me achieve this:
- pendingRequests - an array that holds all pending requests (explained below)
- activeRequests - the number of active requests
- maxConcurrentRequests - the maximum number of allowed concurrent requests

Into the pendingRequests array I push an object containing a reference to the requestLoop function, as well as an arguments array holding the arguments to be passed to it:

    self.pendingRequests.push({
        "arguments": [{
            uri: resource.uri.toString()
        }, self.maxAttempts, function (error, response, body) {
            if (!error) {
                if (self.policyChecker.isMimeTypeAllowed((response.headers['content-type'] || '').split(';')[0]) &&
                    self.policyChecker.isFileSizeAllowed(body)) {
                    self.totalBytesFetched += body.length;
                    resource.content = self.decodeBuffer(body, response.headers["content-type"] || '', resource);
                    callback(null, resource);
                } else {
                    self.fetchedUris.splice(self.fetchedUris.indexOf(resource.uri.toString()), 1);
                    callback(new Error('Fetch failed because a mime-type is not allowed or file size is bigger than permited'));
                }
            } else {
                self.fetchedUris.splice(self.fetchedUris.indexOf(resource.uri.toString()), 1);
                callback(error);
            }
            self.activeRequests--;
            self.runRequest();
        }],
        "function": self.requestLoop
    });
    self.runRequest();

You'll notice the call to runRequest() at the end. This function's job is to manage the queue and fire requests whenever it can, while keeping activeRequests under the limit of maxConcurrentRequests:

    var self = this;
    process.nextTick(function() {
        var next;
        if (!self.pendingRequests.length || self.activeRequests >= self.maxConcurrentRequests) {
            return;
        }
        self.activeRequests++;
        next = self.pendingRequests.shift();
        next["function"].apply(self, next["arguments"]);
        self.runRequest();
    });

This should solve any timeout errors. In my testing, though, I've still noticed some timeouts on specific websites. I can't be 100% sure, but I think this is due to the HTTP server behind the website limiting each user's requests by checking the IP address, and as a result returning some HTTP 400 responses to prevent a possible 'attack' on the server.
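
For anyone who wants to try the limiting idea from UPDATE2 in isolation, here is a stripped-down sketch. It is not the crawler's actual code: createLimiter, push and stats are hypothetical names, and a task is assumed to be a function that receives a done callback and calls it when its request has finished.

```javascript
// Hypothetical stand-alone limiter: runs at most maxConcurrent tasks at once.
// A task is a function(done) that calls done() when its work is finished.
function createLimiter(maxConcurrent) {
    var pending = [];
    var active = 0;

    // Guards against a task calling done() twice, which would corrupt `active`.
    function once(fn) {
        var called = false;
        return function () {
            if (!called) {
                called = true;
                fn();
            }
        };
    }

    // Starts queued tasks until the concurrency limit is reached.
    function drain() {
        while (active < maxConcurrent && pending.length > 0) {
            var task = pending.shift();
            active++;
            task(once(function () {
                active--;
                drain();
            }));
        }
    }

    return {
        push: function (task) {
            pending.push(task);
            drain();
        },
        stats: function () {
            return { active: active, pending: pending.length };
        }
    };
}
```

Unlike runRequest(), this version does not defer with process.nextTick, so it assumes tasks finish asynchronously; a long queue of tasks that call done() synchronously would nest calls deeply.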

222

asked Feb 14 '16 01:02

Jorayen

2 Answers

Edit: duplicate of https://stackoverflow.com/a/37946324/744276

By default, Node has 4 workers to resolve DNS queries. If your DNS queries take a long-ish time, requests will block in the DNS phase, and the symptom is exactly ESOCKETTIMEDOUT or ETIMEDOUT.

Try increasing your uv thread pool size:

    export UV_THREADPOOL_SIZE=128
    node ...

or in index.js (or wherever your entry point is):

    #!/usr/bin/env node
    process.env.UV_THREADPOOL_SIZE = 128;

    function main() {
        ...
    }

Edit 1: I also wrote a blog post about it.

Edit 2: if queries are non-unique, you may want to use a cache, like nscd.
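
If running a system-level cache such as nscd is not an option, an in-process cache is a different technique with a similar effect. The sketch below is only an illustration under assumptions: createCachedLookup is a made-up helper, the 60-second TTL is arbitrary (dns.lookup does not expose record TTLs), and it only helps if your HTTP client forwards a lookup option to Node's http.request, which I have not verified for request.

```javascript
var dns = require('dns');

// Hypothetical helper: wraps a dns.lookup-style function with an in-memory
// cache keyed by hostname and options. Errors are never cached.
function createCachedLookup(lookupFn, ttlMs) {
    var resolve = lookupFn || dns.lookup;
    var ttl = ttlMs || 60000; // assumed TTL in milliseconds
    var cache = new Map();

    return function cachedLookup(hostname, options, callback) {
        if (typeof options === 'function') {
            callback = options;
            options = {};
        }
        var key = hostname + '|' + JSON.stringify(options);
        var hit = cache.get(key);
        if (hit && hit.expires > Date.now()) {
            // Answer asynchronously, as dns.lookup would.
            process.nextTick(function () {
                callback.apply(null, [null].concat(hit.results));
            });
            return;
        }
        resolve(hostname, options, function (err) {
            if (err) {
                return callback(err);
            }
            // Keep every result argument (address and family, or an address list).
            var results = Array.prototype.slice.call(arguments, 1);
            cache.set(key, { results: results, expires: Date.now() + ttl });
            callback.apply(null, [null].concat(results));
        });
    };
}
```

Repeated lookups of the same hostname within the TTL then skip the thread pool entirely, at the cost of possibly serving a stale address.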

166

answered Sep 29 '22 09:09

Motiejus Jakštys

I found that when there are too many asynchronous requests, the ESOCKETTIMEDOUT exception occurs on Linux. The workaround I've found is to pass these options to request():

    agent: false, pool: {maxSockets: 100}

Note that after this change the timeout can be misleading, so you might need to increase it.

answered Sep 29 '22 08:09

cancerbero
