
Node.js GET Request ETIMEDOUT & ESOCKETTIMEDOUT

I'm using Node.js with the async and request modules to crawl 100+ million websites, and I keep running into ESOCKETTIMEDOUT and ETIMEDOUT errors after a few minutes.

It works again after I restart the script. It doesn't seem to be a connection-limit issue, because I can still run resolve4, resolveNs, resolveMx and also curl without delay.

Do you see any issue with the code, or have any advice? I'd like to push the async.queue() concurrency up to at least 1000. Thank you.

    var request = require('request'),
        async = require('async'),
        mysql = require('mysql'),
        dns = require('dns'),
        url = require('url'),
        cheerio = require('cheerio'),
        iconv = require('iconv-lite'),
        charset = require('charset'),
        config = require('./spy.config'),
        pool = mysql.createPool(config.db);

    iconv.skipDecodeWarning = true;

    var queue = async.queue(function (task, cb) {
        dns.resolve4('www.' + task.domain, function (err, addresses) {
            if (err) {
                //
                // Do something
                //
                setImmediate(function () {
                    cb()
                });
            } else {
                request({
                    url: 'http://www.' + task.domain,
                    method: 'GET',
                    encoding:       'binary',
                    followRedirect: true,
                    pool:           false,
                    pool:           { maxSockets: 1000 },
                    timeout:        15000 // 15 sec
                }, function (error, response, body) {

                    //console.info(task);

                    if (!error) {
                        // If ok, do something

                    } else {
                        // If not ok, do these

                        console.log(error);

                        // It keeps erroring here after few minutes, resolve4, resolveNs, resolveMx still work here.

                        // { [Error: ETIMEDOUT] code: 'ETIMEDOUT' }
                        // { [Error: ESOCKETTIMEDOUT] code: 'ESOCKETTIMEDOUT' }

                        var ns = [],
                            ip = [],
                            mx = [];
                        async.parallel([
                            function (callback) {
                                // Resolves the domain's name server records
                                dns.resolveNs(task.domain, function (err, addresses) {
                                    if (!err) {
                                        ns = addresses;
                                    }
                                    callback();
                                });
                            }, function (callback) {
                                // Resolves the domain's IPV4 addresses
                                dns.resolve4(task.domain, function (err, addresses) {
                                    if (!err) {
                                        ip = addresses;
                                    }
                                    callback();
                                });
                            }, function (callback) {
                                // Resolves the domain's MX records
                                dns.resolveMx(task.domain, function (err, addresses) {
                                    if (!err) {
                                        addresses.forEach(function (a) {
                                            mx.push(a.exchange);
                                        });
                                    }
                                    callback();
                                });
                            }
                        ], function (err) {
                            if (err) return next(err);

                            // do something
                        });

                    }
                    setImmediate(function () {
                        cb()
                    });
                });
            }
        });
    }, 200);

    // When the queue is emptied we want to check if we're done
    queue.drain = function () {
        setImmediate(function () {
            checkDone()
        });
    };

    function consoleLog(msg) {
        //console.info(msg);
    }

    function checkDone() {
        if (queue.length() == 0) {
            setImmediate(function () {
                crawlQueue()
            });
        } else {
            console.log("checkDone() not zero");
        }
    }

    function query(sql) {
        pool.getConnection(function (err, connection) {
            if (!err) {
                //console.log(sql);
                connection.query(sql, function (err, results) {
                    connection.release();
                });
            }
        });
    }

    function crawlQueue() {
        pool.getConnection(function (err, connection) {
            if (!err) {
                var sql = "SELECT * FROM domain last_update < (UNIX_TIMESTAMP() - 2592000) LIMIT 500";
                connection.query(sql, function (err, results) {
                    if (!err) {
                        if (results.length) {
                            for (var i = 0, len = results.length; i < len; ++i) {
                                queue.push({"id": results[i]['id'], "domain": results[i]['domain'] });
                            }
                        } else {
                            process.exit();
                        }
                        connection.release();
                    } else {
                        connection.release();
                        setImmediate(function () {
                            crawlQueue()
                        });
                    }
                });
            } else {
                setImmediate(function () {
                    crawlQueue()
                });
            }
        });
    }

    setImmediate(function () {
        crawlQueue()
    });

And the system limits are pretty high.

    Limit                     Soft Limit           Hard Limit           Units
    Max cpu time              unlimited            unlimited            seconds
    Max file size             unlimited            unlimited            bytes
    Max data size             unlimited            unlimited            bytes
    Max stack size            8388608              unlimited            bytes
    Max core file size        0                    unlimited            bytes
    Max resident set          unlimited            unlimited            bytes
    Max processes             257645               257645               processes
    Max open files            500000               500000               files
    Max locked memory         65536                65536                bytes
    Max address space         unlimited            unlimited            bytes
    Max file locks            unlimited            unlimited            locks
    Max pending signals       257645               257645               signals
    Max msgqueue size         819200               819200               bytes
    Max nice priority         0                    0
    Max realtime priority     0                    0
    Max realtime timeout      unlimited            unlimited            us

sysctl

    net.ipv4.ip_local_port_range = 10000    61000

asked Jun 20 '14 05:06 by Tan Hong Tat


2 Answers

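By default, Node's libuv thread pool has 4 threads, and the dns.lookup() calls made for outgoing HTTP requests run on it. If your DNS queries take a long time, requests will block in the DNS phase, and the symptom is exactly ESOCKETTIMEDOUT or ETIMEDOUT. Functions such as dns.resolve4() do not go through the thread pool, which is why they keep working while the requests time out.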

Try increasing your uv thread pool size:

    export UV_THREADPOOL_SIZE=128
    node ...

or in index.js (or wherever your entry point is):

    #!/usr/bin/env node
    process.env.UV_THREADPOOL_SIZE = 128;

    function main() {
       ...
    }
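To see why the pool size matters, here is a rough back-of-the-envelope model of the blocking described above. It is only a sketch: `worstCaseLookupMs` is a hypothetical helper (not a Node API), and it assumes all 200 queued requests (the question's concurrency) start their lookups at once and that every slow lookup takes a uniform 1 second.

```javascript
// Hypothetical helper, not part of Node or of the original answer.
// Models the worst-case time for a burst of simultaneous lookups to finish
// when only `poolSize` of them can run at once and each takes `lookupMs`.
function worstCaseLookupMs(pendingLookups, poolSize, lookupMs) {
    var rounds = Math.ceil(pendingLookups / poolSize);
    return rounds * lookupMs;
}

var timeoutMs = 15000; // the `timeout` value from the question

// Assumed inputs: 200 pending lookups, 1000 ms per slow lookup.
var withDefaultPool = worstCaseLookupMs(200, 4, 1000);  // 50 rounds of lookups
var withLargerPool = worstCaseLookupMs(200, 128, 1000); // 2 rounds of lookups

console.log('pool of 4:   last lookup done after', withDefaultPool, 'ms,',
    withDefaultPool > timeoutMs ? 'past the timeout' : 'within the timeout');
console.log('pool of 128: last lookup done after', withLargerPool, 'ms,',
    withLargerPool > timeoutMs ? 'past the timeout' : 'within the timeout');
```

Under these assumptions the last request in the burst waits far longer than the 15 second timeout with 4 threads, but not with 128. Real lookups are not uniform, so treat the numbers as an illustration only.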

Edit: I also wrote a blog post about it.

answered Sep 22 '22 10:09 by Motiejus Jakštys


10/31/2017: The final solution we found is to use the keepAlive option in an agent. For example:

    var https = require('https');

    // One shared agent so sockets are reused across requests
    var pool = new https.Agent({ keepAlive: true });

    function getJsonOptions(_url) {
        return {
            url: _url,
            method: 'GET',
            agent: pool,
            json: true
        };
    }

Node's default agent seems to default to keepAlive=false, which causes a new connection to be created for each request. When too many connections are created in a short period of time, the above error surfaces. My guess is that one or more routers along the path to the service block the connection requests, probably suspecting a denial-of-service attack. In any case, the code sample above completely solved our problem.
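The effect on connection count can be checked locally. The following is a minimal sketch, not the setup from this answer: it uses plain http against a throwaway local server (to avoid certificates), and `countConnections` is a hypothetical helper that sends a few sequential requests through a given agent and reports how many TCP connections the server saw.

```javascript
var http = require('http');

// Hypothetical helper: send `count` sequential GET requests through `agent`
// to a temporary local server and report how many TCP connections it accepted.
function countConnections(agent, count, done) {
    var connections = 0;
    var server = http.createServer(function (req, res) {
        res.end('ok');
    });
    // Fires once per accepted TCP connection, not once per request.
    server.on('connection', function () {
        connections += 1;
    });

    server.listen(0, '127.0.0.1', function () {
        var port = server.address().port;
        var remaining = count;

        function next() {
            if (remaining === 0) {
                agent.destroy(); // close any idle kept-alive sockets
                server.close(function () {
                    done(connections);
                });
                return;
            }
            remaining -= 1;
            http.get({ host: '127.0.0.1', port: port, agent: agent }, function (res) {
                res.resume(); // drain the body so 'end' fires
                res.on('end', function () {
                    // Give the agent a tick to take the socket back before reusing it.
                    setImmediate(next);
                });
            });
        }
        next();
    });
}

countConnections(new http.Agent({ keepAlive: true }), 5, function (n) {
    console.log('keepAlive: true  ->', n, 'connection(s)');
});
countConnections(new http.Agent({ keepAlive: false }), 5, function (n) {
    console.log('keepAlive: false ->', n, 'connection(s)');
});
```

With keep-alive the requests should share a socket, while without it each request should open its own, which is the behavior the paragraph above blames for the errors.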

7/16/2021: There is an easier solution to this problem:

    var http = require('http');
    http.globalAgent.keepAlive = true;

    var https = require('https');
    https.globalAgent.keepAlive = true;

I verified in code that the global agents' keepAlive was set to false, despite the documentation saying that the default should be true.
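A quick way to repeat that check on your own installation is sketched below. `globalKeepAliveState` is a hypothetical helper name; it only reads the `keepAlive` property of the two global agents, before and after applying the change above. The first line of output may differ between Node versions, so no particular value is claimed here.

```javascript
var http = require('http');
var https = require('https');

// Hypothetical helper: report the current keepAlive flag of both global agents.
function globalKeepAliveState() {
    return {
        http: http.globalAgent.keepAlive,
        https: https.globalAgent.keepAlive
    };
}

// State as shipped by the running Node version.
console.log(process.version, 'default:', globalKeepAliveState());

// Apply the change from the answer, then read the flags back.
http.globalAgent.keepAlive = true;
https.globalAgent.keepAlive = true;

console.log(process.version, 'after:  ', globalKeepAliveState());
```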

answered Sep 24 '22 10:09 by Alex W