PhantomJS using too many threads

I wrote a PhantomJS app to crawl a site I built and check that a JavaScript file is included. The setup is similar to Google's, where a small inline snippet loads another JS file. The app looks for that other JS file, which is why I used PhantomJS.

What's the expected result?

The console output should run through a long list of URLs and report whether the script is loaded on each page.

What's really happening?

The console output reads as expected for about 50 requests and then starts spitting out this error:

2013-02-21T10:01:23 [FATAL] QEventDispatcherUNIXPrivate(): Can not continue without a thread pipe
QEventDispatcherUNIXPrivate(): Unable to create thread pipe: Too many open files

This is the block of code that opens a page and searches for the script include:

page.open(url, function (status) {
  console.log(YELLOW, url, status, CLEAR);
  var found = page.evaluate(function () {
    return document.querySelectorAll("script[src='***']").length > 0;
  });

  if (found) {
    console.log(GREEN, 'JavaScript found on', url, CLEAR);
  } else {
    console.log(RED, 'JavaScript not found on', url, CLEAR);
  }
  self.crawledURLs[url] = true;
  self.crawlURLs(self.getAllLinks(page), depth-1);
});

The crawledURLs object is just a map of the URLs I've already crawled. The crawlURLs function goes through the links returned by getAllLinks and calls page.open on every link that shares the base domain the crawler started on.
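
For context, those helpers might look roughly like this; this is only a sketch, not the actual code, and names such as Crawler, baseDomain, and visit are assumptions:

// Rough sketch of the helpers referenced above (assumed names, not the real code).
Crawler.prototype.getAllLinks = function (page) {
  // Collect every anchor's absolute href from the page context.
  return page.evaluate(function () {
    return Array.prototype.map.call(
      document.querySelectorAll('a[href]'),
      function (a) { return a.href; }
    );
  });
};

Crawler.prototype.crawlURLs = function (links, depth) {
  var self = this;
  if (depth <= 0) { return; }
  links.forEach(function (url) {
    // Skip already-crawled URLs and anything outside the starting domain.
    if (self.crawledURLs[url] || url.indexOf(self.baseDomain) !== 0) {
      return;
    }
    self.visit(url, depth); // runs the page.open() block shown above
  });
};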

Edit

I modified the last block of the code as follows, adding page.close(), but I still have the same issue.

if (!found) {
  console.log(RED, 'JavaScript not found on', url, CLEAR);
}
self.crawledURLs[url] = true;
var links = self.getAllLinks(page);
page.close();
self.crawlURLs(links, depth-1);
Asked Feb 21 '13 by Dave Long

2 Answers

From the documentation:

Due to some technical limitations, the web page object might not be completely garbage collected. This is often encountered when the same object is used over and over again.

The solution is to explicitly call close() of the web page object (i.e. page in many cases) at the right time.

Some included examples, such as follow.js, demonstrate multiple page objects with explicit close.
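
As a rough illustration of that pattern (a sketch under my own naming, not the asker's code or follow.js itself; checkPage is an assumed name): create a fresh page object per URL, pull out everything you need from it, then close it before recursing.

var webpage = require('webpage');

function checkPage(url, callback) {
  var page = webpage.create();            // one page object per URL
  page.open(url, function (status) {
    var found = page.evaluate(function () {
      return document.querySelectorAll("script[src='***']").length > 0;
    });
    var links = page.evaluate(function () {
      return Array.prototype.map.call(
        document.querySelectorAll('a[href]'),
        function (a) { return a.href; }
      );
    });
    page.close();                         // release the page (and its file handles) before moving on
    callback(status, found, links);
  });
}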

Answered by Ariya Hidayat


Open Files Limit.

Even when you close pages properly, you might still run into this error.

After scouring the internets, I discovered that you need to raise the limit on the number of files a single process is allowed to have open. In my case, I was generating PDFs with hundreds to thousands of pages.

There are different ways to adjust this setting depending on the system you are running, but here is what worked for me on an Ubuntu server:

Add the following to the end of /etc/security/limits.conf:

# Raise the maximum number of open files.
# Generating large PDFs hits the default ceiling (1024) quickly.
*    hard nofile 65535
*    soft nofile 65535
# The wildcard entries above do not apply to the root user,
# so repeat the limits for root explicitly.
root hard nofile 65535
root soft nofile 65535
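Note that limits.conf is applied at login (via PAM), so start a new session before re-running PhantomJS and verify the new values with ulimit -n (soft limit) and ulimit -Hn (hard limit).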

A good reference for the ulimit command can be found here.

I hope that puts some people on the right track.

Answered by Joshua Pinter