
Node.js scraping with chrome-remote-interface

I have been trying to scrape a website protected by Distil Networks, where using Selenium (with Python) always fails.

I did a few searches, and my conclusion is that the site can detect that you are using Selenium through some sort of JavaScript. I then took a look at chrome-remote-interface, which looks like the thing I want, but then I got stuck.

What I would like to do is automate the following steps:

  1. Open a Chrome instance
  2. Navigate to a page
  3. Run some javascript
  4. Collect data and save to file
  5. Repeat steps 2 - 4

I know that I can open an instance of Chrome for debugging with:

google-chrome --remote-debugging-port=9222

And I can open a console in Node with:

chrome-remote-interface -t 127.0.0.1 -p 9222 inspect -r

I can also run simple commands like:

Page.navigate({url:"https://google.com"})
Runtime.evaluate({expression:"1+1"})

But I can't access the DOM directly from Node.js the way I can in the Chrome Developer Tools console. Basically, I want to run scripts from Node just as I would in the Chrome Developer Tools console.

Also, there is not enough documentation on using chrome-remote-interface for scraping. Are there any good links for that?

Gabriel Koo asked May 04 '17 16:05



2 Answers

I know this was asked two years ago, but let me answer it here for documentation purposes.

-- Tools of the trade --
I tried the same technique as you did (using the remote debugger for scraping), but instead of Python I used Node.js because of its asynchronous nature, which makes it easier to work with the WebSockets that the remote debugger relies on.

-- Runtime.evaluate --
One thing I noticed is that Runtime.evaluate isn't a valid option for retrieving data if your expression involves asynchronous callbacks, because it returns the result of the calling function and not of the callback function. You have to stick with synchronous expressions.
Example:

Array.from(document.getElementsByTagName('tr'))
    .map((e)=>e.children[2].innerHTML)
    .filter((e)=>e.length>0)
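
If the asynchronous work can be expressed as a promise rather than a plain callback, the protocol's Runtime.evaluate also accepts an awaitPromise flag. The following is a minimal sketch, assuming a Chrome version that supports that flag; evaluateAsync is a hypothetical helper name, and Runtime is the Runtime domain of an already connected chrome-remote-interface client.

```javascript
// Hypothetical helper: evaluate an expression that returns a promise and
// hand back the value the promise resolves to.
async function evaluateAsync(Runtime, expression) {
    const { result, exceptionDetails } = await Runtime.evaluate({
        expression,
        awaitPromise: true, // ask Chrome to wait for the promise to settle
    });
    if (exceptionDetails) {
        // The expression threw or the promise was rejected in the page.
        throw new Error(exceptionDetails.text);
    }
    return result.value;
}
```

For instance, evaluateAsync(Runtime, "new Promise(r => setTimeout(() => r(document.title), 1000))") should resolve with the page title after about a second. Callback-style code would first have to be wrapped in a promise inside the expression.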

Another thing is that when your expression returns an array, Runtime.evaluate only reports that the expression returned an array, not the array itself! (Infuriating, I know.) I got around it by simply encoding the arrays as JSON strings in the page context, then decoding them back into objects when they arrive in Node.js. For example, the above expression would need to be:

JSON.stringify(
    Array.from(document.getElementsByTagName('tr'))
        .map((e)=>e.children[2].innerHTML)
        .filter((e)=>e.length>0)
)
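
On the Node.js side, the decoding step might look roughly like this. It is only a sketch: evaluateJson is a hypothetical helper name, and Runtime is assumed to be the Runtime domain of a connected chrome-remote-interface client. The protocol also has a returnByValue option on Runtime.evaluate that may make the JSON round trip unnecessary for serializable values.

```javascript
// Hypothetical helper: wrap an expression in JSON.stringify inside the page,
// then decode the resulting string back into an object in Node.js.
async function evaluateJson(Runtime, expression) {
    const { result, exceptionDetails } = await Runtime.evaluate({
        expression: `JSON.stringify(${expression})`,
    });
    if (exceptionDetails) {
        // The expression threw inside the page context.
        throw new Error(exceptionDetails.text);
    }
    // result.value is the JSON string produced in the page.
    return JSON.parse(result.value);
}
```

With this in place, a call such as evaluateJson(Runtime, "Array.from(document.getElementsByTagName('tr')).map(e => e.children[2].innerHTML)") would resolve to a regular JavaScript array.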

-- Navigation --
When you trigger a page load with "Page.navigate", ".click()", ".submit()", "window.location.href=..." or any other way, it is important to know when the next page has completely loaded before sending more instructions with Runtime.evaluate. I did this by asking the debugger to send me the page loading events (look for the Page.enable method in the documentation) and then waiting for the "Page.loadEventFired" event before sending more expressions.
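
A minimal sketch of that waiting step with chrome-remote-interface could look like the following. navigateAndWait is a hypothetical helper name and client is assumed to be an already connected client; the sketch relies on the library returning a one-shot promise when an event method such as Page.loadEventFired is called without a callback.

```javascript
// Hypothetical helper: navigate and resolve only once the load event fires.
async function navigateAndWait(client, url) {
    const { Page } = client;
    // Ask the debugger to start sending Page events.
    await Page.enable();
    // Register the one-shot listener before navigating so the event
    // cannot slip by between the two calls.
    const loaded = Page.loadEventFired();
    await Page.navigate({ url });
    await loaded;
}
```

After await navigateAndWait(client, 'https://google.com') returns, it should be safe to send the next Runtime.evaluate. Pages that keep loading content after the load event would need an additional, page-specific check.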

Silas M answered Oct 13 '22 03:10



JavaScript expressions evaluated by Runtime.evaluate are executed within the page context just like what happens in the DevTools console.

You can interact with the DOM using the DOM domain, e.g., DOM.getDocument, DOM.querySelector, etc.
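
As a rough illustration of the DOM domain, the sketch below fetches the outer HTML of the first element matching a selector. outerHtmlOf is a hypothetical helper name and client is assumed to be a connected chrome-remote-interface client; treating a nodeId of 0 as "no match" is my understanding of the protocol rather than something stated here.

```javascript
// Hypothetical helper: return the outer HTML of the first match, or null.
async function outerHtmlOf(client, selector) {
    const { DOM } = client;
    // The root document node is the starting point for queries.
    const { root } = await DOM.getDocument();
    const { nodeId } = await DOM.querySelector({
        nodeId: root.nodeId,
        selector,
    });
    if (!nodeId) {
        return null; // assumption: nodeId 0 means nothing matched
    }
    const { outerHTML } = await DOM.getOuterHTML({ nodeId });
    return outerHTML;
}
```

For quick extraction of text or attributes, evaluating a JavaScript expression with Runtime.evaluate is often shorter; the DOM domain is more useful when you need node handles for further protocol calls.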

Also remember that chrome-remote-interface is mainly a library, meaning that it allows you to write your own Node.js applications; chrome-remote-interface inspect is just a utility.
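
Used as a library, the navigate / evaluate / save loop from the question might be sketched as follows. The names scrape and main, the URL list, the expression and the out.json path are all placeholders, and the sketch assumes Chrome was started with --remote-debugging-port=9222 and that the evaluated expression returns a JSON-serializable value.

```javascript
const fs = require('fs');

// Hypothetical helper: visit each URL, evaluate `expression` in the page,
// and write all collected values to `outFile` as JSON.
async function scrape(client, urls, expression, outFile) {
    const { Page, Runtime } = client;
    await Page.enable();
    const rows = [];
    for (const url of urls) {
        // Listen for the load event before navigating.
        const loaded = Page.loadEventFired();
        await Page.navigate({ url });
        await loaded;
        const { result } = await Runtime.evaluate({
            expression,
            returnByValue: true, // ask for the value itself, not a handle
        });
        rows.push({ url, data: result.value });
    }
    fs.writeFileSync(outFile, JSON.stringify(rows, null, 2));
    return rows;
}

async function main() {
    // Loaded lazily so that scrape() can be exercised without the package.
    const CDP = require('chrome-remote-interface');
    const client = await CDP({ host: '127.0.0.1', port: 9222 });
    try {
        await scrape(client, ['https://google.com'], 'document.title', 'out.json');
    } finally {
        await client.close();
    }
}

// Uncomment to run against a live Chrome instance:
// main().catch(console.error);
```

Keeping the protocol calls in a function that takes the client as a parameter makes it easy to reuse one Chrome instance across many pages.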

There are several places where you can get help:

  • open an issue on chrome-remote-interface;
  • the chrome-remote-interface wiki;
  • the Chrome DevTools Protocol Viewer;
  • the Chrome Debugging Protocol Google Group.

If you ask something more specific I'd be happy to try to help you with that.

Finally you may want to take a look at automated-chrome-profiling, which I think is structurally similar to what you're trying to achieve.

cYrus answered Oct 13 '22 01:10
