 

How to extract a JavaScript object from crawled JavaScript code?

TL;DR

I want a parseParameter function that extracts the JSON-like data from crawled JavaScript code, as in the following snippet. someCrawledJSCode is the crawled JavaScript code.

const data = parseParameter(someCrawledJSCode);
console.log(data);  // data1: {...}

Problem

I'm crawling some JavaScript code with puppeteer and I want to extract a JSON object from it, but I don't know how to parse the given JavaScript code.

Crawled JavaScript Code Example:

const somecode = 'somevalue';
arr.push({
  data1: {
    prices: [{
      prop1: 'hi',
      prop2: 'hello',
    },
    {
      prop1: 'foo',
      prop2: 'bar',
    }]
  }
});

In this code, I want to get the prices array (or the whole data1 object).

What I did

I tried parsing the code as JSON, but that doesn't work. So I searched for parsing tools and found Esprima, but I couldn't figure out how it would help solve this problem.

asked Mar 05 '23 by Wonjun Kim


2 Answers

Short answer: Don't (re)build a parser in Node.js, use the browser instead

I strongly advise against evaluating or parsing crawled data in Node.js if you are using puppeteer for crawling anyway. With puppeteer you already have a browser with a great sandbox for JavaScript code running in another process. Why give up that isolation and "rebuild" a parser in your Node.js script? If the parsing inside your Node.js process breaks, your whole crawler fails. In the worst case, you might even expose your machine to serious risks when you try to run untrusted code inside your main thread.

Instead, try to do as much parsing as possible inside the context of the page. You can even do an evil eval call there. The worst that could happen? Your browser hangs or crashes.

Example

Imagine the following HTML page (very much simplified). You are trying to read the text that is pushed into an array. The only information you have is that the pushed object carries an additional property id set to target-data.

<html>
<body>
  <!-- ... -->
  <script>
    var arr = [];
    // some complex code...
    arr.push({
      id: 'not-interesting-data',
      data: 'some data you do not want to crawl',
    });
    // more complex code here...
    arr.push({
      id: 'target-data',
      data: 'THIS IS THE DATA YOU WANT TO CRAWL', // <---- You want to get this text
    });
    // more code...
    arr.push({
      id: 'some-irrelevant-data',
      data: 'again, you do not want to crawl this',
    });
  </script>
  <!-- ... -->
</body>
</html>

Bad code

Here is a simple example of what your code might look like right now:

await page.goto('http://...');
const crawledJsCode = await page.evaluate(() => document.querySelector('script').innerHTML);

In this example, the script extracts the JavaScript code from the page. Now we have the JavaScript code from the page and we "only" need to parse it, right? Well, this is the wrong approach. Don't try to rebuild a parser inside Node.js. Just use the browser. There are basically two approaches you can take to do that in your case.

  1. Inject proxy functions into the page and fake some built-in functions (recommended)
  2. Parse the data on the client-side (!) by using JSON.parse, a regex or eval (eval only if really necessary)

Option 1: Inject proxy functions into the page

In this approach you are replacing native browser functions with your own "fake functions". Example:

const originalPush = Array.prototype.push;
Array.prototype.push = function (item) {
    if (item && item.id === 'target-data') {
        const data = item.data; // This is the data we are trying to crawl
        window.exposedDataFoundFunction(data); // send this data back to Node.js
    }
    return originalPush.apply(this, arguments); // keep the original behavior (and return value)
};

This code replaces the original Array.prototype.push function with our own function. Everything works as normal, but when an item with our target id is pushed into an array, a special condition is triggered. To inject this function into the page, you can use page.evaluateOnNewDocument. To receive the data in Node.js, you have to expose a function to the browser via page.exposeFunction:

// called via window.exposedDataFoundFunction from within the fake Array.prototype.push function
await page.exposeFunction('exposedDataFoundFunction', data => {
    // handle the data in Node.js
});
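
Putting both pieces together, a minimal sketch of the full wiring could look like this (the URL is a placeholder; the fake push function is the one from above):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // receives the crawled data from the page context
  await page.exposeFunction('exposedDataFoundFunction', data => {
    console.log('Found:', data); // handle the data in Node.js
  });

  // install the fake Array.prototype.push before any page script runs
  await page.evaluateOnNewDocument(() => {
    const originalPush = Array.prototype.push;
    Array.prototype.push = function (item) {
      if (item && item.id === 'target-data') {
        window.exposedDataFoundFunction(item.data);
      }
      return originalPush.apply(this, arguments);
    };
  });

  await page.goto('http://example.com'); // placeholder URL
  await browser.close();
})();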

Now it doesn't really matter how complex the code of the page is, whether it runs inside some asynchronous handler, or whether the page changes the surrounding code. As long as the page pushes the target data into an array, we will get it.

You can use this approach for a lot of crawling tasks: check how the data is processed and replace the low-level functions that process it with your own proxy versions, as in the variation sketched below.
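
For example, if the page received its data via JSON.parse instead of Array.prototype.push, a proxy around JSON.parse would work the same way (this variation is a hypothetical illustration, not part of the page from the example above):

const originalParse = JSON.parse;
JSON.parse = function (...args) {
  const result = originalParse.apply(this, args);
  // inspect every value the page parses from JSON
  if (result && result.id === 'target-data') {
    window.exposedDataFoundFunction(result.data);
  }
  return result;
};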

Option 2: Parse the data

Let's assume the first approach does not work for some reason. The data is in some script tag, but you are not able to get it by using fake functions.

Then you should parse the data, but not inside your Node.js environment. Do it inside the page context. You could run a regular expression or use JSON.parse, but do it before returning the data to Node.js. This approach has the benefit that if the code crashes for some reason, it takes down only the browser, not your main script.

To give some example code: instead of running the code from the original "bad code" sample, we change it to this:

const crawledJsCode = await page.evaluate(() => {
    const code = document.querySelector('script').innerHTML; // instead of returning this
    const match = code.match(/some tricky regex which extracts the data you want/); // we run our regex in the browser
    return match; // and only return the results
});

This will only return the parts of the code we need, which can then be further processed from within Node.js.
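
Applied to the code from the question, a sketch of this idea could even parse the extracted object literal in the page context before returning it (the regex is an assumption based on the arr.push structure from the question):

const prices = await page.evaluate(() => {
  const code = document.querySelector('script').innerHTML;
  // hypothetical regex: capture the object literal passed to arr.push
  const match = code.match(/arr\.push\((\{[\s\S]*?\})\)/);
  if (!match) return null;
  // the literal has unquoted keys, so JSON.parse cannot handle it;
  // eval it inside the page sandbox (the "only if really necessary" case)
  const data = eval(`(${match[1]})`);
  return data.data1 ? data.data1.prices : null;
});

If the eval crashes here, it crashes the browser, not the Node.js process.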


Independent of which approach you choose, both ways are much better and more secure than running unknown code inside your main thread. If you absolutely have to process the data in your Node.js environment, use a regular expression for it, as shown in the answer from trincot. You should never use eval to run untrusted code.
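
If you do have to extract the data in Node.js, here is a minimal sketch of that regex idea (the pattern is an assumption based on the question's arr.push structure, and json5 is a swapped-in tolerant parser that accepts unquoted keys, single quotes and trailing commas, so no eval is needed):

const JSON5 = require('json5'); // npm install json5

// hypothetical regex: grab the object literal passed to arr.push
const match = crawledJsCode.match(/arr\.push\((\{[\s\S]*?\})\)/);
if (match) {
  const data = JSON5.parse(match[1]);
  console.log(data.data1.prices);
}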

answered Apr 14 '23 by Thomas Dondorf


I think using an AST generator like Esprima or other AST tools is the easiest way to read and work with source code.

Honestly, once you figure out how to run Esprima and generate an "Abstract Syntax Tree" (AST) from the source code, you will find it surprisingly easy to read the generated tree structure that represents the code you just parsed, and to convert the information into anything you want.

It may seem daunting at first, but honestly, it is not. AST tools like Esprima were made exactly for purposes like this, to make the job easy.

AST tools were born from years' worth of research into how to read and manipulate source code, so I highly recommend them.

Give that a try!

To help you understand what various ASTs look like, have a look at https://astexplorer.net. It is super useful for seeing what the tree structures produced by various tools look like.

Oh, one last thing! To traverse an AST, you can use something like https://github.com/estools/estraverse. It will make life easy.
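
For the code from the question, a minimal sketch of that combination could look like this (the toValue helper is a hypothetical converter for simple literal nodes, not part of either library):

const esprima = require('esprima');       // npm install esprima
const estraverse = require('estraverse'); // npm install estraverse

// hypothetical helper: turn simple literal AST nodes back into plain values
function toValue(node) {
  switch (node.type) {
    case 'Literal':
      return node.value;
    case 'ArrayExpression':
      return node.elements.map(toValue);
    case 'ObjectExpression': {
      const obj = {};
      for (const prop of node.properties) {
        obj[prop.key.name || prop.key.value] = toValue(prop.value);
      }
      return obj;
    }
    default:
      throw new Error(`Unsupported node type: ${node.type}`);
  }
}

function parseParameter(someCrawledJSCode) {
  const ast = esprima.parseScript(someCrawledJSCode);
  let result = null;
  estraverse.traverse(ast, {
    enter(node) {
      // look for the object property named "data1"
      if (node.type === 'Property' && node.key.name === 'data1') {
        result = { data1: toValue(node.value) };
      }
    },
  });
  return result;
}

With the crawled code from the question, parseParameter returns the data1 object, including the prices array.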

answered Apr 14 '23 by trusktr