Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)

What are the best options for performing Web Scraping of a not currently open tab from within a Google Chrome Extension with JavaScript and whatever more technologies are available. Other JavaScript-libraries are also accepted.

The important thing is to mask the scraping to behave like a normal web-request. No indications of AJAX or XMLHttpRequest, like X-Requested-With: XMLHttpRequest or Origin.

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.

Are there any hooks in any WebKit/Chrome-specific API:s that can be used to make a normal web-request and get the results for manipulation?

var pageContent = getPageContent(url); // TODO: Implement var items = $(pageContent).find('.item'); // Display items with further selections 

Bonus-points to make this work from a local file on disk, for initial debugging. But if that is the only point is stopping a solution, then disregard the bonus-points.

like image 217
Seb Nilsson Avatar asked Jun 28 '11 14:06

Seb Nilsson


People also ask

Can you use JavaScript for web scraping?

Whether it's a web or mobile application, JavaScript now has the right tools. This article will explain how the vibrant ecosystem of NodeJS allows you to efficiently scrape the web to meet most of your requirements.


1 Answers

Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document support (synchronously checking response === null on an object URL created from a text/html blob).

Use the Chrome WebRequest API to hide X-Requested-With, etc. headers.

like image 101
Eli Grey Avatar answered Oct 06 '22 09:10

Eli Grey