
Browser-based client-side scraping

I wonder if it's possible to scrape an external (cross-domain) page through the user's IP?

For a shopping comparison site, I need to scrape pages of an e-commerce site, but several requests from the server would get me banned. So I'm looking for ways to do client-side scraping: request the pages from the user's IP and send them to the server for processing.

eozzy asked Jul 23 '15 07:07




3 Answers

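No, you won't be able to use your visitors' browsers to scrape content from other websites with JavaScript, because of a security measure called the same-origin policy: a script on your page cannot read responses from another origin unless that site explicitly allows it.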

There should be no way to circumvent this policy and that's for a good reason. Imagine you could instruct the browser of your visitors to do anything on any website. That's not something you want to happen automatically.
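
To make the rule concrete, here is a minimal sketch of the comparison the same-origin policy is based on: two URLs share an origin only if scheme, host and port all match. It is written in Python purely for illustration (browsers do this check internally), and the host names `compare.example` and `shop.example` are made up.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Default ports, so "https://shop.example" and "https://shop.example:443" compare equal.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url: str) -> Tuple[str, str, Optional[int]]:
    """Return the (scheme, host, port) triple that identifies a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fall back to the scheme's default port when none is given explicitly.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(a: str, b: str) -> bool:
    """True only if both URLs agree on scheme, host and port."""
    return origin(a) == origin(b)


# Hypothetical hosts: the comparison site's own pages vs. the shop to be scraped.
print(same_origin("https://compare.example/a", "https://compare.example/b"))
print(same_origin("https://compare.example/", "https://shop.example/item/1"))
```

A script served from the first origin may read responses from its own pages, but not from the shop's pages, which is exactly the situation in the question.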

However, you could create a browser extension to do that. JavaScript browser extensions can be equipped with more privileges than regular JavaScript.

Adobe Flash has similar security features, but I guess you could use Java (not JavaScript) to create a web scraper that uses your users' IP addresses. Then again, you probably don't want to do that, as Java plugins are considered insecure (and slow to load!) and not all users will even have them installed.

So now back to your problem:

I need to scrape pages of an e-com site but several requests from the server would get me banned.

If the owner of that website doesn't want you to use their service in that way, you probably shouldn't do it. Otherwise you risk legal consequences (look here for details).

If you are on the "dark side of the law" and don't care whether that's illegal or not, you could use something like http://luminati.io/ to use IP addresses of real people.

Johann Bauer answered Oct 17 '22 01:10


Basically, browsers are designed to prevent exactly this…

The solution everyone thinks about first:

jQuery/JavaScript: accessing contents of an iframe

But this will not work in most cases with "recent" browsers (less than 10 years old).

Alternatives are:

  • Use the site's official APIs (if any).
  • Check whether the server provides a JSONP service (good luck).
  • Run your script on the same domain, for instance through cross-site scripting (if possible; not very ethical).
  • Use a trusted relay or proxy (but this will still use your own IP).
  • Pretend to be a Google web crawler (why not, but it is not very reliable and comes with no guarantees).
  • Use a hack to set up the relay/proxy on the client itself; Java or possibly Flash come to mind (this will not work on most mobile devices, is slow, and Flash has its own cross-site limitations too).
  • Ask Google or another search engine for the content (you might then have a problem with the search engine if you abuse it…).
  • Do the job yourself and cache the answers, to reduce the load on their server and decrease the risk of being banned.
  • Index the site yourself (with your own web crawler), then use your own index (how well this works depends on how often the source changes): http://www.quora.com/How-can-I-build-a-web-crawler-from-scratch
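
For the caching suggestion in the list above, a rough sketch could look like the following. It is only one way to do it, not a definitive implementation: `PageCache`, its `fetch` and `clock` parameters, and the one-hour default lifetime are all names and values made up for this example. The download function is passed in, so the cache itself makes no assumption about how pages are retrieved.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Keep fetched pages for ttl_seconds so repeat lookups don't hit the remote site."""

    def __init__(
        self,
        fetch: Callable[[str], str],          # hypothetical: any function mapping URL -> page body
        ttl_seconds: float = 3600.0,          # assumed lifetime; tune to how often prices change
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        entry = self._entries.get(url)
        # Serve the stored copy while it is still fresh.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise fetch once and remember the result with its timestamp.
        body = self._fetch(url)
        self._entries[url] = (now, body)
        return body
```

With this in place, every visitor asking for the same product within the lifetime triggers at most one request to the shop, e.g. `cache = PageCache(my_download_function)` followed by `cache.get(product_url)`, where `my_download_function` stands for whatever HTTP client you already use.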

[EDIT]

One more solution I can think of is going through a YQL service; this is a bit like using a search engine or a public proxy as a bridge to retrieve the information for you. Here is a simple example of doing so. In short, you get cross-domain GET requests.

Flavien Volken answered Oct 17 '22 03:10


Have a look at http://import.io; they provide a couple of crawlers, connectors and extractors. I'm not quite sure how they get around bans, but they do somehow (we have been using their system for over a year now with no problems).

Jan answered Oct 17 '22 03:10