
Is it possible to write a web crawler in JavaScript?

I want to crawl a page, check it for hyperlinks, follow those hyperlinks, and capture data from the pages they lead to.

Ashwin Mendon asked Jun 18 '12 13:06





2 Answers

Generally, browser JavaScript can only crawl within the domain of its origin, because fetching pages would be done via Ajax, which is restricted by the Same-Origin Policy.

If the page running the crawler script is on www.example.com, then that script can crawl all the pages on www.example.com, but not the pages of any other origin (unless some edge case applies, e.g., the Access-Control-Allow-Origin header is set for pages on the other server).
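To make the restriction concrete, here is a minimal sketch of how a crawler script might decide which links it is allowed to fetch. The function names `isSameOrigin` and `crawlableLinks` are hypothetical, chosen for illustration; the check relies on the standard `URL` class, and it ignores the CORS edge case mentioned above.

```javascript
// Returns true when linkHref, resolved against pageUrl, has the same origin
// (scheme + host + port) as pageUrl.
function isSameOrigin(pageUrl, linkHref) {
  try {
    return new URL(linkHref, pageUrl).origin === new URL(pageUrl).origin;
  } catch {
    return false; // the href or page URL could not be parsed
  }
}

// Keeps only the links a same-origin browser script could fetch,
// returned as absolute URLs.
function crawlableLinks(pageUrl, hrefs) {
  return hrefs
    .filter((href) => isSameOrigin(pageUrl, href))
    .map((href) => new URL(href, pageUrl).href);
}
```

Note that the scheme is part of the origin, so under this check a page served over `http:` would not treat an `https:` link to the same host as crawlable.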

If you really want to write a fully featured crawler in browser JS, you could write a browser extension: for example, Chrome extensions are packaged web applications that run with special permissions, including cross-origin Ajax. The difficulty with this approach is that you'll have to write multiple versions of the crawler if you want to support multiple browsers. (If the crawler is just for personal use, that's probably not an issue.)

apsillers answered Oct 12 '22 11:10



If you use server-side JavaScript, it is possible. You should take a look at Node.js.

An example of a crawler can be found at the link below:

http://www.colourcoding.net/blog/archive/2010/11/20/a-node.js-web-spider.aspx
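
For a rough idea of what such a crawler involves, here is a minimal sketch. It is not the code from the linked post. It assumes Node.js 18 or later (for the built-in `fetch`), and the names `extractLinks`, `fetchPage`, and `crawl`, as well as the `maxPages` and `fetcher` options, are made up for illustration. Links are pulled out with a naive regular expression to keep the example short; a real crawler would more likely use an HTML parser.

```javascript
// Naive link extraction with a regular expression (an assumption made for
// brevity). Resolves each href against baseUrl and drops the #fragment.
function extractLinks(html, baseUrl) {
  const links = [];
  const pattern = /<a\s[^>]*href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl);
      url.hash = '';
      if (url.protocol === 'http:' || url.protocol === 'https:') {
        links.push(url.href);
      }
    } catch {
      // Skip hrefs that cannot be parsed as URLs.
    }
  }
  return links;
}

// Default fetcher: uses the global fetch available in Node.js 18+.
async function fetchPage(url) {
  const response = await fetch(url);
  return response.text();
}

// Breadth-first crawl starting at startUrl, visiting at most maxPages pages.
// Resolves to a Map from each visited URL to the links found on that page.
async function crawl(startUrl, { maxPages = 20, fetcher = fetchPage } = {}) {
  const queue = [new URL(startUrl).href];
  const visited = new Map();
  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    let html;
    try {
      html = await fetcher(url);
    } catch {
      visited.set(url, []); // fetch failed: record the page with no links
      continue;
    }
    // This is also the place to capture whatever data you need from html.
    const links = extractLinks(html, url);
    visited.set(url, links);
    for (const link of links) {
      if (!visited.has(link)) queue.push(link);
    }
  }
  return visited;
}
```

It could be run with something like `crawl('http://www.example.com/').then((pages) => console.log([...pages.keys()]))`. Because the code runs on the server rather than in a browser, the Same-Origin Policy does not apply, so it can follow links to other sites; the `maxPages` cap is there to keep it from wandering indefinitely.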

Bogdan Emil Mariesan answered Oct 12 '22 12:10
