 

How to crawl all the internal URLs of a website using a crawler?

I want to use a crawler in Node.js to crawl all the links in a website (internal links) and get the title of each page. I saw the crawler plugin on npm; if I check the docs, there is the following example:

var Crawler = require("crawler");

var c = new Crawler({
   maxConnections : 10,
   // This will be called for each crawled page
   callback : function (error, res, done) {
       if(error){
           console.log(error);
       }else{
           var $ = res.$;
           // $ is Cheerio by default
           //a lean implementation of core jQuery designed specifically for the server
           console.log($("title").text());
       }
       done();
   }
});

// Queue just one URL, with default callback
c.queue('http://balenol.com');

But what I really want is to crawl all the internal URLs in the site. Is that built into this plugin, or does it need to be written separately? I don't see any option in the plugin to visit all the links in a site. Is this possible?

asked May 03 '18 by Alexander Solonik



2 Answers

The following snippet recursively crawls all URLs found on every page it visits.

const Crawler = require("crawler");

let obsolete = []; // URLs that have already been queued

let c = new Crawler();

function crawlAllUrls(url) {
    console.log(`Crawling ${url}`);
    c.queue({
        uri: url,
        callback: function (err, res, done) {
            if (err) {
                console.error(err);
                return done();
            }
            let $ = res.$;
            try {
                let urls = $("a");
                Object.keys(urls).forEach((item) => {
                    if (urls[item].type === 'tag') {
                        let href = urls[item].attribs.href;
                        if (href && !obsolete.includes(href)) {
                            href = href.trim();
                            obsolete.push(href);
                            // Slow down the requests so we don't hammer the server
                            setTimeout(function () {
                                href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`) // The latter might need extra code to test that it's the same site and that `url` is a bare domain with no path
                            }, 5000);
                        }
                    }
                });
            } catch (e) {
                console.error(`Encountered an error crawling ${url}. Aborting crawl.`);
            }
            done();
        }
    });
}

crawlAllUrls('https://github.com/evyatarmeged/');
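
As the comment in the snippet notes, the `${url}${href}` concatenation can produce malformed addresses for relative links (a missing or doubled slash, for example). A minimal sketch of safer resolution using Node's built-in WHATWG URL class; the helper name `resolveUrl` is my own, not part of the original answer:

// Sketch (assumed helper, not from the answer): resolve a possibly
// relative href against the page it was found on.
function resolveUrl(href, base) {
    try {
        // new URL(href, base) handles absolute hrefs, root-relative
        // paths ("/about") and relative paths ("../page") alike.
        return new URL(href, base).href;
    } catch (e) {
        return null; // not a valid URL, e.g. "javascript:void(0)"
    }
}

// Example: resolveUrl('/docs', 'https://example.com/blog/')
// => 'https://example.com/docs'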
answered Oct 17 '22 by Evyatar Meged


In the above code, just change the following line to crawl only the internal links of a website:

from

href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`)

to

href.startsWith(url) ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`)
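
Note that `href.startsWith(url)` only keeps absolute links that share the exact string prefix, so it can reject internal links served over a different scheme (http vs https). A stricter check, sketched here with Node's built-in URL class; the helper name `isInternal` is an assumption, not from the answer:

// Sketch (assumed helper): treat a link as internal when its hostname
// matches the hostname of the site being crawled.
function isInternal(href, base) {
    try {
        return new URL(href, base).hostname === new URL(base).hostname;
    } catch (e) {
        return false; // unparsable href, treat as external
    }
}

// isInternal('https://example.com/about', 'https://example.com/') => true
// isInternal('https://other.com/page',    'https://example.com/') => false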
answered Oct 17 '22 by syedshabbir