 

How to crawl all the internal URLs of a website using a crawler?

I want to use a crawler in Node.js to crawl all the links in a website (internal links) and get the title of each page. I saw the crawler plugin on npm; if I check the docs, there is the following example:

var Crawler = require("crawler");

var c = new Crawler({
   maxConnections : 10,
   // This will be called for each crawled page
   callback : function (error, res, done) {
       if(error){
           console.log(error);
       }else{
           var $ = res.$;
           // $ is Cheerio by default
           //a lean implementation of core jQuery designed specifically for the server
           console.log($("title").text());
       }
       done();
   }
});

// Queue just one URL, with default callback
c.queue('http://balenol.com');

But what I really want is to crawl all the internal URLs in the site. Is that built into this plugin, or does it need to be written separately? I don't see any option in the plugin to visit all the links in a site. Is this possible?

asked May 03 '18 by Alexander Solonik



2 Answers

The following snippet recursively crawls all URLs found on every page it visits.

const Crawler = require("crawler");

let obsolete = []; // URLs that have already been queued

let c = new Crawler();

function crawlAllUrls(url) {
    console.log(`Crawling ${url}`);
    c.queue({
        uri: url,
        callback: function (err, res, done) {
            if (err) {
                console.error(err);
                return done();
            }
            let $ = res.$;
            try {
                let urls = $("a");
                Object.keys(urls).forEach((item) => {
                    if (urls[item].type === 'tag') {
                        let href = urls[item].attribs.href;
                        if (href && !obsolete.includes(href)) {
                            href = href.trim();
                            obsolete.push(href);
                            // Slow down the requests so we don't hammer the server
                            setTimeout(function () {
                                href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`) // The latter might need extra code to test that it's the same site and that `url` is a bare domain with no path
                            }, 5000);
                        }
                    }
                });
            } catch (e) {
                console.error(`Encountered an error crawling ${url}. Aborting crawl.`);
            }
            done();
        }
    });
}

crawlAllUrls('https://github.com/evyatarmeged/');
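
As the comment in the snippet notes, the `${url}${href}` concatenation can produce malformed addresses for relative links (a missing or doubled slash, for example). A minimal sketch of safer resolution using Node's built-in WHATWG URL class; the helper name `resolveUrl` is my own, not part of the original answer:

// Sketch (assumed helper, not from the answer): resolve a possibly
// relative href against the page it was found on.
function resolveUrl(href, base) {
    try {
        // new URL(href, base) handles absolute hrefs, root-relative
        // paths ("/about") and relative paths ("../page") alike.
        return new URL(href, base).href;
    } catch (e) {
        return null; // not a valid URL, e.g. "javascript:void(0)"
    }
}

// Example: resolveUrl('/docs', 'https://example.com/blog/')
// => 'https://example.com/docs'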
answered Oct 17 '22 by Evyatar Meged


In the above code, just change the following line to crawl only the internal links of a website:

from

href.startsWith('http') ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`)

to

href.startsWith(url) ? crawlAllUrls(href) : crawlAllUrls(`${url}${href}`)
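
Note that `href.startsWith(url)` only keeps absolute links that share the exact string prefix, so it can reject internal links served over a different scheme (http vs https). A stricter check, sketched here with Node's built-in URL class; the helper name `isInternal` is an assumption, not from the answer:

// Sketch (assumed helper): treat a link as internal when its hostname
// matches the hostname of the site being crawled.
function isInternal(href, base) {
    try {
        return new URL(href, base).hostname === new URL(base).hostname;
    } catch (e) {
        return false; // unparsable href, treat as external
    }
}

// isInternal('https://example.com/about', 'https://example.com/') => true
// isInternal('https://other.com/page',    'https://example.com/') => false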
answered Oct 17 '22 by syedshabbir