
How to detect web crawlers for SEO, using Express?

I've been searching for npm packages, but they all seem unmaintained and rely on outdated user-agent databases. Is there a reliable, up-to-date package that helps me detect crawlers (mostly from Google, Facebook, and so on, for SEO)? If there is no such package, can I write it myself, probably based on an up-to-date user-agent database?

To be clearer: I'm trying to build an isomorphic/universal React website. I want it to be indexed by search engines, and I want Facebook to be able to fetch its title and metadata. I don't want to pre-render on every normal request, though, because that would overload the server. The solution I'm considering is to pre-render only for requests from crawlers.

KwiZ asked Jan 07 '16 04:01


People also ask

How do I identify a web crawler?

There are two methods of verifying the IP: Some search engines provide IP lists or ranges. You can verify the crawler by matching its IP with the provided list. You can perform a DNS look up to connect the IP address to the domain name.

How do I identify a Google crawler?

You can identify Googlebot by IP address by matching the crawler's IP address against the list of Googlebot IP addresses. For other Google IP addresses from which your site may be accessed (for example, by user request or Apps Script), match the accessing IP address against the list of Google IP addresses.

How does Web crawler work in SEO?

How do web crawlers work? A web crawler works by discovering URLs and reviewing and categorizing web pages. Along the way, they find hyperlinks to other webpages and add them to the list of pages to crawl next. Web crawlers are smart and can determine the importance of each web page.

How do you detect search engine bots?

Verifying Googlebot: the only officially supported way to identify Googlebot is to run a reverse DNS lookup on the accessing IP address, then run a forward DNS lookup on the resulting hostname to verify that it points back to the accessing IP address and that the hostname is in either the googlebot.com or google.com domain.
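The reverse-then-forward DNS check described above could be sketched in Node roughly as follows. This is only an illustration: verifyGooglebot and the injectable resolver parameter are names made up for this example, and the two domain suffixes are the ones mentioned in the text. It uses Node's built-in dns.promises.reverse() and dns.promises.lookup().

```javascript
const dns = require('node:dns').promises;

// Domains named in the text above; the leading dot prevents matches such as
// "evilgooglebot.com".
const GOOGLE_SUFFIXES = ['.googlebot.com', '.google.com'];

// Hypothetical helper. `resolver` defaults to Node's dns.promises and is
// injectable (an assumption of this sketch) so the logic can run offline.
async function verifyGooglebot(ip, resolver = dns) {
  let hostnames;
  try {
    // Step 1: reverse lookup, IP address -> hostname(s).
    hostnames = await resolver.reverse(ip);
  } catch {
    return false; // no PTR record, or the lookup failed
  }

  for (const host of hostnames) {
    const name = host.toLowerCase();
    // Step 2: the hostname must belong to one of the expected domains.
    if (!GOOGLE_SUFFIXES.some((suffix) => name.endsWith(suffix))) continue;

    try {
      // Step 3: forward lookup, hostname -> addresses, must include the IP.
      const records = await resolver.lookup(name, { all: true });
      if (records.some((record) => record.address === ip)) return true;
    } catch {
      // Forward lookup failed for this hostname; try the next one.
    }
  }
  return false;
}
```

In an Express handler this might be called as await verifyGooglebot(req.ip). DNS lookups are slow compared with a user-agent check, so caching the result per IP is probably worthwhile. The address comparison is a plain string match, which may need normalizing for IPv6.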


1 Answer

I found the isbot package, which has a built-in isbot() function. It seems to me that the package is properly maintained and kept up to date.

USAGE:

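// Newer major versions of isbot may require a named import instead:
// const { isbot } = require('isbot');
const isBot = require('isbot');

...

isBot(req.get('user-agent')); // true if the User-Agent looks like a bot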

Package: https://www.npmjs.com/package/isbot
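To tie this back to the question (pre-render only for crawlers), here is a minimal sketch of an Express-style middleware that flags crawler requests. crawlerGate, simpleDetect and req.isCrawler are hypothetical names for this example, not part of Express or isbot. The detector is passed in, so isbot (or anything else) can be plugged in; the regex below is only a stand-in and is nowhere near a complete crawler list.

```javascript
// Hypothetical helper: wraps any user-agent detector in Express-style
// middleware and records the result on req.isCrawler.
function crawlerGate(detect) {
  return function (req, res, next) {
    // Node lower-cases header names; the header may be missing entirely.
    const userAgent = req.headers['user-agent'] || '';
    req.isCrawler = Boolean(detect(userAgent));
    next();
  };
}

// Stand-in detector for illustration only. In a real app, pass the isbot
// function here instead of maintaining a pattern by hand.
const simpleDetect = (userAgent) =>
  /googlebot|bingbot|facebookexternalhit|twitterbot/i.test(userAgent);

// Possible wiring (assumes an Express `app` and two hypothetical handlers,
// renderOnServer and sendClientShell):
//
//   app.use(crawlerGate(simpleDetect));
//   app.get('*', (req, res) =>
//     req.isCrawler ? renderOnServer(req, res) : sendClientShell(req, res));
```

Keep in mind that the User-Agent header is trivial to spoof, so this is fine for deciding whether to pre-render but should not be used to grant access to anything.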

NeNaD answered Oct 05 '22 23:10