
How can I scrape sites that require authentication using node.js?


I've come across many tutorials explaining how to scrape public websites that don't require authentication/login, using node.js.

Can somebody explain how to scrape sites that require login using node.js?

ekanna asked Jan 04 '12 11:01




2 Answers

Use Mikeal's Request library. You need to enable cookie support like this:

// Load the library, then make every call share one cookie jar
var request = require('request').defaults({jar: true});

First create an account on that site manually, then pass the username and password as parameters in the POST request to the site's login URL. The server responds with a cookie, which Request remembers, so later requests can access the pages that require you to be logged in.
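A minimal sketch of that flow, assuming the site uses a plain form login. The URLs and the form field names (username, password) are hypothetical, so check the real login form for the names it submits. The request module is passed in as an argument, which is only a choice made for this sketch so it can be tried with a stand-in.

```javascript
// Sketch only: both URLs and the form field names are hypothetical.
const LOGIN_URL = 'https://example.com/login';
const PRIVATE_URL = 'https://example.com/account';

// Log in, then fetch a page that requires the session cookie.
// `request` is the module returned by require('request').
function scrapePrivatePage(request, username, password, callback) {
  // jar: true makes this client store cookies and send them back.
  const session = request.defaults({ jar: true });

  session.post(
    { url: LOGIN_URL, form: { username: username, password: password } },
    function (loginError, loginResponse) {
      if (loginError) return callback(loginError);

      // The jar now holds whatever cookie the login response set,
      // so this GET is sent as the logged-in user.
      session.get(PRIVATE_URL, function (pageError, pageResponse, body) {
        if (pageError) return callback(pageError);
        callback(null, body);
      });
    }
  );
}
```

It could be called as scrapePrivatePage(require('request'), 'me', 'secret', function (err, html) { ... }). A rejected login often still comes back without an error, so it is worth checking the returned HTML for something only a logged-in user would see.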

Note: this approach doesn't work if something like reCAPTCHA is used on the login page.

alessioalex answered Sep 20 '22 19:09


I've been working with Node.js scrapers for more than 2 years now.

I can tell you that the best choice when dealing with logins and authentication is to NOT use direct requests.

That is because you waste time building requests by hand, and it is way slower.

Instead, use a high-level browser that you control via an API, like Puppeteer or NightmareJs.

I have a good starter and in-depth guide, How to start scraping with Puppeteer. I'm sure it will help!
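A minimal sketch of the browser-driven approach with Puppeteer, assuming a login page with an ordinary form. The URL and the CSS selectors are hypothetical, so inspect the real page to find the ones it uses. As a choice made for this sketch, the puppeteer module is passed in as an argument.

```javascript
// Sketch only: the URL and the CSS selectors below are hypothetical.
const LOGIN_URL = 'https://example.com/login';

// Log in through a real browser and return the HTML shown afterwards.
// `puppeteer` is the module returned by require('puppeteer').
async function loginAndGetHtml(puppeteer, username, password) {
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(LOGIN_URL);

    // Fill in the form the way a user would.
    await page.type('#username', username);
    await page.type('#password', password);

    // Start waiting before the click so the navigation is not missed.
    await Promise.all([
      page.waitForNavigation(),
      page.click('button[type="submit"]'),
    ]);

    // The page is now whatever the site shows after login.
    return await page.content();
  } finally {
    // Always close the browser, even if a step above throws.
    await browser.close();
  }
}
```

It could be called as loginAndGetHtml(require('puppeteer'), 'me', 'secret').then(console.log). Because the browser keeps its own cookies, further page.goto calls on the same page stay logged in without any manual cookie handling.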

Fabian answered Sep 19 '22 19:09