
How to scrape a website that requires login using Python and BeautifulSoup?

If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python using the beautifulsoup4 library? Below is what I do for websites that do not require a login.

    from bs4 import BeautifulSoup
    import urllib2

    url = urllib2.urlopen("http://www.python.org")
    content = url.read()
    soup = BeautifulSoup(content)

How should the code be changed to accommodate login? Assume that the website I want to scrape is a forum that requires login. An example is http://forum.arduino.cc/index.php

guagay_wk asked Apr 16 '14 07:04



1 Answer

You can use mechanize:

    import mechanize
    from bs4 import BeautifulSoup
    import urllib2
    import cookielib  # http.cookiejar in Python 3

    # Keep cookies so the login session persists across requests
    cj = cookielib.CookieJar()
    br = mechanize.Browser()
    br.set_cookiejar(cj)
    br.open("https://id.arduino.cc/auth/login/")

    # Select the first form on the page, fill it in and submit it
    br.select_form(nr=0)
    br.form['username'] = 'username'
    br.form['password'] = 'password.'
    br.submit()

    # HTML of the page returned after login
    print br.response().read()

Or use urllib2 directly; see the related question "Login to website using urllib2".
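For the urllib route, here is a minimal Python 3 sketch using only the standard library's urllib.request and http.cookiejar (the Python 3 counterparts of urllib2 and cookielib). It is an illustration under assumptions, not a tested recipe for any particular site: the helper names make_opener, build_login_request and fetch_soup_after_login are hypothetical, and the form field names 'username' and 'password' are taken from the mechanize example above and may differ on the real login form. Many sites also require hidden fields such as a CSRF token, which this sketch does not handle.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return (opener, jar); the opener stores cookies in the jar between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, username, password):
    """Build the POST request that submits the login form.

    Assumption: the form fields are called 'username' and 'password'.
    Inspect the site's login form for the real names and any hidden fields.
    """
    form = urllib.parse.urlencode({"username": username, "password": password})
    # Supplying data makes urllib send a POST instead of a GET.
    return urllib.request.Request(login_url, data=form.encode("utf-8"))


def fetch_soup_after_login(login_url, page_url, username, password):
    """Log in, then fetch page_url with the same cookies and parse it."""
    # Imported here so the helpers above work even without bs4 installed.
    from bs4 import BeautifulSoup

    opener, _ = make_opener()
    # Submit the login form; the session cookie ends up in the cookie jar.
    opener.open(build_login_request(login_url, username, password)).close()
    # Later requests through the same opener send that cookie back.
    with opener.open(page_url) as response:
        return BeautifulSoup(response.read(), "html.parser")
```

A call would look like fetch_soup_after_login("https://id.arduino.cc/auth/login/", "http://forum.arduino.cc/index.php", "username", "password"), with real credentials substituted. The key design point is that the login request and the later page request go through the same opener, so the cookie set at login is reused.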

4d4c answered Sep 22 '22 20:09